It makes me think that what they're using for "backups" is actually volume snapshots.. but if you purge that volume, you purge the snapshots along with it.
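
If their "backups" really are snapshots that go away with the volume, even a dumb guard in the purge script would have helped. Here's a minimal sketch, assuming AWS EBS via boto3 (no idea what Atlassian actually runs on, so the names here are all mine), that refuses to delete a volume unless a recent, independent snapshot exists:

    import datetime
    import boto3

    ec2 = boto3.client("ec2")

    def delete_volume_safely(volume_id, max_snapshot_age_days=1):
        """Refuse to delete an EBS volume unless a recent completed snapshot exists."""
        snaps = ec2.describe_snapshots(
            Filters=[{"Name": "volume-id", "Values": [volume_id]}],
            OwnerIds=["self"],
        )["Snapshots"]
        cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(
            days=max_snapshot_age_days
        )
        recent = [s for s in snaps
                  if s["State"] == "completed" and s["StartTime"] >= cutoff]
        if not recent:
            raise RuntimeError(
                f"No completed snapshot of {volume_id} newer than {cutoff}; not deleting."
            )
        # EBS snapshots live in S3 and survive volume deletion, so at least
        # those would still exist after the purge. Whatever SAN/array-level
        # snapshots Atlassian uses may not behave that way.
        ec2.delete_volume(VolumeId=volume_id)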

As for people getting fired.. eh, nobody inadvertently got fed into an industrial grinding machine because someone turned off all the safeties.
People are just unable to look at the Jira kanban boards.. Ironically, the people who caused this are probably the ones best placed to fix it.

I once wiped out 20TB+ of production data. I could go into all the gory details, but suffice it to say I deleted a bunch of poorly labeled volumes on a SAN. All the same, I did delete the data, and I then had to fix the mistake; part of the process was writing up a full incident report and an RCA.

This looks like it was a major breakdown in process control and testing. This is exactly why you have pre-prod testing environments cloned from real data.. so when the script that was supposed to expand your EBS volumes does so by first destroying them and then recreating them larger, you find that out before it ever touches production.
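
For what it's worth, EBS has supported growing volumes in place for years, so a destroy-then-recreate expansion script shouldn't even be necessary. Rough sketch of the in-place approach (boto3 again; the function and parameter names are mine):

    import boto3

    ec2 = boto3.client("ec2")

    def expand_volume_in_place(volume_id, new_size_gib):
        """Grow an EBS volume without detaching, destroying, or recreating it."""
        current = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]["Size"]
        if new_size_gib <= current:
            raise ValueError(f"{volume_id} is already {current} GiB; nothing to expand.")
        ec2.modify_volume(VolumeId=volume_id, Size=new_size_gib)
        # The resize happens online; once the modification hits 'optimizing' or
        # 'completed' you grow the partition/filesystem on the instance
        # (growpart plus resize2fs or xfs_growfs).
        mods = ec2.describe_volumes_modifications(VolumeIds=[volume_id])
        return mods["VolumesModifications"][0]["ModificationState"]

Either way, that's exactly the kind of thing a pre-prod clone should have caught.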

I suspect this is going to be part of the lively conversation tonight!
Later,
Tim.


On Wed, Apr 13, 2022 at 4:00 PM John Stoffel <john@stoffel.org> wrote:
>>>>> "Jared" == Jared Greenwald via WLUG <wlug@lists.wlug.org> writes:

Jared> At least they're owning the mistake, which is something I guess... 
Jared> https://www.atlassian.com/engineering/april-2022-outage-update

I want people to carefully read these four paragraphs and ask
themselves: what is the difference between restoring one customer
quickly using a well-tested process, and restoring 400 customers?

    We also maintain immutable backups that are designed to be resilient
    against data corruption events, which enable recovery to a previous
    point in time. Backups are retained for 30 days, and Atlassian
    continuously tests and audits storage backups for restoration.

    Using these backups, we regularly roll-back individual customers or a
    small set of customers who accidentally delete their own data. And, if
    required, we can restore all customers at once into a new environment.

    What we have not (yet) automated is restoring a large subset of
    customers into our existing (and currently in use) environment without
    affecting any of our other customers.

    Within our cloud environment, each data store contains data from
    multiple customers. Because the data deleted in this incident was only
    a portion of data stores that are continuing to be used by other
    customers, we have to manually extract and restore individual pieces
    from our backups. Each customer site recovery is a lengthy and complex
    process, requiring internal validation and final customer verification
    when the site is restored.


These paragraphs are carefully written to imply one thing, but they
actually show that Atlassian has no system in place for proper
rollbacks when an entire customer gets purged.  I suspect that their
"nuke it all from orbit" script *also* nuked the immutable backups,
which is why this is all taking so long to restore.

I do hope the people who messed up aren't fired; if anyone should be
fired, it's the managers who set up this debacle.

John


--
I am leery of the allegiances of any politician who refers to their constituents as "consumers".