[WLUG] Re: Atlassian outage

April 13, 2022

"Jared" == Jared Greenwald via WLUG <wlug@lists.wlug.org> writes:
Jared> At least they're owning the mistake, which is something I guess... 
Jared> https://www.atlassian.com/engineering/april-2022-outage-update

I want people to carefully read these four paragraphs and ask
themselves: what is the difference between restoring one customer
quickly using a well-tested process and restoring 400 customers?

    We also maintain immutable backups that are designed to be resilient
    against data corruption events, which enable recovery to a previous
    point in time. Backups are retained for 30 days, and Atlassian
    continuously tests and audits storage backups for restoration.

    Using these backups, we regularly roll-back individual customers or a
    small set of customers who accidentally delete their own data. And, if
    required, we can restore all customers at once into a new environment.

    What we have not (yet) automated is restoring a large subset of
    customers into our existing (and currently in use) environment without
    affecting any of our other customers.

    Within our cloud environment, each data store contains data from
    multiple customers. Because the data deleted in this incident was only
    a portion of data stores that are continuing to be used by other
    customers, we have to manually extract and restore individual pieces
    from our backups. Each customer site recovery is a lengthy and complex
    process, requiring internal validation and final customer verification
    when the site is restored.

These paragraphs are carefully written to imply one thing, but they
show that Atlassian has no system in place for proper rollbacks when
an entire customer gets purged.  I suspect that their "nuke it all
from orbit" script *also* nuked the immutable backups, which is why
this is all taking so long to restore.

I do hope the people who messed up are not fired; if anyone is fired,
it should be the managers who set up this debacle.

John

John Stoffel