"Jared" == Jared Greenwald via WLUG <wlug@lists.wlug.org> writes:

Jared> At least they're owning the mistake, which is something I guess...
Jared> https://www.atlassian.com/engineering/april-2022-outage-update

I want people to carefully read these four paragraphs and ask
themselves: what is the difference between restoring one customer
quickly using a well-tested process, and restoring 400 customers?

    We also maintain immutable backups that are designed to be
    resilient against data corruption events, which enable recovery
    to a previous point in time. Backups are retained for 30 days,
    and Atlassian continuously tests and audits storage backups for
    restoration.

    Using these backups, we regularly roll-back individual customers
    or a small set of customers who accidentally delete their own
    data. And, if required, we can restore all customers at once into
    a new environment.

    What we have not (yet) automated is restoring a large subset of
    customers into our existing (and currently in use) environment
    without affecting any of our other customers.

    Within our cloud environment, each data store contains data from
    multiple customers. Because the data deleted in this incident was
    only a portion of data stores that are continuing to be used by
    other customers, we have to manually extract and restore
    individual pieces from our backups. Each customer site recovery
    is a lengthy and complex process, requiring internal validation
    and final customer verification when the site is restored.

These paragraphs are carefully written to imply one thing, but they
actually show that Atlassian has no system in place for proper
rollbacks when an entire customer gets purged.

I suspect that their "nuke it all from orbit" script *also* nuked the
immutable backups, which is why this is all taking so long to
restore.

I do hope the people who messed up are fired, and if anyone is fired
it's the managers who set up this debacle.

John