Atlassian outage


Tim Keller

April 13, 2022

2:43 p.m.

What do people think of the crazy Atlassian outage? They're talking about some customers still being offline for two weeks!

Later, Tim.

-- I am leery of the allegiances of any politician who refers to their constituents as "consumers".


Jon "maddog" Hall

April 2022

2:54 p.m.

With Easter approaching I might call it "All your eggs in one basket", and no matter how strong, big and well built that basket is, we can never be sure there is not an Easter Egg hiding in there, waiting for us.

Although Atlassian states it was not a Cyber Security issue, from a Cyber Security standpoint I will also point out that there are more and more foxes trying to get at the henhouse.

md

Tim Keller

3:24 p.m.

While they're claiming it's not a cyber crime thing, it looks like a single person was able to do this with a faulty script... that in itself is a bit distressing.

It sounds like they've got a serious process problem.

Tim.

Jon "maddog" Hall

3:53 p.m.

I consider a "faulty script" the worst type of cyber attack.

As a friend of mine once famously said, "I will replace you with a very small shell script."

They even made a T-shirt out of it.

https://www.teepublic.com/t-shirt/3810999-go-away-or-i-will-replace-you-with...

md

Jared Greenwald

4:05 p.m.

At least they're owning the mistake, which is something I guess...

https://www.atlassian.com/engineering/april-2022-outage-update

_______________________________________________
WLUG mailing list -- wlug@lists.wlug.org
To unsubscribe send an email to wlug-leave@lists.wlug.org
Create Account: https://wlug.mailman3.com/accounts/signup/
Change Settings: https://wlug.mailman3.com/postorius/lists/wlug.lists.wlug.org/
Web Forum/Archive: https://wlug.mailman3.com/hyperkitty/list/wlug@lists.wlug.org/message/26PO4E...

John Stoffel

8 p.m.

"Jared" == Jared Greenwald via WLUG <wlug@lists.wlug.org> writes:

Jared> At least they're owning the mistake, which is something I guess...
Jared> https://www.atlassian.com/engineering/april-2022-outage-update

I want people to carefully read these four paragraphs and ask themselves: what is the difference between restoring one customer quickly using a well-tested process, and restoring 400 customers?

> We also maintain immutable backups that are designed to be resilient against data corruption events, which enable recovery to a previous point in time. Backups are retained for 30 days, and Atlassian continuously tests and audits storage backups for restoration.
>
> Using these backups, we regularly roll-back individual customers or a small set of customers who accidentally delete their own data. And, if required, we can restore all customers at once into a new environment.
>
> What we have not (yet) automated is restoring a large subset of customers into our existing (and currently in use) environment without affecting any of our other customers.
>
> Within our cloud environment, each data store contains data from multiple customers. Because the data deleted in this incident was only a portion of data stores that are continuing to be used by other customers, we have to manually extract and restore individual pieces from our backups. Each customer site recovery is a lengthy and complex process, requiring internal validation and final customer verification when the site is restored.

These paragraphs are carefully written to imply one thing, but show that they have no system in place for proper rollbacks when an entire customer gets purged. I suspect that their "nuke it all from orbit" script *also* nuked the immutable backups, which is why this is all taking so long to restore.

I do hope the people who messed up are fired, and if anyone is fired it's the managers who set up this debacle.

John

Tim Keller

2:19 p.m.

It makes me think that what they're using for "backups" are actually snapshots of volumes. But if you purge that volume, you purge the snapshots with it.

As for people getting fired... Eh, nobody inadvertently got fed into an industrial grinding machine because someone turned off all the safeties. People are just unable to look at the Jira kanban boards. Ironically, the people who caused this are likely the ones most likely to fix it.

I once wiped out 20TB+ of production data. I could go into all the gory details, but suffice it to say I deleted a bunch of poorly labeled volumes on a SAN. All the same, I did delete the data, and I then had to fix the mistake; part of the process was me writing up a full incident report and an RCA.

This looks like it was a major breakdown in process control and testing. This is *exactly* why there are pre-prod testing environments that are cloned from real data: so when your script that was supposed to expand your EBS volumes does so by first destroying them and then creating them larger, you find that out.

I'm suspecting this is going to be part of the lively conversation tonight!

Later, Tim.

John Stoffel

7:53 p.m.

"Tim" == Tim Keller via WLUG <wlug@lists.wlug.org> writes:

Tim> What do people think of the crazy Atlassian outage? They're talking
Tim> about some customers still being offline for two weeks!

Live by the cloud, die by the cloud...

I think they seriously screwed up, which goes to show you that programming is 10% the work to be done, and 900% error, bounds and exception handling. All of which computers do poorly, but which we humans do reasonably well. But we're *slow* and the cloud is all about speed.

In this case, it also shows that Atlassian's DR practices were crap in a major way. And if I was self-hosting it internally, and knew that I was being forced to the cloud... I'd be yelling like hell at my sales rep and working to get off Atlassian products.

The cloud has some advantages, and is lovely if you have a workload that scales horizontally really well, and which also bursts up in terms of performance need. But when you have a fairly static demand, it's maybe not the best thing to do.

But ... the cloud sells itself on doing all those hard things well, like redundancy, resiliency, durability, etc.: dual-feed UPS, tested generators, reliable AC systems, and so on. All the things a good co-lo or self-hosting people handle to some level of execution. The cloud should (ideally) have all this down to a T. But ... when it doesn't, all hell breaks loose.

John
