Why did the AWS Outage hit the bigger companies so hard?

From my understanding, yesterday’s AWS outage affected the Virginia region (us-east-1). For our company, if a region goes down we shift traffic to other regions, as we’re running in multiple regions.
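A minimal sketch of the failover idea, with made-up endpoint names (in practice this usually lives in DNS health checks or a global load balancer rather than application code):

```python
import urllib.request

# Hypothetical per-region health endpoints for the same service.
REGION_ENDPOINTS = [
    "https://api.us-east-1.example.com/health",
    "https://api.eu-west-1.example.com/health",
    "https://api.ap-southeast-2.example.com/health",
]

def first_healthy_region(timeout=2):
    """Return the first region whose health check answers in time."""
    for url in REGION_ENDPOINTS:
        try:
            # urlopen raises on HTTP errors and timeouts alike
            with urllib.request.urlopen(url, timeout=timeout):
                return url
        except OSError:
            continue  # region unreachable, erroring, or timing out; try the next
    raise RuntimeError("no healthy region found")
```

The point is simply that no single region is hard-coded as the only place the service lives.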

For small companies it makes sense that they don’t run multi-region, because it adds at least the same cost again per region. For the larger companies that were affected, such as Disney and Amazon Logistics, is there a notable reason, or is it just cost and/or oversight?

You’d be surprised at just how janky a lot of large company networks/infrastructure is.

The bigger (and also older, as a company - with more legacy shit) you get, the more complex it gets, and sometimes what you think is a fully redundant failover process just doesn’t work - and you don’t find out until it actually happens.

I mean this is an anecdote, but in the past 10 years the company I work for has gone from, say… 3 IT staff to 15. We’ve gone from maybe 1-2 tier one applications to… I don’t even know how many (many developed in-house). From, say, 15 sites to… 60.

Nobody has the full picture in their head any more, and there’s way too much knowledge siloed within individual teams that never gets passed on to the people those teams don’t realise need it.

edit:
For scale, I work for a mining contractor doing 1.x to 2.x billion dollars annually. So not sure where that fits in the big vs. small thing :smiley:

Oh… and often the bigger you get, the more complicated change becomes. Just getting something changed involves steering committee meetings, change advisory board meetings, etc. Which means that fixing stuff that isn’t “broken” at the time you’re fixing it gets pushed back for the shiny new thing that some exec wants ASAP.

4 Likes

I don’t use AWS, but a friend said their company was in the process of moving to them, and apparently they were set up as multi-region but were still having issues. I didn’t bother to ask for details as I don’t like to pry, but it seems like even if you were multi-region it could still have affected you (they probably could have forced a failover manually, but idk what all services they used).

Could it be a “the bigger they are” effect?

As in, smaller companies get the web traffic hit, the user decides it’s not worth the wait and leaves. But for bigger ones like Netflix and Disney, the web traffic hits, fails, and the user tries again, and again, and again, because dammit they’re bored, and when Netflix doesn’t work, what’s on Disney…

Could it just be volume?

1 Like

Yep, I saw that earlier today and yep, it’s understandable that different companies use the same cloud hosting provider. Louis’ issue though was that multiple aspects of his company were all hosted on AWS (evidently in the same region too). It makes sense that a Magento store, phone system, etc. are single-region.

Thanks, all, for the replies. It seems daft that some larger corporations sound like they’re running on a single region, but it does make sense that failover isn’t smooth at that sheer volume. It was noted that Disney’s ticket sales were affected, but not for how long (afaik) - if it was 7 hrs that’s pretty bad, but if it was a short window that makes sense, allowing time for the backup to take over.
It’s hard to wrap my head around Amazon Logistics having issues due to a single-region outage on Amazon’s own cloud, though.

I’ve worked in several VERY large and very well-known companies where the failure of a single switch has caused a complete global outage even though the system was purportedly designed to be redundant. The most recent outage I was dragged into was a line that was flapping; the NOC guys ignored it for weeks because there was a failover to a secondary circuit. When the secondary circuit failed, the contingency put in place to handle both the primary and secondary going down also failed, because the firewall routes had not been implemented properly. Above our applications, at the NOC level, when they explain the spaghetti they have in place during outage calls, it can make you lose the will to live.

One other thing that can make things worse is resistance to making changes on the fly for fear of making the situation even worse. The NOC guys have to get all manner of approvals, from people who don’t have a clue, to make changes during an emergency.

-Vince

3 Likes

I’ve seen/experienced similar.

I had a Cisco bug in ARP cause spanning tree problems on a campus network.

Yes, spanning tree was broken due to a bug in FUCKING ARP on a CISCO IOS release. Not from 1990 either - 2016. End result: the redundant loop network put in to survive a cable break on one path caused problems itself.

Sometimes, all the complexity that goes into making something clustered or fault tolerant causes more hassles than it is worth.

Another example - I have redundant internet paths. I do policy-based routing by polling one of the paths to see if it is up and failing over to the other if not.

Had an outage the other day on one ISP, but the fault was upstream of my polling IP. End result: internet broken. “But we have PBR and are multi-homed?!”

Well yes but… not every failure mode is always considered :smiley:
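For anyone who hasn’t set this sort of thing up, the failover logic is roughly the following - a rough sketch in Python rather than router config, with documentation-range monitor IPs and Linux ping flags assumed:

```python
import subprocess

# Hypothetical monitor IPs, one per ISP. The catch from the outage above:
# if the fault is upstream of the address you poll, the probe keeps
# succeeding while the wider internet is unreachable.
PRIMARY_MONITOR = "203.0.113.1"     # on ISP A's edge
SECONDARY_MONITOR = "198.51.100.1"  # on ISP B's edge

def path_is_up(ip):
    """Ping the monitor IP; treat any reply as 'this path is fine'."""
    result = subprocess.run(
        ["ping", "-c", "3", "-W", "1", ip],  # 3 probes, 1-second wait each
        capture_output=True,
    )
    return result.returncode == 0

def pick_default_route():
    # Prefer ISP A while its monitor answers, otherwise fail over to ISP B.
    return "ISP_A" if path_is_up(PRIMARY_MONITOR) else "ISP_B"
```

Probing something further upstream, or several targets, narrows the blind spot, but you’re still guessing in advance about where the faults will happen.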

Had another outage with another ISP due to them having a bit of a BGP fuckup. BGP in theory makes internet routing fail over and all that, right? Well yeah… but it’s not exactly straightforward :slight_smile:

Sometimes keeping things simpler (and manually moving the damn cable or whatever) is best. It’s certainly something you can guide the PFY*** through over the phone when you’re on holiday.

edit:
*** PFY (disambiguation) - Wikipedia

2 Likes

Even leaner, more agile companies contend with cascading failures.

During engineering, it’s entirely possible to test various services against likely failures. You release the Chaos Monkey and it starts simulating failures. Your ops team writes playbooks for the known failure cases, and you move on to the next project.
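A toy version of that kind of fault injection, just to make the idea concrete (names are made up; the real Chaos Monkey works at the infrastructure level, not in application code):

```python
import random

def chaos(call, failure_rate=0.1):
    """Wrap a service call and randomly turn it into a hard failure.
    Note what this does NOT simulate: the call succeeding, but slowly."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected failure")
        return call(*args, **kwargs)
    return wrapped

# Hypothetical usage: fetch_user = chaos(fetch_user, failure_rate=0.05)
```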

Then AWS goes out, and you start getting weird failure modes you didn’t even think to test for. You ripped the power cord out of the authentication system to simulate a failure, but you never tested the service being responsive but slow and overwhelmed by tens of thousands of auth requests that hang rather than fail.
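That slow-but-alive mode is the one the power-cord test never exercises. A minimal sketch of the difference, with a made-up auth endpoint:

```python
import urllib.request

AUTH_URL = "https://auth.internal.example.com/token"  # hypothetical service

def get_token_naive():
    # A dead service (connection refused) errors out in milliseconds, which
    # is all the pulled-power-cord test ever showed. With no timeout, a
    # service that accepts the connection and then never answers leaves
    # this call hanging, and every hung caller keeps holding a thread.
    return urllib.request.urlopen(AUTH_URL).read()

def get_token_fail_fast():
    # Bounding the wait turns "overwhelmed and slow" into a quick error
    # the caller can actually handle.
    try:
        return urllib.request.urlopen(AUTH_URL, timeout=2).read()
    except OSError:
        return None  # degrade: cached credentials, queue a retry, etc.
```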

This causes new, unexpected failures in other microservices. Your key-value nodes can handle a power-out event, but not the memory exhaustion when all of those opened sessions pile up.
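Same pattern there: the nodes were tested against losing power, but nothing caps how many half-finished sessions accumulate while upstream calls hang. A sketch of the sort of bound that keeps that from tipping into memory exhaustion (the cap is an arbitrary made-up number):

```python
from collections import OrderedDict

class BoundedSessionStore:
    """Cap the number of live sessions so a flood of hung logins degrades
    into evicted sessions instead of an out-of-memory crash."""

    def __init__(self, max_sessions=100_000):  # arbitrary illustrative cap
        self.max_sessions = max_sessions
        self._sessions = OrderedDict()

    def put(self, session_id, data):
        self._sessions[session_id] = data
        self._sessions.move_to_end(session_id)
        while len(self._sessions) > self.max_sessions:
            self._sessions.popitem(last=False)  # evict least recently used

    def get(self, session_id):
        data = self._sessions.get(session_id)
        if data is not None:
            self._sessions.move_to_end(session_id)
        return data
```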

The solution to this is to design simpler systems with fewer failure modes, but engineering culture in blitzscaled companies works against this.

2 Likes

This outage affected new and changing resources only. Companies constantly recycling Kubernetes pods and nodes were hit hard.

This also meant they couldn’t update DNS records to fail over if they weren’t already set up properly.
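“Set up properly” here means the failover records and health checks already exist before the outage, because creating them mid-incident is exactly the kind of change that was failing. A hedged boto3 sketch of pre-provisioned DNS failover (zone ID, health check ID, and hostnames are all placeholders):

```python
import boto3

route53 = boto3.client("route53")

ZONE_ID = "Z0000000EXAMPLE"                           # placeholder hosted zone
PRIMARY_HC = "11111111-2222-3333-4444-555555555555"   # placeholder health check

def failover_record(role, target, health_check_id=None):
    rrset = {
        "Name": "api.example.com",
        "Type": "CNAME",
        "SetIdentifier": role.lower(),
        "Failover": role,            # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        rrset["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rrset}

# Provision both halves ahead of time. DNS then flips to the secondary on
# its own when the primary health check fails, with no API call needed in
# the middle of the outage.
route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={"Changes": [
        failover_record("PRIMARY", "lb.us-east-1.example.com", PRIMARY_HC),
        failover_record("SECONDARY", "lb.us-west-2.example.com"),
    ]},
)
```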

Add on to that that the global AWS console goes through US-EAST-1 and it makes for a fun day.

My company treats k8s pods and nodes much less disposably than others do, so while we were not able to make changes, we were not dead in the water.

For what it’s worth, I believe this outage was largely avoidable, at least at the scale we saw. But humans are computationally flawed even if computers are computationally perfect, and some poor meatbag is likely getting fired for this, which is sad.

4 Likes

This… coupled with a generation of new developers who are afraid of hardware, and seemingly afraid of code as well, so they just mash up frameworks and libraries, producing cloud applications with thousands of (poorly documented or completely undocumented) dependencies… and all it takes is for one of those to depend on AWS, and the whole thing breaks when AWS does.

1 Like

If only you knew how bad things really are

2 Likes