I’m not going to spend a lot of time on Amazon’s outage yesterday. There are the usual cliches about resilience, multi-cloud architectures and the like, and there are the jokes about it always being DNS (which it was) that I made with friends.
I do want to amplify an article in the Register entitled “Today is when Amazon brain drain finally caught up with AWS”. Reports indicate that more than 27,000 Amazon employees were affected by layoffs between 2022 and 2024, which means a significant loss of institutional knowledge. Internal documents reportedly put “regretted attrition” (departures of people the company wished had stayed) at between 69 and 81 percent, highlighting the challenges associated with the Return to Office initiative. As a result, the remaining teams may struggle to respond effectively to incidents, potentially leading to further outages in the future.
Quoting the article: “AWS is very, very good at infrastructure. You can tell this is a true statement by the fact that a single one of their 38 regions going down (albeit a very important region!) causes this kind of attention, as opposed to it being “just another Monday outage.” At AWS’s scale, all of their issues are complex; this isn’t going to be a simple issue that someone should have caught, just because they’ve already hit similar issues years ago and ironed out the kinks in their resilience story.
Once you reach a certain point of scale, there are no simple problems left. What’s more concerning to me is the way it seems AWS has been flailing all day trying to run this one to ground. Suddenly, I’m reminded of something I had tried very hard to forget.”
And this: “When that tribal knowledge departs, you’re left having to reinvent an awful lot of in-house expertise that didn’t want to participate in your RTO games, or play Layoff Roulette yet again this cycle. This doesn’t impact your service reliability — until one day it very much does, in spectacular fashion. I suspect that day is today.”
Why do we care?
I’m not gonna dwell on Amazon’s outage — yeah, it was DNS, we’ve all heard the jokes.
What matters is why it took so long to fix. The Register nailed it: Amazon’s been cutting people and pushing folks back to the office, and when you lose that much talent, you lose the tribal knowledge that keeps complex systems running.
AWS isn’t failing because they don’t have redundancy. They’re recovering more slowly because the people who knew the weird edge cases (the “why this switch shouldn’t be touched” folks) are gone. That’s the real risk.
So here’s the lesson: human capital is part of your infrastructure. If AWS can stumble because their experts left, so can you. Document your systems. Cross-train your techs. Test your failovers — and don’t buy the marketing that says automation fixes everything.
When things break — and they will — what matters is how fast you can respond. And that still depends on people who know how the system really works.

