AWS Stories
AWS powers much of the internet, and its failure modes are endlessly varied. These stories, from engineers operating on AWS at scale, cover outages, misconfigurations, cost explosions, and hard-won recoveries.
5 stories
The Empty DNS Record That Took Down 70 AWS Services for 14 Hours
“We operate one of the largest cloud infrastructure platforms in the world, running hundreds of interdependent services across dozens of regions. Our DynamoDB service in us-east-1 —...”
Untangling Terraform State Drift 30 Minutes Before a Board Demo
“At 1:30 PM, we had exactly the wrong kind of change window. We were an infrastructure tooling company; our production stack lived in Terraform with an S3 backend and DynamoDB locki...”
How We Moved 200 Services off Jenkins in About 3 Months
“I owned the least glamorous part of platform engineering: keeping an aging Jenkins estate alive while the rest of the company added more services every quarter. By the time we star...”
Black Friday, One Missing Index, and 53 Minutes of Checkout Pain
“I was the primary on-call engineer for a mid-size e-commerce company doing roughly 50k orders a day outside of peak season. Checkout lived in a Node.js monolith on ECS with Postgre...”
How We Built a Production-Grade AWS Infrastructure from Scratch in 6 Weeks — as a Team of Two
“We were 14 months into building a B2B document intelligence platform for legal teams. Our entire infrastructure was a single $48/mo DigitalOcean VPS — one box, manually SSHed into,...”