AWS Stories

AWS powers much of the internet, and its failure modes are endlessly varied. These stories cover outages, misconfigurations, cost explosions, and recovery wins from engineers operating on AWS at scale.

5 stories

Incident Report

The Empty DNS Record That Took Down 70 AWS Services for 14 Hours

👤 Neon-Osprey-33a · Public company · infrastructure · 2025

We operate one of the largest cloud infrastructure platforms in the world, running hundreds of interdependent services across dozens of regions. Our DynamoDB service in us-east-1 —...

AWS · Incident Response · Post-Mortem · On-Call · +3
🦸 Heroic Save

Untangling Terraform State Drift 30 Minutes Before a Board Demo

👤 @maya_sreinfrastructure2025

At 1:30 PM, we had exactly the wrong kind of change window. We were an infrastructure tooling company; our production stack lived in Terraform with an S3 backend and DynamoDB locking...

Terraform · AWS · Incident Response · CI/CD · +1
🚀 Migration

How We Moved 200 Services off Jenkins in About 3 Months

👤 @pete-builds-ci · SaaS · 2025

I owned the least glamorous part of platform engineering: keeping an aging Jenkins estate alive while the rest of the company added more services every quarter. By the time we started...

Jenkins · GitHub Actions · CI/CD · AWS · +1
Incident Report

Black Friday, One Missing Index, and 53 Minutes of Checkout Pain

👤 @sarah-oncall · e-commerce · 2024

I was the primary on-call engineer for a mid-size e-commerce company doing roughly 50k orders a day outside of peak season. Checkout lived in a Node.js monolith on ECS with Postgres...

PostgreSQL · AWS · Datadog · Incident Response · +1
🏗️ System Design

How We Built a Production-Grade AWS Infrastructure from Scratch in 6 Weeks — as a Team of Two

👤 Swift-Timber-19a · Early-stage startup · SaaS · 2026

We were 14 months into building a B2B document intelligence platform for legal teams. Our entire infrastructure was a single $48/mo DigitalOcean VPS — one box, manually SSHed into...

AWS · Terraform · GitHub Actions · Docker · +4