Incident Response Stories
How teams actually respond when production breaks: the timeline, the decisions, the communication, and the lessons. These are first-hand accounts from the engineers who were on call.
9 stories
The Wrong Host Deleted Our Primary Postgres and Exposed Every Backup Assumption We Had
“We were a fast-growing Series B company running a hosted Git collaboration and CI platform for millions of users. In early 2017, our production database design was still painfully ...”
How a Storage Security Policy Broke VM Provisioning Across Azure and GitHub Worldwide
“I work on cloud control-plane infrastructure that provisions virtual machines, scale sets, Kubernetes nodes, and the supporting identity and extension systems around them. One of t...”
How a Database Permissions Change Doubled a Feature File and Took Down a Global CDN for Six Hours
“We run one of the largest edge networks in the world — millions of requests per second, across hundreds of data centers in over 100 countries. Our network sits between users and th...”
How a Missing .npmignore Entry Leaked 512,000 Lines of Claude Code Source to the World
“We maintained the release pipeline for Claude Code, Anthropic's flagship AI coding CLI distributed as an npm package (@anthropic-ai/claude-code). The tool had grown rapidly to beco...”
The Empty DNS Record That Took Down 70 AWS Services for 14 Hours
“We operate one of the largest cloud infrastructure platforms in the world, running hundreds of interdependent services across dozens of regions. Our DynamoDB service in us-east-1 —...”
Two Silent Consul Bugs That Took Down a Gaming Platform for 73 Hours
“We run a gaming platform with 50 million daily active players, 18,000+ servers, and 170,000 containers. Our entire infrastructure — service discovery, container orchestration, secr...”
Untangling Terraform State Drift 30 Minutes Before a Board Demo
“At 1:30 PM, we had exactly the wrong kind of change window. We were an infrastructure tooling company; our production stack lived in Terraform with an S3 backend and DynamoDB locki...”
What Changed When We Finally Put a Real On-Call Rotation in Place
“For our first year, "on-call" meant the founder who happened to still have Slack open. We were a seven-person engineering team, we deployed straight from main, and we wore our lack...”
Black Friday, One Missing Index, and 53 Minutes of Checkout Pain
“I was the primary on-call engineer for a mid-size e-commerce company doing roughly 50k orders a day outside of peak season. Checkout lived in a Node.js monolith on ECS with Postgre...”