On-Call Stories
On-call is where processes are tested under pressure. These stories cover how teams built rotations, survived their first incidents, and changed their culture around operational readiness.
5 stories
The Wrong Host Deleted Our Primary Postgres and Exposed Every Backup Assumption We Had
“We were a fast-growing Series B company running a hosted Git collaboration and CI platform for millions of users. In early 2017, our production database design was still painfully ...”
How a Storage Security Policy Broke VM Provisioning Across Azure and GitHub Worldwide
“I work on cloud control-plane infrastructure that provisions virtual machines, scale sets, Kubernetes nodes, and the supporting identity and extension systems around them. One of t...”
The Empty DNS Record That Took Down 70 AWS Services for 14 Hours
“We operate one of the largest cloud infrastructure platforms in the world, running hundreds of interdependent services across dozens of regions. Our DynamoDB service in us-east-1 —...”
Two Silent Consul Bugs That Took Down a Gaming Platform for 73 Hours
“We run a gaming platform with 50 million daily active players, 18,000+ servers, and 170,000 containers. Our entire infrastructure — service discovery, container orchestration, secr...”
What Changed When We Finally Put a Real On-Call Rotation in Place
“For our first year, "on-call" meant the founder who happened to still have Slack open. We were a seven-person engineering team, we deployed straight from main, and we wore our lack...”