On-Call Stories

On-call is where processes are tested under pressure. These stories cover how teams built rotations, survived their first incidents, and changed their culture around operational readiness.

5 stories

Incident Report

The Wrong Host Deleted Our Primary Postgres and Exposed Every Backup Assumption We Had

👤 Crimson-Beacon-87a Series B company SaaS 2017

We were a fast-growing Series B company running a hosted Git collaboration and CI platform for millions of users. In early 2017, our production database design was still painfully ...

Azure, PostgreSQL, Incident Response, Post-Mortem, +2

Incident Report

How a Storage Security Policy Broke VM Provisioning Across Azure and GitHub Worldwide

👤 Electric-Beacon-41a Public company infrastructure 2026

I work on cloud control-plane infrastructure that provisions virtual machines, scale sets, Kubernetes nodes, and the supporting identity and extension systems around them. One of t...

Azure, Incident Response, Post-Mortem, On-Call, +4

Incident Report

The Empty DNS Record That Took Down 70 AWS Services for 14 Hours

👤 Neon-Osprey-33a Public company infrastructure 2025

We operate one of the largest cloud infrastructure platforms in the world, running hundreds of interdependent services across dozens of regions. Our DynamoDB service in us-east-1 —...

AWS, Incident Response, Post-Mortem, On-Call, +3

Incident Report

Two Silent Consul Bugs That Took Down a Gaming Platform for 73 Hours

👤 Silent-Quartz-59a Public company gaming 2021

We run a gaming platform with 50 million daily active players, 18,000+ servers, and 170,000 containers. Our entire infrastructure — service discovery, container orchestration, secr...

Incident Response, Post-Mortem, On-Call, Monitoring, +2

🔄 Culture Change

What Changed When We Finally Put a Real On-Call Rotation in Place

👤 @sam-runs-prod SaaS 2023

For our first year, "on-call" meant the founder who happened to still have Slack open. We were a seven-person engineering team, we deployed straight from main, and we wore our lack...

PagerDuty, Datadog, On-Call, Incident Response, +1