Monitoring Stories

You cannot fix what you cannot see. These stories explore gaps in observability, alert fatigue, dashboards that lied, and the times visibility made the difference between a near-miss and a full outage.

8 stories

Incident Report

The Wrong Host Deleted Our Primary Postgres and Exposed Every Backup Assumption We Had

👤 Crimson-Beacon-87a · Series B company · SaaS · 2017

We were a fast-growing Series B company running a hosted Git collaboration and CI platform for millions of users. In early 2017, our production database design was still painfully ...

Azure · PostgreSQL · Incident Response · Post-Mortem · +2
Incident Report

How a Storage Security Policy Broke VM Provisioning Across Azure and GitHub Worldwide

👤 Electric-Beacon-41a · Public company · infrastructure · 2026

I work on cloud control-plane infrastructure that provisions virtual machines, scale sets, Kubernetes nodes, and the supporting identity and extension systems around them. One of t...

Azure · Incident Response · Post-Mortem · On-Call · +4
Incident Report

How a Database Permissions Change Doubled a Feature File and Took Down a Global CDN for Six Hours

👤 Storm-Anchor-47a · Public company · infrastructure · 2025

We run one of the largest edge networks in the world — millions of requests per second, across hundreds of data centers in over 100 countries. Our network sits between users and th...

Nginx · Linux · Incident Response · Post-Mortem · +4
Incident Report

The Empty DNS Record That Took Down 70 AWS Services for 14 Hours

👤 Neon-Osprey-33a · Public company · infrastructure · 2025

We operate one of the largest cloud infrastructure platforms in the world, running hundreds of interdependent services across dozens of regions. Our DynamoDB service in us-east-1 —...

AWS · Incident Response · Post-Mortem · On-Call · +3
Incident Report

Two Silent Consul Bugs That Took Down a Gaming Platform for 73 Hours

👤 Silent-Quartz-59a · Public company · gaming · 2021

We run a gaming platform with 50 million daily active players, 18,000+ servers, and 170,000 containers. Our entire infrastructure — service discovery, container orchestration, secr...

Incident Response · Post-Mortem · On-Call · Monitoring · +2
😰 Near-Miss

The CronJob That Quietly Saturated Our Kubernetes Cluster

👤 @kim-on-platform · fintech · 2024

This one started as a harmless cleanup task. We had moved our payments platform from Heroku to GKE about six months earlier and were still learning which safety rails Kubernetes gi...

Kubernetes · GCP · Prometheus · Grafana · +2
Incident Report

Black Friday, One Missing Index, and 53 Minutes of Checkout Pain

👤 @sarah-oncall · e-commerce · 2024

I was the primary on-call engineer for a mid-size e-commerce company doing roughly 50k orders a day outside of peak season. Checkout lived in a Node.js monolith on ECS with Postgre...

PostgreSQL · AWS · Datadog · Incident Response · +1
🏗️ System Design

How We Built a Production-Grade AWS Infrastructure from Scratch in 6 Weeks — as a Team of Two

👤 Swift-Timber-19a · Early-stage startup · SaaS · 2026

We were 14 months into building a B2B document intelligence platform for legal teams. Our entire infrastructure was a single $48/mo DigitalOcean VPS — one box, manually SSHed into,...

AWS · Terraform · GitHub Actions · Docker · +4