Stories
13 stories from the trenches
A Five-Person Startup Used Kubernetes for the Boring Parts and Built a Separate Control Plane for Everything Humans Waited On
“We were a five-person startup building a platform for browser-based applications that give each user a dedicated backend process. That meant we had two very different kinds of oper...”
The Wrong Host Deleted Our Primary Postgres and Exposed Every Backup Assumption We Had
“We were a fast-growing Series B company running a hosted Git collaboration and CI platform for millions of users. In early 2017, our production database design was still painfully ...”
How a Storage Security Policy Broke VM Provisioning Across Azure and GitHub Worldwide
“I work on cloud control-plane infrastructure that provisions virtual machines, scale sets, Kubernetes nodes, and the supporting identity and extension systems around them. One of t...”
How a Database Permissions Change Doubled a Feature File and Took Down a Global CDN for Six Hours
“We run one of the largest edge networks in the world — millions of requests per second, across hundreds of data centers in over 100 countries. Our network sits between users and th...”
How a Missing .npmignore Entry Leaked 512,000 Lines of Claude Code Source to the World
“We maintained the release pipeline for Claude Code, Anthropic's flagship AI coding CLI distributed as an npm package (@anthropic-ai/claude-code). The tool had grown rapidly to beco...”
The Empty DNS Record That Took Down 70 AWS Services for 14 Hours
“We operate one of the largest cloud infrastructure platforms in the world, running hundreds of interdependent services across dozens of regions. Our DynamoDB service in us-east-1 —...”
Two Silent Consul Bugs That Took Down a Gaming Platform for 73 Hours
“We run a gaming platform with 50 million daily active players, 18,000+ servers, and 170,000 containers. Our entire infrastructure — service discovery, container orchestration, secr...”
Untangling Terraform State Drift 30 Minutes Before a Board Demo
“At 1:30 PM, we had exactly the wrong kind of change window. We were an infrastructure tooling company, our production stack lived in Terraform with an S3 backend and DynamoDB locki...”
What Changed When We Finally Put a Real On-Call Rotation in Place
“For our first year, ‘on-call’ meant the founder who happened to still have Slack open. We were a seven-person engineering team, we deployed straight from main, and we wore our lack...”
The CronJob That Quietly Saturated Our Kubernetes Cluster
“This one started as a harmless cleanup task. We had moved our payments platform from Heroku to GKE about six months earlier and were still learning which safety rails Kubernetes gi...”
How We Moved 200 Services off Jenkins in About 3 Months
“I owned the least glamorous part of platform engineering: keeping an aging Jenkins estate alive while the rest of the company added more services every quarter. By the time we star...”
Black Friday, One Missing Index, and 53 Minutes of Checkout Pain
“I was the primary on-call engineer for a mid-size e-commerce company doing roughly 50k orders a day outside of peak season. Checkout lived in a Node.js monolith on ECS with Postgre...”