🔄 Culture Change

What Changed When We Finally Put a Real On-Call Rotation in Place

👤 @sam-runs-prod · SaaS · < 10 engineers · 2023

01 · The Setup

For our first year, "on-call" meant the founder who happened to still have Slack open. We were a seven-person engineering team, we deployed straight from main, and we wore our lack of process like it was proof we were moving fast. That mostly worked when we had a handful of customers and every problem came from someone we knew by first name. It stopped working once we started signing bigger accounts with actual uptime expectations.

02 · What Happened

The turning point was not a spectacular outage. It was a boring, embarrassing Friday-night connection leak that degraded one enterprise tenant for roughly four hours. Nobody noticed internally. Their ops lead emailed support more than once before we responded the next morning, and Monday's renewal call turned into a conversation about whether we were ready to support them at all. That was the moment we stopped debating whether formal on-call would slow us down.
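
The bug itself was nothing exotic. We have never published the exact code, but the classic shape of a pool-connection leak looks something like the sketch below (hypothetical code, not ours): a connection gets checked out, the query raises, and the connection never goes back, so the pool drains a little more with every failed request.

```python
# Hypothetical sketch of the failure mode, not our actual code: a fixed-size
# connection pool drains because an error path never returns the connection.
import queue
from contextlib import contextmanager

pool = queue.Queue(maxsize=5)
for i in range(5):
    pool.put(f"conn-{i}")  # stand-ins for real database connections

def do_query(conn, payload):
    # Placeholder for the real database call.
    if payload.get("bad"):
        raise RuntimeError("query failed")
    return f"{conn}: ok"

def handle_request_leaky(payload):
    conn = pool.get(timeout=1)        # check a connection out of the pool
    result = do_query(conn, payload)  # if this raises, conn never goes back
    pool.put(conn)
    return result

@contextmanager
def checkout():
    conn = pool.get(timeout=1)
    try:
        yield conn
    finally:
        pool.put(conn)  # returned even when the query raises

def handle_request_fixed(payload):
    with checkout() as conn:
        return do_query(conn, payload)
```

After five failed requests, the leaky version starts timing out on an empty pool, which is roughly the slow, quiet degradation one tenant gets to experience while nobody internally notices.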

03 · Timeline

Week 1 - Three senior engineers take the first rotation and write the first runbooks
Month 2 - Datadog APM, basic SLOs, and a dedicated incidents Slack channel go live (the SLO arithmetic is sketched below)
Month 3 - Junior engineers shadow on-call instead of carrying the pager alone
Month 4 - We add a weekly stipend and formal service ownership
Month 6 - MTTR is down from roughly 3 hours to under 30 minutes
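
The SLOs did not need to be sophisticated to be useful. As a rough sketch of the underlying arithmetic (our real monitors live in Datadog; the 99.9% target here is a made-up example, not our actual number), an availability SLO and its error budget come down to a couple of lines:

```python
# Illustrative SLO arithmetic only; the monitors that act on it live in
# Datadog, and the target below is an example rather than our real SLO.
SLO_TARGET = 0.999               # fraction of requests that must succeed
ERROR_BUDGET = 1 - SLO_TARGET    # 0.1% of requests are allowed to fail

def budget_burned(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget consumed in the current window."""
    if total_requests == 0:
        return 0.0
    error_rate = failed_requests / total_requests
    return error_rate / ERROR_BUDGET

# 6,000 failures out of 2M requests is a 0.3% error rate: 3x the budget,
# which is exactly the kind of threshold that should page someone.
print(budget_burned(2_000_000, 6_000))  # 3.0
```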

04 · The Resolution

We started small: three senior engineers on rotation, shadow weeks for newer folks, a short incident template, and runbooks for the things that actually woke us up. Six months later, median time to acknowledge was under 10 minutes and MTTR was comfortably below half an hour. The surprise was that shipping got easier, not harder. Once people trusted that incidents would be noticed and owned, deploys stopped feeling like a bet with customer trust.
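
Neither number requires fancy tooling once incidents carry timestamps. A minimal sketch of the arithmetic, assuming each incident records when it was detected, acknowledged, and resolved (the field names and sample data here are hypothetical, not our incident schema):

```python
# Minimal sketch of how the two headline numbers are derived; the field
# names and sample data are hypothetical.
from datetime import datetime
from statistics import mean, median

incidents = [
    {"detected": datetime(2023, 9, 1, 2, 10),
     "acknowledged": datetime(2023, 9, 1, 2, 17),
     "resolved": datetime(2023, 9, 1, 2, 38)},
    {"detected": datetime(2023, 9, 14, 14, 0),
     "acknowledged": datetime(2023, 9, 14, 14, 6),
     "resolved": datetime(2023, 9, 14, 14, 25)},
]

def minutes_between(start, end):
    return (end - start).total_seconds() / 60

ack_times = [minutes_between(i["detected"], i["acknowledged"]) for i in incidents]
recovery_times = [minutes_between(i["detected"], i["resolved"]) for i in incidents]

print(f"median time to acknowledge: {median(ack_times):.1f} min")
print(f"MTTR (mean time to recover): {mean(recovery_times):.1f} min")
```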

Lessons · What We Learned

01 · Formal on-call does not have to mean heavy process if you keep the rituals short and useful.

02 · Start the pager with engineers who can actually resolve incidents, then add shadowing.

03 · Service ownership matters as much as the rotation itself; otherwise every page becomes a routing problem.

What I'd Do Differently

I would have defined service ownership before introducing the rotation. Early on, pages bounced around because "the backend" belonged to everyone and therefore to no one. Once every critical system had a clear primary owner, the pager got much less chaotic.
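
Concretely, the fix can be as small as a checked-in ownership map that paging consults, so no page ever lands on "the backend." A hypothetical sketch (the service names and handles are invented):

```python
# Hypothetical ownership map; services and handles are invented. The point
# is that every page resolves to exactly one primary owner.
OWNERS = {
    "billing-api":   {"primary": "@alice", "escalation": "@oncall-platform"},
    "ingest-worker": {"primary": "@bob",   "escalation": "@oncall-platform"},
}

def route_page(service: str) -> str:
    owner = OWNERS.get(service)
    if owner is None:
        # An unowned service is itself a bug: page the default rotation
        # and file a task to claim the service, don't just drop the page.
        return "@oncall-platform"
    return owner["primary"]

print(route_page("billing-api"))   # @alice
print(route_page("mystery-cron"))  # @oncall-platform, plus a follow-up task
```

The escalation entry matters too: the primary owner is the default destination for a page, not the only one.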