⚡ Incident Report · 👤 Crimson-Beacon-87a · Series B SaaS company · 100-500 engineers · 2017
01 The Setup
We were a fast-growing Series B company running a hosted Git collaboration and CI platform for millions of users. In early 2017, our production database design was still painfully simple: one PostgreSQL primary on `db1.cluster.gitlab.com` and one hot-standby secondary on `db2.cluster.gitlab.com`. The standby existed mostly for failover, not for read scaling or point-in-time recovery, and we were not using WAL archiving. Our safety story looked better on paper than it did in reality. We had daily `pg_dump` backups that were supposed to land in S3, daily LVM snapshots that were copied into staging so we could test against fresh production data, and Azure disk snapshots for some parts of the fleet. We were also experimenting with pgpool-II in staging because the single primary was already a known bottleneck. The problem was that most of these recovery paths had not been tested end to end under real pressure.
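To make the replication side of that picture concrete, this is roughly the check one would run against the primary to see the settings that mattered later; the host name is ours, the user and exact query shape are illustrative:

```bash
# Hypothetical inspection of the primary's replication-related settings.
# The host name comes from our setup; the user is illustrative.
psql -h db1.cluster.gitlab.com -U postgres -c "
  SELECT name, setting
    FROM pg_settings
   WHERE name IN ('wal_level', 'archive_mode', 'max_wal_senders',
                  'wal_keep_segments', 'hot_standby');"
```

On our setup the answers amounted to: streaming replication on, WAL archiving off, and a limited window of retained WAL segments.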
02 What Happened
At about 17:20 UTC on January 31, 2017, one of us took a fresh LVM snapshot of production and loaded it into staging so we could do more realistic PostgreSQL load-balancing tests. That manual snapshot would later become the only reason we did not lose a full day of data.
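For readers unfamiliar with that workflow, a rough sketch of how such a snapshot is typically taken and shipped; the volume group, logical volume, mount point, and staging host names are made up for illustration:

```bash
# Illustrative LVM snapshot of the database volume; VG/LV names, the
# snapshot size, and the staging destination are hypothetical.
lvcreate --snapshot --size 50G --name pgdata-snap /dev/vg0/pgdata
mkdir -p /mnt/pgdata-snap
mount -o ro /dev/vg0/pgdata-snap /mnt/pgdata-snap
rsync -a /mnt/pgdata-snap/ staging-db:/var/lib/postgresql/9.6/main/
umount /mnt/pgdata-snap
lvremove -f /dev/vg0/pgdata-snap
```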
Around 19:00 UTC, production database load started climbing. Users were having trouble posting comments on issues and merge requests. We initially treated it as an abuse and capacity problem because spam traffic was hammering the service. During the investigation we blocked abusive IPs, removed spam accounts, and also removed a user account that was using a repository as a CDN. Later we learned the load spike had a second contributor: a background job was trying to hard-delete a real employee account and all of its associated data after that account had been incorrectly flagged for abuse. Two unrelated bad inputs hit the same database at the same time.
By roughly 23:00 UTC, the secondary had fallen far enough behind that replication effectively stopped. The primary had already removed WAL segments the secondary still needed, and because we were not archiving WAL, the replica could not catch up on its own. The recovery plan was manual: wipe the secondary data directory and rebuild it with `pg_basebackup`.
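In outline, that rebuild looks something like the sketch below. The primary host name is from our setup; the paths, service name, and replication user are illustrative, and moving the old directory aside is a safer variant than deleting it outright:

```bash
# On the SECONDARY (db2) only: rebuild the replica from the primary.
# Paths, service name, and replication user are illustrative.
sudo systemctl stop postgresql
sudo mv /var/lib/postgresql/9.6/main /var/lib/postgresql/9.6/main.broken
sudo -u postgres pg_basebackup \
  -h db1.cluster.gitlab.com -U replication \
  -D /var/lib/postgresql/9.6/main \
  -X stream -R --verbose
sudo systemctl start postgresql
```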
That is where the incident shifted from painful to catastrophic. We deleted the secondary's PostgreSQL data directory as intended and started `pg_basebackup`, but the tool appeared to hang. Even with `--verbose`, it gave us almost nothing useful. After several attempts, it complained that `max_wal_senders` on the primary was too low. We raised `max_wal_senders` from `3` to `32`, only to have PostgreSQL refuse to restart because our `max_connections` setting of `8000` triggered semaphore issues. We dropped `max_connections` to `2000`, restarted PostgreSQL, and tried again. `pg_basebackup` still looked stuck. Running it under `strace` showed it waiting in `poll`, which told us very little.
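In hindsight, two small things would have made that stretch less opaque. A hedged sketch of the adjustments described above, using the values from the report; the exact commands are illustrative, and on 9.6 both parameters require a restart:

```bash
# On the PRIMARY: inspect and raise the limits that blocked pg_basebackup.
# ALTER SYSTEM is one way to do it; both settings need a restart on 9.6.
psql -U postgres -c "SHOW max_wal_senders;"    # was 3
psql -U postgres -c "SHOW max_connections;"    # was 8000
psql -U postgres -c "ALTER SYSTEM SET max_wal_senders = 32;"
psql -U postgres -c "ALTER SYSTEM SET max_connections = 2000;"
sudo systemctl restart postgresql

# Back on the secondary: asking for progress output and an immediate
# checkpoint makes the initial wait (typically a checkpoint on the
# primary) visible instead of looking like a hang.
sudo -u postgres pg_basebackup -h db1.cluster.gitlab.com -U replication \
  -D /var/lib/postgresql/9.6/main -X stream -R \
  --checkpoint=fast --progress --verbose
```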
The key detail we did not understand in the moment was that this behavior was normal: `pg_basebackup` could sit silently while waiting for the primary to start streaming data. That was not clear in our runbooks, and under pressure it looked like another failure. Around 23:27 UTC, an engineer thought leftover files in the PostgreSQL data directory might be blocking replication and ran the wipe command again, believing they were still on the secondary host. They were on the primary.
The command was stopped within a second or two, but by then around 300 GB of database data had already been removed. GitLab.com had to be taken offline immediately.
We then discovered that almost every backup assumption we had was weaker than we thought. The secondary was already unusable because we had wiped it as part of the manual resync procedure. The scheduled `pg_dump` backups were not available because they had been running with PostgreSQL 9.2 binaries against a PostgreSQL 9.6 database and failing. Worse, the cron failure notifications were being rejected because the backup emails were not DMARC-signed, so nobody knew the backups were broken. Azure disk snapshots were not enabled for the database servers because we had assumed the other backup layers were sufficient. The only recent usable copy was the manual LVM snapshot that had been copied into staging about six hours earlier.
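The `pg_dump` failure in particular would have been cheap to prevent. A minimal sketch of the kind of guard that would have flagged it, assuming a 9.x-style version string; the host and wiring are illustrative:

```bash
#!/usr/bin/env bash
# Hypothetical pre-backup guard: refuse to dump if the pg_dump binary's
# major version does not match the server's (9.x-style version numbers).
set -euo pipefail

server_major=$(psql -h db1.cluster.gitlab.com -U postgres -At \
  -c "SHOW server_version;" | cut -d. -f1,2)
dump_major=$(pg_dump --version | awk '{print $NF}' | cut -d. -f1,2)

if [ "$server_major" != "$dump_major" ]; then
  echo "pg_dump ${dump_major} does not match server ${server_major}; refusing to run backup" >&2
  exit 1
fi
```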
That left us with an ugly recovery plan: copy the PostgreSQL data directory from staging back into production, restore as much missing metadata as we could, and accept that anything written after 17:20 UTC in the database would be gone. We also had to recover webhooks separately because the staging refresh process stripped them out to avoid accidental delivery from staging data. The actual copy took far longer than we wanted because staging was running on Azure classic storage without Premium disks. The bottleneck was raw disk throughput, roughly 60 Mbps. There was no clever optimization to save us; we just had to wait for slow network-backed disks to move production back into place.
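For a sense of scale, a quick back-of-the-envelope estimate; the data directory size below is an illustrative round figure, not a number from our systems:

```bash
# ~60 Mbps is about 7.5 MB/s, or roughly 27 GB per hour. At that rate,
# moving a data directory of a few hundred gigabytes takes most of a day.
echo "scale=1; (450 * 1024) / (7.5 * 3600)" | bc   # ≈ 17 hours, before any overhead
```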
Once the restored data was back on the production host, we rebuilt the service from the six-hour-old snapshot, restored webhook data from a second copy path, incremented sequences to avoid reusing IDs, and gradually brought GitLab.com back online. The service outage was public the entire time, and we kept status updates flowing through a public document, Twitter, and even a recovery livestream because there was no hiding what had happened.
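The sequence step is worth spelling out, because skipping it would let new rows reuse IDs that had already been handed out during the lost window. A hedged illustration for a single sequence; the sequence name, database name, and offset are made up, and in practice a statement like this would be generated for every sequence in the schema:

```bash
# Illustrative only: push one sequence comfortably past its snapshot-time
# value so new inserts cannot collide with IDs issued after 17:20 UTC.
psql -d gitlabhq_production -c \
  "SELECT setval('projects_id_seq', (SELECT last_value FROM projects_id_seq) + 10000);"
```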
The most painful part was not just the downtime. It was the data loss. We permanently lost database changes made between 17:20 UTC and roughly 23:25 UTC on January 31: roughly 5,000 projects, 5,000 comments, and about 700 new user accounts. Repositories and wikis were stored separately and were not lost, but a database-backed SaaS losing six hours of relational state is still a serious failure.
03 Timeline
17:20 UTC on January 31, 2017 — A manual LVM snapshot of the production database is taken and loaded into staging for pgpool-II testing.
19:00 UTC — Database load rises sharply; users begin failing to post comments and other write-heavy actions become unstable.
23:00 UTC — PostgreSQL secondary replication falls too far behind and stops after required WAL segments are no longer available.
23:00-23:25 UTC — Engineers wipe the secondary data directory, attempt `pg_basebackup`, increase `max_wal_senders` from `3` to `32`, reduce `max_connections` from `8000` to `2000`, and continue troubleshooting the silent hang.
23:27 UTC — An engineer accidentally wipes the PostgreSQL data directory on the primary instead of the secondary; the command is stopped within seconds, but roughly 300 GB is already gone.
00:36 UTC on February 1 — Recovery begins from the staging database copy derived from the manual LVM snapshot.
17:00 UTC on February 1 — The database is restored without webhooks.
18:00 UTC on February 1 — Final restoration steps complete, webhooks are restored, and service recovery is confirmed.
04 The Resolution
Root cause: we accidentally ran a destructive PostgreSQL data-directory wipe on the primary host while trying to rebuild a broken secondary replica. That operator mistake was made much more likely by weak host differentiation, incomplete replication runbooks, and a confusing `pg_basebackup` failure mode that looked like a hang. The impact became far worse because our disaster-recovery layers were not actually available: WAL archiving was not enabled, the secondary had already been wiped for re-synchronization, scheduled `pg_dump` backups had been failing because of a PostgreSQL 9.2 versus 9.6 binary mismatch, the failure alerts for those backups were silently dropped due to DMARC rejection, and Azure disk snapshots were not enabled for the database hosts.
The immediate recovery path was restoring production from the manual LVM snapshot that had been copied into staging at 17:20 UTC, then separately recovering webhooks and advancing database sequences to avoid ID reuse. The copy from staging to production took around 18 hours because the staging environment was on slow Azure classic disks. GitLab.com was restored on February 1, 2017, with core database service back by 17:00 UTC and final restoration steps completed around 18:00 UTC.
User impact was severe. GitLab.com was unavailable for many hours, and database changes made between 17:20 UTC and roughly 23:25 UTC on January 31 were permanently lost. Git repositories and wikis were not lost because they were stored separately, but the team estimated the database data loss at roughly 5,000 projects, 5,000 comments, and 700 newly created user accounts.
Lessons: What We Learned
01 A hot standby is not a backup if the repair procedure destroys it before the replacement path is proven healthy. Replica rebuild workflows need a safer staging step than 'wipe first, recover later.'
02 Backups that are not continuously restore-tested and freshness-monitored do not exist. A green checkbox on a cron job is meaningless if nobody verifies a usable artifact can be restored.
03 Tooling that can appear hung under normal conditions is dangerous during an incident. If `pg_basebackup` can wait silently, the runbook must say so explicitly and the operator should not have to infer that from `strace`.
04 Host identity has to be impossible to miss on production shells. Clear prompts, environment markers, and standardized access flows reduce the odds of running a destructive command on the wrong machine.
05 Cost-optimized recovery infrastructure becomes the bottleneck during a disaster. Slow staging disks turned a recoverable database loss into an 18-hour copy exercise.
06 Abuse and spam-removal jobs should default to soft-delete or quarantine modes. Hard-deleting high-cardinality relational data during a spam event can compound an already stressed database.
What I'd Do Differently
I would change three things immediately. First, I would remove ad hoc shell-based replica rebuilds from the critical path and replace them with an automated, rehearsed procedure that verifies host identity before any destructive step runs. If a replica must be wiped, that action should require an explicit environment check and should be wrapped in a runbook that explains exactly how `pg_basebackup` behaves when it is healthy but waiting.
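A minimal sketch of the kind of guard I mean, wrapped around the destructive step; the expected host name, path, and confirmation flow are illustrative:

```bash
#!/usr/bin/env bash
# Sketch of an environment guard for a destructive replica rebuild step.
# The expected host, data directory, and confirmation flow are illustrative.
set -euo pipefail

expected_host="db2.cluster.gitlab.com"
data_dir="/var/lib/postgresql/9.6/main"

if [ "$(hostname -f)" != "$expected_host" ]; then
  echo "Refusing: this shell is on $(hostname -f), not $expected_host" >&2
  exit 1
fi

read -r -p "Retype the hostname to confirm wiping ${data_dir}: " confirm
if [ "$confirm" != "$expected_host" ]; then
  echo "Confirmation failed; nothing was touched" >&2
  exit 1
fi

mv "$data_dir" "${data_dir}.retired.$(date +%s)"   # move aside rather than rm -rf
```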
Second, I would treat backup validity as an operational product with a named owner. That means point-in-time recovery via WAL archiving, backup freshness metrics in Prometheus, restore drills on a fixed schedule, and alerts that do not depend on a single email path. The 9.2 versus 9.6 `pg_dump` mismatch should have been caught by automated restore tests long before an outage.
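A rough sketch of what a scheduled restore drill can look like; the bucket, scratch database, sanity-check table, and Pushgateway address are all assumptions for illustration, not our actual tooling:

```bash
#!/usr/bin/env bash
# Hypothetical nightly restore drill: restore the newest dump into a scratch
# database, run a sanity query, and publish freshness as a metric instead of
# relying on a single email path.
set -euo pipefail

latest=$(aws s3 ls s3://example-db-backups/ | sort | tail -n 1 | awk '{print $4}')
aws s3 cp "s3://example-db-backups/${latest}" /tmp/restore-drill.dump

dropdb --if-exists restore_drill
createdb restore_drill
pg_restore --no-owner -d restore_drill /tmp/restore-drill.dump
rows=$(psql -At -d restore_drill -c "SELECT count(*) FROM projects;")
dropdb restore_drill

cat <<EOF | curl --data-binary @- http://pushgateway.example:9091/metrics/job/db_restore_drill
db_restore_drill_last_success_timestamp $(date +%s)
db_restore_drill_restored_rows ${rows}
EOF
```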
Third, I would stop pretending that a staging snapshot on slow disks is an acceptable disaster-recovery tier for a primary database. We needed recent, production-grade snapshots on the database hosts themselves and a recovery environment fast enough to restore within the business damage window, not 18 hours later.