⚡ Incident Report · 👤 Silent-Quartz-59a · Public company · Gaming · 1000+ engineers · 2021
01 The Setup
We run a gaming platform with 50 million daily active players, 18,000+ servers, and 170,000 containers. Our entire infrastructure — service discovery, container orchestration, secret management, session locking, and health checks — runs on a single HashiCorp stack: Consul, Nomad, and Vault. Consul is the spine everything else talks through. If Consul gets sick, everything gets sick. We'd been running this architecture for years without major incident. The week before the outage, we made two routine changes in preparation for an upcoming peak traffic event: we enabled Consul 1.10's new streaming feature on our traffic routing backend, and we increased our node count by 50%.
02 What Happened
At 13:37 on a Thursday afternoon, Vault started showing performance degradation. One Consul server showed elevated CPU. Players were unaffected — at first. By 16:35 the online player count was at 50% of normal. Consul was in a cascade. We called in HashiCorp engineers.
The symptom was clear: Consul KV write latency had climbed from a normal 30–300ms to a consistent 2–3 seconds. Writes were timing out. Raft log replication was lagging. But the cause was not clear at all. Our first instinct was hardware degradation — the servers were aging. We replaced hardware. The latency did not improve. It got worse.
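For reference, this is the kind of number you can watch from the outside. Below is a minimal sketch of a KV write-latency probe using the official Consul Go API client; the key name and alert threshold are made up, and the point is simply that anything drifting from a 30–300ms baseline toward whole seconds is an immediate red flag.

```go
// Minimal latency probe against Consul KV, not our production tooling.
// Key name and threshold are illustrative.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Honours CONSUL_HTTP_ADDR / CONSUL_HTTP_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	kv := client.KV()

	for {
		start := time.Now()
		_, err := kv.Put(&api.KVPair{
			Key:   "ops/latency-probe", // hypothetical probe key
			Value: []byte(time.Now().Format(time.RFC3339Nano)),
		}, nil)
		elapsed := time.Since(start)

		switch {
		case err != nil:
			fmt.Printf("write failed after %v: %v\n", elapsed, err)
		case elapsed > 500*time.Millisecond: // alert threshold, tune to your baseline
			fmt.Printf("SLOW write: %v\n", elapsed)
		default:
			fmt.Printf("ok: %v\n", elapsed)
		}
		time.Sleep(5 * time.Second)
	}
}
```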
We spent two days on hardware. We replaced all Consul nodes with 128-core machines with faster NVMe SSDs. We reset the Consul cluster from a pre-outage snapshot and blocked all traffic with iptables to give it time to recover. Performance degraded again within hours. We reduced non-essential Consul usage, lowered health-check frequency, and scaled services down to minimal instance counts. Each attempt bought hours before the same failure mode returned.
Meanwhile, our monitoring was partially blind. The telemetry systems we relied on to diagnose Consul were themselves running through Consul. Circular dependency. We were flying with gauges that read the health of the thing that made the gauges work.
On the third day we went deeper — OS-level metrics, kernel-level blocking analysis. We started seeing write contention: many goroutines blocking on the same shared Go channel in the streaming implementation. The streaming feature in Consul 1.10 used fewer channels than the old long-polling model, which should have been more efficient. But under our write load, a single shared channel became a lock bottleneck. And our new 128-core NUMA machines had made it worse — more CPU cores meant more concurrent goroutines competing for the same lock on shared memory. We had upgraded our way into deeper trouble.
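The general technique here, for any Go service you operate yourself, is Go's built-in block and mutex profilers: contended channel sends show up with full stack traces. A minimal sketch follows; the listen address and sampling rates are illustrative, and Consul itself can expose equivalent /debug/pprof endpoints when its debug option is enabled.

```go
// Sketch: turning on Go's contention profilers in a service you own, so that
// blocked channel operations show up with stack traces. Address and sampling
// rates are illustrative.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
	"runtime"
)

func main() {
	// Sample roughly one blocking event per 10,000ns spent blocked; a rate of 1
	// records everything but is expensive in production.
	runtime.SetBlockProfileRate(10_000)
	// Sample 1 in 100 mutex contention events as well.
	runtime.SetMutexProfileFraction(100)

	// ... start the real service here ...

	// Then, from a workstation:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	//   (pprof) top    # contended channel sends appear under runtime.chansend
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```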
At 15:51 on day three we disabled the streaming feature globally. KV write latency dropped from 2–3 seconds to 30ms within minutes. We had half the answer.
The second half was in specific Consul leader nodes that were still showing elevated latency after streaming was disabled. We dug into BoltDB — the embedded key-value store Consul uses to persist its Raft log. BoltDB marks deleted pages as free in an internal freelist but never reclaims disk space. Each time Consul performed a snapshot (which it does regularly to compact the log), it deleted old entries. Those deleted pages accumulated in the freelist, which had grown to 7.8MB containing roughly one million page IDs. Every 16kB log write now required flushing the entire 7.8MB freelist to disk. We were doing hundreds of these per second. The affected leader nodes had 4.2GB of total log storage but only 489MB of actual data — 3.8GB of dead space. We prevented those nodes from winning Raft leader elections during recovery, bypassing the bottleneck. The long-term fix was upgrading from BoltDB to bbolt, which doesn't have the freelist issue.
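The numbers above came from inspecting the Raft store directly. A stripped-down version of that audit, run against a copy of the database file with the bbolt Go package, looks like the sketch below; never open the live raft.db while Consul holds it, and the warning threshold is illustrative.

```go
// Freelist audit against a copy of Consul's Raft store. Path comes from the
// command line; the warning threshold is illustrative.
package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: freelist-audit <copy-of-raft.db>")
	}
	db, err := bolt.Open(os.Args[1], 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	s := db.Stats()
	fmt.Printf("free pages:      %d\n", s.FreePageN)
	fmt.Printf("pending pages:   %d\n", s.PendingPageN)
	fmt.Printf("dead bytes:      %d\n", s.FreeAlloc)     // space held by free pages
	fmt.Printf("freelist bytes:  %d\n", s.FreelistInuse) // size of the freelist itself

	// Old BoltDB rewrites the whole freelist on every commit, so a
	// multi-megabyte freelist taxes every single write.
	if s.FreelistInuse > 1<<20 {
		fmt.Println("WARNING: freelist exceeds 1MiB; investigate before it becomes an incident")
	}
}
```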
Recovery itself took another 12 hours. Cache systems had to be rebuilt cold. We re-admitted players via DNS steering in roughly 10% increments. Players found the DNS pattern and started sharing early-access DNS entries on social media. By 16:45 on the 31st, 73 hours after first detection, we were back at 100%.
03 Timeline
T+0 (Oct 28, 13:37) — Vault performance degrades. Single Consul server shows high CPU. Players unaffected.
T+3h (16:35) — Online player count at 50% of normal. Consul cascade underway. HashiCorp engineers join.
T+12h (Oct 29, 02:00) — Hardware degradation suspected. Hardware replacement begins. KV latency does not improve.
T+29h (Oct 29, 19:00) — All Consul nodes replaced with 128-core NVMe machines. Latency worsens.
T+40h (Oct 30, 02:00) — Reset from pre-outage snapshot attempted. Fails within hours.
T+50h (Oct 30, 12:00) — OS-level analysis begins. Lock contention on KV writes identified.
T+50h (Oct 30, 15:51) — Streaming feature disabled. KV write latency drops from 2–3s to 30ms immediately.
T+52h (Oct 30, 20:00) — BoltDB freelist pathology identified on leader nodes. Affected nodes excluded from election.
T+60h (Oct 31, 05:00) — Services restarting at correct capacity levels.
T+67h (Oct 31, 10:00) — Players re-admitted via DNS in 10% increments.
T+73h (Oct 31, 16:45) — 100% player access restored.
04 The Resolution
Two independent root causes, both hidden in the Consul data layer.
Root cause #1: Consul 1.10's streaming feature introduced a shared Go channel as a lock bottleneck under high write throughput. The channel serialised writes that the old long-polling model had handled in parallel. On our NUMA multi-socket hardware, the more CPU cores we threw at it, the more goroutines competed for the same lock — which is why hardware upgrades made things worse. Fix: disable the streaming feature entirely. Latency returned to normal within minutes.
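A toy illustration of why more cores hurt, nothing Consul-specific: when every producer funnels through one shared channel, raising GOMAXPROCS adds contention on the channel's internal lock rather than throughput. Exact numbers vary by machine, and the gap is widest on multi-socket boxes.

```go
// Toy benchmark, not Consul code: N goroutines publishing through one shared
// channel at increasing GOMAXPROCS. Every send serialises on the channel's
// internal lock, so extra cores add contention rather than throughput.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func run(producers, eventsEach int) time.Duration {
	shared := make(chan int, 1024) // the single shared channel

	// One consumer draining events, standing in for a single fan-out loop.
	done := make(chan struct{})
	go func() {
		for range shared {
		}
		close(done)
	}()

	var wg sync.WaitGroup
	start := time.Now()
	for p := 0; p < producers; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < eventsEach; i++ {
				shared <- i // every producer contends here
			}
		}()
	}
	wg.Wait()
	close(shared)
	<-done
	return time.Since(start)
}

func main() {
	for _, procs := range []int{2, 8, 64} {
		runtime.GOMAXPROCS(procs)
		fmt.Printf("GOMAXPROCS=%-3d elapsed=%v\n", procs, run(256, 20_000))
	}
}
```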
Root cause #2: BoltDB's freelist design accumulates deleted page metadata indefinitely. Consul's regular snapshotting behaviour caused the freelist to grow to 7.8MB. Every write flushed the full freelist to disk, turning a 16kB I/O into a 7.8MB I/O. It affected only certain leader nodes. Fix: prevent those nodes from winning leader election during recovery. Long-term fix: migrate to bbolt (the maintained BoltDB fork), which supports an opt-in hash-map freelist with O(1) operations instead of O(n).
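If you run your own BoltDB-backed service, the relevant bbolt options look roughly like this sketch. The path is hypothetical; Consul manages its own store, so for Consul you get the benefit by upgrading rather than by writing code like this.

```go
// Sketch: opening a bbolt database with the hash-map freelist and without
// persisting the freelist on each commit. The path is hypothetical.
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("/var/lib/myservice/data.db", 0o600, &bolt.Options{
		FreelistType:   bolt.FreelistMapType, // O(1) free-page bookkeeping instead of a sorted array
		NoFreelistSync: true,                 // rebuild the freelist on open instead of writing it every commit
	})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// ... buckets and transactions as usual ...
}
```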
No user data was lost. Recovery time from root cause identification to full player access: approximately 26 hours.
Lessons: What We Learned
01 If your monitoring infrastructure depends on the system it is monitoring, you are blind during the worst possible moments. Decouple telemetry from the critical path — run at least one observability stack that has zero dependency on your service mesh.
02 Enabling a new feature and increasing capacity by 50% in the same change window means that when something breaks, you have two suspects and no clean way to isolate either. Separate such changes by at least one release cycle.
03 Upgrading hardware to fix a software contention bug can make it worse. On NUMA multi-socket architectures, more cores means more goroutines competing for shared locks. Profile for contention before reaching for bigger machines.
04 BoltDB's freelist is a known design flaw for write-heavy workloads with frequent snapshots. If you run Consul or any BoltDB-backed service at scale, audit your freelist size now — not during an incident. Switch to bbolt before it becomes a problem.
05 A single Consul cluster as the foundation of your entire stack is a single point of failure. Partition critical services into dedicated clusters so a failure in one domain does not cascade to all others.
06 Cold-start recovery is a different problem than incremental scaling. Deployment tooling designed for steady-state adjustments will fail when you need to rebuild caches from zero at 3am. Design and test cold-start runbooks explicitly.
What I'd Do Differently
We would have caught the streaming bug before it hit production if we had a canary deployment step for Consul configuration changes — roll to 5% of nodes, measure KV write latency for 24 hours, then proceed. The freelist issue we could have caught with a routine audit of BoltDB metadata size on each leader — a simple script checking freelist page count against a threshold, run weekly. Neither check existed. Both are now standard.