⚡ Incident Report · 👤 Storm-Anchor-47a · Public company infrastructure · 1000+ engineers · 2025
01 The Setup
We run one of the largest edge networks in the world — millions of requests per second, across hundreds of data centers in over 100 countries. Our network sits between users and the websites they visit, handling DNS, CDN, DDoS mitigation, bot management, serverless compute, and zero-trust access.
One of our core products is a Bot Management system that uses machine learning to distinguish legitimate traffic from bots. This system relies on a "feature file" — a list of ML features and their configurations — that gets compiled from a database and distributed to every machine across our global network. The feature file is generated by a pipeline that queries a database, transforms the results into the format our ML runtime expects, and then pushes the file out globally. The runtime has a hard limit on the number of features it can load. We had never hit that limit in production. The system had been stable for years.
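In rough terms, the generation pipeline looked like the sketch below. This is a minimal illustration only; the function names and the limit value are hypothetical stand-ins, not our real internals.

```python
# Minimal sketch of the feature file pipeline described above. Function names
# and the limit value are hypothetical, not the real internal APIs.

MAX_FEATURES = 200  # hypothetical: the ML runtime enforces a hard feature count limit


def build_feature_file(rows):
    """Transform raw database rows into the format the ML runtime expects."""
    return [{"name": r["feature_name"], "config": r["config"]} for r in rows]


def generate_and_publish(fetch_feature_rows, publish_globally):
    rows = fetch_feature_rows()              # query the feature database
    feature_file = build_feature_file(rows)  # compile into the runtime format
    publish_globally(feature_file)           # push the file to every machine
    # Note: the size limit lived only inside the proxy runtime; nothing at this
    # stage rejected an oversized file before it went out globally.
```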
02 What Happened
On November 18, 2025 at approximately 11:05 UTC, a routine database permissions change was applied. The change was meant to grant additional access to a service account. What nobody anticipated was that the new permissions caused the query that generates the Bot Management feature file to return duplicate rows — specifically, the response now included all metadata from an internal schema (referred to internally as the "r0 schema"), effectively more than doubling the number of rows in the output.
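To make the failure mode concrete, here is a hedged illustration. The row shapes and the feature name are hypothetical; only the "r0" schema name comes from the incident itself. The essential point is that the generation query did not constrain which schema it read from, so newly visible metadata doubled the row count.

```python
# Hypothetical illustration of the duplicate-row effect. Before the grant, the
# service account could see only one schema, so an unfiltered metadata query
# returned one row per feature. After the grant, the same query also returned
# rows from the newly visible r0 schema.

visible_rows = [
    {"schema": "default", "feature_name": "example_feature", "config": "{}"},
    {"schema": "r0",      "feature_name": "example_feature", "config": "{}"},  # newly visible
]

# What the query effectively did: take every visible row.
unfiltered = visible_rows                                          # two rows per feature

# What it needed to do: constrain the schema explicitly.
filtered = [r for r in visible_rows if r["schema"] == "default"]   # one row per feature

assert len(unfiltered) == 2 * len(filtered)
```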
The pipeline dutifully compiled this bloated result into a new feature file and pushed it to every machine on our network. At 11:20 UTC, the first signs of trouble appeared in our internal monitoring: proxy processes across the network began failing. The Bot Management runtime attempted to load the feature file, found that it exceeded its hard-coded ML feature limit, and crashed. But it did not crash quietly. The proxy process that handles HTTP traffic reads the feature file at startup. When the feature file was invalid, the proxy could not initialize correctly, and HTTP requests started returning 500 errors.
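The loading behavior amounted to a fail-closed check, sketched below with a hypothetical limit value and function name. The effect is the one described above: an oversized file aborted proxy initialization instead of degrading gracefully.

```python
# Sketch of the fail-closed startup behavior described above; the names and
# limit value are hypothetical.

MAX_FEATURES = 200  # hypothetical hard-coded limit in the ML runtime


class FeatureFileError(Exception):
    pass


def load_features_at_startup(feature_file):
    if len(feature_file) > MAX_FEATURES:
        # The real proxy treated this as fatal: initialization failed and the
        # requests it handled surfaced as HTTP 500 errors.
        raise FeatureFileError(
            f"{len(feature_file)} features exceeds hard limit of {MAX_FEATURES}"
        )
    return feature_file
```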
By 11:28 UTC, customer-facing HTTP errors were spiking globally. Users across the internet saw error pages when trying to reach any site behind our network.
Our initial response was wrong. The symptoms — a sudden, global spike in errors across all regions simultaneously — looked like a hyper-volumetric DDoS attack. Our incident response team spent critical early minutes investigating attack vectors before ruling out external causes. This misdiagnosis cost us time.
The cascade spread beyond HTTP proxying. Our serverless compute platform (Workers), our key-value store (KV), our zero-trust access gateway, our bot verification challenge system (Turnstile), and even our internal dashboard all depend on the same proxy layer or on services that route through it. When the proxies went down, these services went down with them. Engineers trying to access the dashboard to diagnose the issue found the dashboard itself was unreachable — a circular dependency that slowed response further.
Once we correctly identified the feature file as the culprit, the fix was conceptually simple: stop the propagation of the bad file and replace it with the last known good version. But executing that fix across a global network of hundreds of data centers, while the tooling we normally use to push changes was itself partially broken, took significant coordination.
[inferred] Break-glass access procedures that bypass the normal deployment path were invoked, but some of these had their own dependencies on affected systems, adding friction to the recovery.
03 Timeline
T+0 (11:05 UTC) — Database permissions change applied. Feature file generation query begins returning duplicate rows.
T+15min (11:20 UTC) — Oversized feature file propagated globally. Proxy processes begin failing across the network.
T+23min (11:28 UTC) — Customer-facing HTTP 500 errors spike. Global impact confirmed.
T+~30min — Incident response begins. Initial hypothesis: hyper-volumetric DDoS attack. Investigation focused on external attack vectors.
T+~1h — DDoS ruled out. Internal investigation pivots to configuration changes. Feature file identified as suspect.
T+~2h — Propagation of bad feature file halted. Rollback to last known good version begins.
T+3h25min (14:30 UTC) — Core HTTP traffic largely restored across the network.
T+6h01min (17:06 UTC) — Full service restoration confirmed. All dependent services (Workers, KV, Access, Turnstile, Dashboard) operational.
04 The Resolution
Root cause: a database permissions change caused a feature file generation query to return duplicate rows by exposing additional schema metadata. This more than doubled the size of the Bot Management feature file. When propagated globally, the file exceeded a hard-coded ML feature count limit in the proxy runtime, causing proxy crashes and HTTP 500 errors across the entire network.
The immediate fix was stopping propagation of the oversized feature file and replacing it with the previous version. Core HTTP traffic was restored by 14:30 UTC, approximately 3 hours and 10 minutes after the first proxy failures at 11:20 UTC. Full restoration of all dependent services — including serverless compute, key-value storage, access gateway, and the internal dashboard — took until 17:06 UTC, roughly 5 hours and 46 minutes after impact began.
No customer data was lost. The impact was availability, not integrity. An estimated 2.4 billion monthly active internet users were affected across thousands of websites and applications that route through the network. In the aftermath, the company declared "Code Orange" and launched a company-wide initiative called "Fail Small" that prioritized resilience engineering above all other work. The plan mandated phased rollouts for all configuration changes (not just code), modeled on the existing Health Mediated Deployment process, with staged validation and automatic rollback. It also addressed circular dependencies in incident response tooling and break-glass procedures.
Lessons: What We Learned
01 Configuration changes — not just code deploys — need staged rollouts with validation gates. A database permissions change is a configuration change, and it should have been rolled out to a canary set of machines before going global.
02 Hard-coded limits in runtime systems must fail gracefully. The ML feature limit should have caused the proxy to fall back to the previous feature file, not crash entirely. Fail-open defaults preserve availability when a non-critical subsystem receives bad input.
03 When your incident response dashboard depends on the system that is down, you cannot respond to the incident. Break-glass tooling and observability must have zero dependency on the production traffic path.
04 Initial misdiagnosis as a DDoS attack cost critical minutes. Runbooks should include 'rule out internal config changes in the last 6 hours' as a first-pass check before assuming an external attack, especially when the failure is simultaneously global.
05 A single feature file propagated atomically to every machine in every data center is a global blast radius by design. Phased propagation with health checks between stages would have contained this to a small percentage of traffic.
06 Circular dependencies between production systems and the tooling used to fix them are a reliability anti-pattern that only surfaces during major incidents. Map and eliminate these dependencies before you need them.
What I'd Do Differently
The most impactful change would have been treating the feature file pipeline as a deployment pipeline rather than a data pipeline. Every code deploy goes through staged rollout — canary, then 5%, then 25%, then global. The feature file was pushed atomically to 100% of machines with no intermediate health check. If we had rolled it out to a single data center first and validated proxy health for even 60 seconds, we would have caught the crash immediately and the blast radius would have been one city, not the entire internet.
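As a sketch only, and assuming hypothetical helpers (push_to_fraction, proxy_error_rate, rollback) rather than our actual deployment tooling, a phased rollout with a health gate between stages might look like this:

```python
# Hypothetical sketch of a staged rollout with health gates between stages.
# push_to_fraction, proxy_error_rate, and rollback are assumed helpers, not
# real internal APIs.
import time

STAGES = [0.001, 0.05, 0.25, 1.0]  # single data center, then 5%, 25%, global
ERROR_THRESHOLD = 0.01             # hypothetical acceptable 5xx rate
SOAK_SECONDS = 60                  # validate proxy health before widening


def rollout(feature_file, push_to_fraction, proxy_error_rate, rollback):
    for fraction in STAGES:
        push_to_fraction(feature_file, fraction)
        time.sleep(SOAK_SECONDS)
        if proxy_error_rate() > ERROR_THRESHOLD:
            rollback(fraction)  # revert the affected fraction and stop
            raise RuntimeError(f"rollout aborted at {fraction:.1%} of machines")
```

With a gate like this, the oversized file would have failed the first stage and never reached the rest of the network.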
Second, the hard limit in the ML runtime should have been a soft limit with a fallback. When the feature count exceeds the threshold, the runtime should log an error, load the previous known-good file from a local cache, and continue serving traffic with stale bot scores rather than crashing the entire proxy. Availability should always win over freshness for a non-security-critical ML feature set.
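A minimal sketch of that fail-open behavior, assuming a hypothetical cache path and limit value:

```python
# Hypothetical sketch of a fail-open loader: an oversized file is logged and
# rejected, and the proxy falls back to the locally cached known-good copy
# instead of crashing. The path and limit are assumptions.
import json
import logging

MAX_FEATURES = 200
CACHE_PATH = "/var/cache/bot-mgmt/last_good_features.json"  # hypothetical


def load_features(new_file):
    if len(new_file) <= MAX_FEATURES:
        with open(CACHE_PATH, "w") as f:
            json.dump(new_file, f)  # refresh the known-good cache
        return new_file
    logging.error(
        "feature file has %d entries (limit %d); serving cached copy",
        len(new_file), MAX_FEATURES,
    )
    with open(CACHE_PATH) as f:
        return json.load(f)  # stale bot scores, but the proxy keeps serving
```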
Third, I would have ensured that at least one path to push emergency configuration changes did not transit through our own proxy layer. The fact that engineers could not access the dashboard because the dashboard itself routed through the broken proxies is a dependency loop that should have been caught in a tabletop exercise.