⚡ Incident Report · Electric-Beacon-41a · Public company infrastructure · 1000+ engineers · 2026
01 · The Setup
I work on cloud control-plane infrastructure that provisions virtual machines, scale sets, Kubernetes nodes, and the supporting identity and extension systems around them. One of the less glamorous but absolutely critical pieces of that path is the VM extension package storage layer. VM extensions are the small agents and bootstrap packages we use during provisioning and lifecycle operations to configure, secure, and manage machines. Those artifacts are fetched very early in the workflow, so a subset of Microsoft-managed storage accounts is intentionally configured to allow anonymous read access for platform functionality. At the same time, we run security remediation workflows designed to disable anonymous access everywhere it is not explicitly required. The assumption behind that control was straightforward: policy enforcement would target only the accounts that should be locked down, safe-deployment checks would catch bad rollout signals quickly, and any issue would stay contained to a limited region.
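For context on why that anonymous path has to exist: at the moment an extension artifact is fetched, the VM may not yet have a managed identity or any credentials to present, so the fetch is an unauthenticated read. A minimal sketch in Python, with a placeholder URL and hypothetical names rather than our production code:

```python
import urllib.error
import urllib.request

# Hypothetical early-boot fetch. At this stage of provisioning the VM has no
# managed identity yet, so there are no credentials to attach: the artifact
# endpoint must allow anonymous reads or bootstrap fails outright.
ARTIFACT_URL = "https://example-extension-store.invalid/packages/guest-agent.zip"

def fetch_extension_artifact(url: str, timeout_s: float = 30.0) -> bytes:
    """Fetch an extension package with no credentials attached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        # A 403 here (anonymous access disabled) blocks the entire create,
        # scale, or reimage operation that needed this package.
        raise RuntimeError(f"artifact fetch failed: HTTP {err.code}") from err
```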
02 · What Happened
That assumption failed on February 2, 2026. At 18:03 UTC, a remediation workflow executed from a resource policy intended to disable anonymous access on Microsoft-managed storage accounts. Due to a data synchronization defect in the targeting logic, the workflow was applied to a subset of storage accounts that were intentionally configured to allow anonymous reads for VM extension artifacts. The control did exactly what it was told to do, just on the wrong assets.
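I cannot share the actual targeting code, but the failure pattern is easy to reconstruct. A schematic sketch in Python, with invented names: targeting reads from a synchronized copy of the account inventory, and a stale sync that has lost the exception flag turns exempt accounts into enforcement targets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Account:
    name: str
    anonymous_reads_required: bool  # the platform exception flag

def remediation_targets(live_inventory, synced_inventory):
    """Select accounts to lock down, honoring exceptions from the synced copy.

    The defect pattern: if synchronization lags or drops the exception flag,
    accounts that must stay anonymously readable get selected anyway.
    """
    exceptions = {a.name for a in synced_inventory if a.anonymous_reads_required}
    return [a for a in live_inventory if a.name not in exceptions]

# With a healthy sync, 'ext-packages-eastus' keeps its exception flag and is
# skipped. With a stale sync that lost the flag, it becomes a target.
live = [Account("ext-packages-eastus", True), Account("logs-scratch", False)]
stale_sync = [Account("ext-packages-eastus", False), Account("logs-scratch", False)]

print([a.name for a in remediation_targets(live, stale_sync)])
# -> ['ext-packages-eastus', 'logs-scratch']  (both get locked down)
```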
The impact was immediate but not yet obvious. At 18:29 UTC, internal monitoring started showing a subset of regions with rising control-plane failures. By 19:46 UTC, we had correlated the issue across multiple regions, and by 19:55 UTC service monitoring showed failure rates above thresholds. The failure mode was brutal in its simplicity: when those storage accounts stopped serving anonymous reads, VMs and dependent services could no longer retrieve required extension packages during create, delete, reimage, scaling, and other lifecycle operations.
That broke far more than virtual machine provisioning. Azure Virtual Machines and VM Scale Sets started failing lifecycle operations. Azure Kubernetes Service node provisioning and extension installation were impacted. Azure DevOps and GitHub Actions pipelines that needed VM extensions or related packages started failing. GitHub Actions hosted runners queued and timed out waiting for capacity. Other GitHub features that use the same compute substrate, including Copilot Coding Agent, Copilot Code Review, CodeQL, Dependabot, GitHub Enterprise Importer, Pages, and Codespaces, were also hit. What should have been a targeted security remediation had turned into a global dependency outage.
The rollout was supposed to follow safe deployment practices, region by region, but a health assessment defect let the incorrect targeting state propagate broadly across public cloud regions in a short window. That was the first hard lesson of the night: if your health gate cannot see the real customer dependency, staged rollout just gives you a slower path to global failure.
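To make that concrete, here is an illustrative sketch of the difference between the gate we effectively had and the gate we needed. Neither function is our deployment system's real interface.

```python
# The first gate is the trap: applying the wrong policy still "succeeds",
# so the rollout advances while VM provisioning burns down behind it.
def gate_on_job_success(job_succeeded: bool) -> bool:
    return job_succeeded

# The second gate exercises the dependency customers actually hit. The probe
# should be a real end-to-end canary in the just-changed region: create a VM,
# install an extension, tear it down.
def gate_on_customer_path(probe) -> bool:
    try:
        probe()
        return True
    except Exception:
        return False  # fail closed: halt the rollout instead of advancing
```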
At 20:10 UTC, we were fully engaged in mitigation planning. We validated a proposed fix in a test instance by 21:15 UTC, then at 21:18 UTC disabled the remediation workflow to stop it from touching additional storage accounts. At 21:50 UTC we started the broader rollback, re-enabling anonymous access region by region on the affected storage accounts. Customers began to see improvement as the extension package layer came back.
Then we triggered the second phase of the incident. As storage access was restored, blocked provisioning and lifecycle workflows resumed concurrently across infrastructure orchestrators and dependent services. A few high-volume VM extensions that used managed identity finished re-enablement in East US at 00:07 UTC on February 3. One minute later, at 00:08 UTC, customer impact increased again, first in East US and then in West US.
The managed identity backend was now the bottleneck. Automatic retries from upstream components, which were supposed to provide resilience during transient errors, amplified traffic into a storm. The cumulative load exceeded service limits in East US, and resilient routing pushed more retry traffic into West US, saturating that region as well. We had turned a storage-policy mistake into a recovery-load collapse.
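Lesson 04 below is about exactly this, but the shape of the fix is worth sketching here: retries that draw from a shared budget and back off with jitter cannot multiply into a storm, because the budget only refills on success. This is a minimal illustration with invented parameters, not our actual client library.

```python
import random
import time

class RetryBudget:
    """Shared token pool that only refills when calls succeed."""

    def __init__(self, capacity: float = 100.0, refill_per_success: float = 1.0):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_success = refill_per_success

    def try_spend(self) -> bool:
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # budget exhausted: surface the error, do not retry

    def on_success(self) -> None:
        self.tokens = min(self.capacity, self.tokens + self.refill_per_success)

def call_with_retries(op, budget: RetryBudget, max_attempts: int = 4):
    for attempt in range(max_attempts):
        try:
            result = op()
            budget.on_success()
            return result
        except Exception:
            if attempt == max_attempts - 1 or not budget.try_spend():
                raise  # when the dependency is down, callers collectively stop hammering it
            # Full jitter prevents synchronized retry waves across callers.
            time.sleep(random.uniform(0, min(30.0, 0.5 * 2 ** attempt)))
```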
At 00:14 UTC, automated alerting identified availability impact to Managed Identity services in East US and we began scaling out. The first scale-out wave finished at 00:50 UTC and did not meaningfully reduce impact because the retry backlog kept growing faster than the added capacity. Every failed call was being retried by multiple upstream layers, so each unit of new capacity faced several units of offered load; we were chasing a multiplying target. A second, larger scale-out completed at 02:00 UTC with the same result. Capacity alone was not enough; we needed to control the traffic pattern.
At 03:55 UTC, we began removing traffic so the managed identity infrastructure could recover without load. Once the repaired nodes were healthy, we gradually ramped traffic back at 04:25 UTC, allowing the backlog to drain safely instead of detonating the service again. By 06:05 UTC, backlogged operations had completed and services returned to normal operating levels.
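The ramp looked roughly like the following sketch: admit a small, consistent slice of traffic first and widen it only while the backend stays healthy. The schedule, percentages, and hashing scheme here are invented for illustration.

```python
import time

def admitted_fraction(seconds_since_reopen: float, ramp_minutes: float = 30.0) -> float:
    """Linear ramp from 5% to 100% of traffic over the ramp window."""
    progress = min(1.0, seconds_since_reopen / (ramp_minutes * 60.0))
    return 0.05 + 0.95 * progress

def should_admit(request_hash: int, reopened_at: float) -> bool:
    # Hash-based sampling keeps the same callers admitted consistently, so
    # admitted workflows can finish instead of flapping in and out of service.
    frac = admitted_fraction(time.time() - reopened_at)
    return (request_hash % 1000) / 1000.0 < frac
```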
03 · Timeline
18:03 UTC on February 2, 2026 — Customer impact begins when the periodic remediation workflow starts disabling anonymous access on mis-targeted storage accounts.
18:29 UTC — Internal monitoring detects rising control-plane failures in a subset of regions.
19:46 UTC — Engineers correlate the issue across multiple regions.
19:55 UTC — Service monitoring confirms failure rates have crossed alert thresholds.
20:10 UTC — Incident teams begin joint mitigation planning and root-cause investigation.
21:15 UTC — Primary mitigation is validated on a test instance.
21:18 UTC — The remediation workflow is disabled to stop further storage account changes.
21:50 UTC — Region-by-region rollback begins to restore anonymous read access on impacted storage accounts.
00:07 UTC on February 3, 2026 — Re-enablement completes for a few high-volume VM extensions in East US.
00:08 UTC — Managed Identity service impact begins in East US and cascades to West US under recovery load and retries.
00:14 UTC — Alerting detects the identity-service degradation and scale-out begins.
00:50 UTC — First managed identity scale-out completes but is insufficient because retry amplification continues.
02:00 UTC — Second, larger scale-out completes and still fails to clear the backlog.
03:55 UTC — Engineers remove traffic so managed identity infrastructure can recover without load.
04:25 UTC — Traffic is gradually reintroduced after infrastructure recovers.
06:05 UTC — Backlogged operations drain and full recovery is confirmed.
04 · The Resolution
Root cause: a policy-driven remediation workflow intended to disable anonymous access on Microsoft-managed storage accounts was mis-targeted because of a data synchronization defect in the workflow's targeting logic. That policy was incorrectly applied to storage accounts intentionally configured for anonymous reads in the VM extension package storage layer. As a result, VM extension artifacts became unavailable and control-plane operations failed across multiple Azure services and downstream platforms, including GitHub Actions hosted runners and Codespaces.
The immediate fix was to disable the remediation workflow and restore anonymous access on the affected storage accounts region by region. That addressed the first phase of the outage, but recovery traffic then overloaded the Managed Identity backend in East US and West US due to retry amplification. Recovery required more than scaling out: engineers had to shed load, repair the infrastructure without live traffic, and then gradually reintroduce requests so the backlog could drain safely.
The initial Azure platform impact ran from 18:03 UTC on February 2 until 00:30 UTC on February 3 for most regions, with the secondary managed identity phase lasting until 06:05 UTC. GitHub reported Actions hosted runners unavailable between 18:35 UTC and 22:15 UTC, with degraded recovery until 23:10 UTC for standard runners and 00:30 UTC for larger runners. No data loss was reported. The damage was to availability and control-plane operations: VM provisioning, scaling, extension installs, identity-backed resource management, and multiple dependent developer services failed or stalled worldwide.
Lessons · What We Learned
01. Security remediations need an explicit allowlist and pre-flight diff for platform exceptions. If some storage accounts must stay publicly readable for bootstrap paths, policy automation must prove those exceptions before it is allowed to enforce.
02. A staged rollout is only as safe as its health signals. If your deployment gate cannot observe customer-facing VM lifecycle failures, it cannot prevent a global blast radius.
03. Recovery paths need admission control just as much as failure paths do. Releasing every blocked workflow at once turned rollback into a second outage.
04. Retry logic without budgets, jitter, and coordinated load shedding can convert a transient dependency failure into a regional saturation event.
05. Regional failover must preserve isolation. Redirecting retry storms from East US into West US spread the damage instead of containing it (see the sketch after this list).
06. Hidden dependencies such as extension artifact storage and managed identity backends need first-class observability, because customers experience them as core platform health even if they are 'internal' services.
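On lesson 05, a minimal sketch of headroom-aware failover, with invented capacities and thresholds: if no region can absorb the overflow with margin to spare, shed load locally rather than exporting the storm.

```python
# Redirecting a retry storm into a region that cannot absorb it just moves
# the outage. Check the target's headroom first; the 20% margin is invented.
def choose_failover(overflow_rps: float, regions: dict[str, dict]) -> str | None:
    for name, stats in regions.items():
        headroom = stats["capacity_rps"] - stats["current_rps"]
        if headroom >= overflow_rps * 1.2:
            return name
    return None  # no safe target: shed locally instead of spreading damage

regions = {"westus": {"capacity_rps": 50_000, "current_rps": 46_000}}
print(choose_failover(overflow_rps=8_000, regions=regions))  # -> None
```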
What I'd Do Differently
I would make three changes. First, I would treat policy remediation the same way I treat a production code deploy: dry-run the exact target set, diff it against a protected inventory of exception accounts, and require customer-impact canaries that exercise real VM create, scale, and extension-install paths before the rollout can advance beyond a small region. A security workflow that can touch bootstrap dependencies should never rely on targeting data alone.
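As a sketch of that pre-flight, assuming a reviewed exception inventory stored separately from the targeting data (all names hypothetical):

```python
# Protected bootstrap dependencies, maintained and reviewed independently of
# the targeting pipeline, so a sync defect cannot silently erase them.
PROTECTED_EXCEPTIONS = {"ext-packages-eastus", "ext-packages-westus"}

def preflight(targets: set[str], protected: set[str] = PROTECTED_EXCEPTIONS) -> set[str]:
    """Refuse to enforce if the resolved target set touches any protected account."""
    overlap = targets & protected
    if overlap:
        raise SystemExit(
            f"refusing to enforce: {sorted(overlap)} are protected bootstrap "
            "dependencies; targeting data disagrees with the exception inventory"
        )
    return targets  # the dry-run target list still goes to review before enforcement
```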
Second, I would build a recovery mode for the managed identity backend that is activated automatically when a large blocked queue begins to reopen. That mode should enforce retry budgets, cap concurrency per upstream system, and release deferred work in controlled batches instead of letting every orchestrator reconnect at once.
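A sketch of that batched release, with invented batch sizes and caps:

```python
import collections

def release_in_batches(backlog, batch_size: int = 500, per_upstream_cap: int = 50):
    """Yield (upstream, work_item) batches, capping each upstream per batch.

    The cap keeps any single orchestrator from monopolizing the recovering
    backend; the caller waits for a health check between batches.
    """
    queues = collections.defaultdict(collections.deque)
    for upstream, item in backlog:
        queues[upstream].append(item)
    while any(queues.values()):
        batch, counts = [], collections.Counter()
        for upstream, q in queues.items():
            while q and counts[upstream] < per_upstream_cap and len(batch) < batch_size:
                batch.append((upstream, q.popleft()))
                counts[upstream] += 1
        yield batch
```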
Third, I would rehearse rollback and cross-team engagement more aggressively, with the underlying provider boundaries in mind. We lost too much time and too much safety margin discovering, mid-incident, which telemetry gaps mattered and which downstream systems would avalanche during recovery. Those should have been known failure modes with pre-approved traffic controls before the night started.