Untangling Terraform State Drift 30 Minutes Before a Board Demo
👤 @maya_sre · infrastructure · 10-30 engineers · 2025
01. The Setup
At 1:30 PM, we had exactly the wrong kind of change window. We were an infrastructure tooling company, our production stack lived in Terraform with an S3 backend and DynamoDB locking, and the board was arriving at 2 PM for a demo of our new multi-region failover flow. One engineer was applying a change to roll out the new ECS task definition and ALB wiring for the demo build. A second engineer saw a stale-looking lock from their terminal, assumed the first run had died, force-unlocked it, and kicked off a different apply.
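For context on why the force-unlock was avoidable: the DynamoDB lock item records who holds the lock and since when, so a stale-looking lock can be interrogated before anyone overrides it. A minimal sketch, with a hypothetical table name and state key (Terraform uses "<bucket>/<state key>" as the LockID):

```
# Inspect the lock before touching it. Table and LockID values
# here are hypothetical placeholders, not our real names.
aws dynamodb get-item \
  --table-name terraform-locks \
  --key '{"LockID": {"S": "my-tf-state/prod/terraform.tfstate"}}' \
  --query 'Item.Info.S' --output text
```

The returned JSON includes Who, Operation, and Created; only if those clearly point at a dead run does `terraform force-unlock <LOCK_ID>` become defensible, and even then with a second person watching.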
02. What Happened
The state file was not corrupted so much as suddenly untrustworthy. After the overlapping applies, Terraform's view of AWS no longer matched what actually existed: a few resources had been created, some addresses had moved, and the next plan wanted to replace 47 production resources including parts of the app tier and database plumbing. That was the moment we stopped treating it like a deployment problem and declared an incident.
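A plan that wants to replace 47 resources is much easier to triage from the machine-readable plan than by scrolling terminal output. A quick sketch, assuming jq is installed (the plan file name is arbitrary):

```
# List every resource address the plan would destroy or replace.
# A replace shows up as actions containing "delete".
terraform plan -out=incident.tfplan
terraform show -json incident.tfplan | jq -r '
  .resource_changes[]
  | select(.change.actions | index("delete"))
  | .address'
```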
03. Timeline
1:30 PM - Engineer A starts apply for the demo-related service change
1:33 PM - Engineer B force-unlocks and starts a second apply
1:35 PM - Both runs fail and the next plan shows destructive drift
1:36 PM - Terraform freeze is called in Slack
1:40 PM - Previous state version is pulled from S3 for comparison (see the recovery sketch after this timeline)
1:48 PM - Three partially created resources are imported back into state
1:55 PM - terraform plan is clean again
2:00 PM - We do the demo on the already-running build and leave prod alone
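For reference, pulling a pre-incident state version out of a versioned S3 backend looks roughly like this. The bucket, key, and version-id placeholder are hypothetical stand-ins:

```
# Find the last state version written before the overlapping applies.
aws s3api list-object-versions \
  --bucket my-tf-state --prefix prod/terraform.tfstate \
  --query 'Versions[].[VersionId,LastModified]' --output table

# Download that version alongside the current (suspect) state.
aws s3api get-object --bucket my-tf-state \
  --key prod/terraform.tfstate \
  --version-id VERSION_ID_BEFORE_INCIDENT pre-incident.tfstate
terraform state pull > current.tfstate

# Diff with sorted keys to see which resources moved or vanished.
diff <(jq -S . pre-incident.tfstate) <(jq -S . current.tfstate)
```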
04. The Resolution
We pulled the previous state version from S3, compared it against what actually existed in AWS, and imported the three resources that had been partially created before the lock fiasco. Once terraform plan went clean again, we made the very adult decision to stop changing production and run the demo on what was already deployed. Afterward we moved all production applies behind CI, added a loud Slack announcement for every active apply, and agreed that force-unlock in production needs a second human in the room.
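The import step looked roughly like the sketch below; the resource addresses and IDs are illustrative stand-ins, not the actual three resources:

```
# Adopt the partially created resources back into state.
# Addresses and IDs are invented for illustration.
terraform import aws_lb_target_group.demo \
  "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/demo/0123456789abcdef"
terraform import aws_ecs_service.demo demo-cluster/demo-service

# Verify: the plan should now come back with no changes.
terraform plan
```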
Lessons: What We Learned
01. If Terraform says the lock is active in production, assume it is telling the truth until proven otherwise (the pre-apply guard sketch after this list is one way to enforce that).
02. S3 versioning is only valuable if the team actually knows how to restore and reconcile state under pressure.
03. Production infrastructure changes need a narrower change window than "right before the important demo."
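A minimal sketch of what a CI-only apply guard with a loud announcement can look like; the SLACK_WEBHOOK_URL secret and CI_JOB_URL variable are assumptions for illustration, not our actual pipeline:

```
#!/usr/bin/env bash
# Pre-apply guard: announce loudly, then plan and apply in one job.
# SLACK_WEBHOOK_URL and CI_JOB_URL are hypothetical placeholders.
set -euo pipefail

announce() {
  curl -sf -X POST -H 'Content-type: application/json' \
    --data "{\"text\": \"$1\"}" "$SLACK_WEBHOOK_URL"
}

announce ":rotating_light: prod terraform apply starting (${CI_JOB_URL:-manual run})"
terraform plan -input=false -out=run.tfplan
terraform apply -input=false run.tfplan
announce ":white_check_mark: prod terraform apply finished"
```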
What I'd Do Differently
I would have separated demo-day application changes from infrastructure changes entirely. The recovery worked because the blast radius stayed small. The bigger mistake was accepting a change window where two people felt justified touching production minutes before a board meeting.