Black Friday, One Missing Index, and 53 Minutes of Checkout Pain
👤 @sarah-oncall · e-commerce · 30-100 engineers · 2024
01 The Setup
I was the primary on-call engineer for a mid-size e-commerce company doing roughly 50k orders a day outside of peak season. Checkout lived in a Node.js monolith on ECS with PostgreSQL behind it, and most of the year that setup was boring in the best possible way. We knew Black Friday would be 5x to 8x a normal day, and we had load-tested the app tier, but we had not replayed the exact mix of promotions marketing planned to run that weekend.
02 What Happened
At 6:02 AM, p99 latency for /api/checkout jumped from a few hundred milliseconds to double-digit seconds. Two weeks earlier we had rebuilt the promotions table during a merchandising change and missed the composite index on (product_id, valid_until). On an ordinary Tuesday the query still came back fast enough that nobody noticed. Under Black Friday concurrency, the planner fell back to a sequential scan on a hot table, requests stacked up behind the database connection pool, and by 6:15 AM customers were seeing spinning checkouts and timeout errors.
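For context, the lookup was roughly the shape below. The query text and the literal values are my reconstruction, not the production SQL; only the filter columns (product_id, valid_until) are from the real incident. The mechanism, though, is exactly this plan change.

```sql
-- Hypothetical reconstruction of the checkout promotions lookup; only the
-- filter on (product_id, valid_until) matches the real query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM promotions
WHERE product_id = 18234
  AND valid_until >= now();

-- With the composite index in place, the planner can use an Index Scan on
-- (product_id, valid_until). Without it, this degrades to a Seq Scan that
-- is survivable at Tuesday traffic and fatal at Black Friday concurrency.
```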
03 Timeline
6:02 AM - Datadog fires on checkout p99 and database wait time
6:08 AM - On-call confirms the app tier is healthy but the pool is saturated
6:19 AM - pg_stat_statements points to the checkout promotions query (the triage query is sketched after the timeline)
6:31 AM - Schema diff shows the missing composite index
6:36 AM - We disable promo stacking to cut pressure while the fix rolls out
6:40 AM - CREATE INDEX CONCURRENTLY starts
6:55 AM - Latency drops back under 200ms and the queue drains
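For anyone who wants the 6:19 step in more detail, this is roughly the pg_stat_statements triage we ran. Column names here follow PostgreSQL 13 and later, where the timing columns are total_exec_time and mean_exec_time; older versions call them total_time and mean_time.

```sql
-- Top queries by total time consumed; the promotions query jumping to the
-- top of this list narrowed "the database is slow" to one statement.
SELECT queryid,
       calls,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       round(total_exec_time::numeric)   AS total_ms,
       left(query, 80)                   AS query_head
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```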
04 The Resolution
We added the missing index with CREATE INDEX CONCURRENTLY, kept promo stacking off until the build finished, and watched checkout recover within minutes. The useful follow-up was less glamorous: schema diff checks in CI, pre-event load tests using production-like discount data, and a separate alert on pool saturation so we do not wait for endpoint latency to become a customer problem before waking someone up.
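For completeness, the fix itself was a one-liner, though the surrounding care matters. A sketch, with an index name chosen here for illustration:

```sql
-- CONCURRENTLY builds the index without taking a write-blocking lock on the
-- table; it cannot run inside a transaction block and builds more slowly
-- than a plain CREATE INDEX. The index name is illustrative.
CREATE INDEX CONCURRENTLY IF NOT EXISTS promotions_product_valid_idx
    ON promotions (product_id, valid_until);

-- A concurrent build that fails partway leaves an INVALID index behind,
-- so check for leftovers before declaring victory:
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
```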
Lessons: What We Learned
01 Treat indexes as part of the feature, not as optional cleanup after the migration ships.
02 Peak-event testing needs production-like data shapes, not just more requests per second.
03 Alert on early infrastructure signals like pool saturation before customers feel the slow path (see the sketch after this list).
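On that third lesson: the richest pool metrics live in the app (node-postgres, for example, exposes waitingCount, idleCount, and totalCount on its Pool), but even a server-side check works as an early signal. A sketch, assuming the checkout service connects as a role I am calling checkout_app, which is hypothetical:

```sql
-- Server-side proxy for pool saturation: how many of the app's connections
-- are busy right now. The role name "checkout_app" is illustrative.
SELECT count(*) FILTER (WHERE state = 'active') AS active,
       count(*)                                 AS total
FROM pg_stat_activity
WHERE usename = 'checkout_app';
```

Paging when active sits near the pool ceiling for more than a minute catches this failure mode while it is still an infrastructure problem rather than a checkout problem.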
What I'd Do Differently
We treated "migration succeeded" as the end of the work. It should have been the start of verification. If we had replayed one realistic Black Friday checkout path in staging after that schema change, this story probably never would have existed.