The CronJob That Quietly Saturated Our Kubernetes Cluster
👤 @kim-on-platform · fintech · 10-30 engineers · 2024
01 · The Setup
This one started as a harmless cleanup task. We had moved our payments platform from Heroku to GKE about six months earlier and were still learning which safety rails Kubernetes gives you only if you ask for them. A developer added a CronJob to purge abandoned payment sessions from Redis. The job itself was simple, but it ran in the full application image, took 7 to 10 minutes on production-sized data, and nobody spent much time thinking about what would happen if it overlapped with itself.
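For context, the offending manifest was shaped roughly like the sketch below. The name, namespace, image, and command are illustrative rather than the real ones; the important part is how much it leaves to Kubernetes defaults.

```yaml
# Roughly the shape of the original CronJob (names, image, and command are illustrative).
# Everything not written here falls back to Kubernetes defaults: concurrencyPolicy
# defaults to Allow, and there are no resource requests/limits, no TTL on finished
# Jobs, and no deadline on a run that gets stuck.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: session-cleanup          # hypothetical name
  namespace: payments            # hypothetical namespace
spec:
  schedule: "* * * * *"          # the typo: every minute, not every hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: registry.example.com/payments-app:latest   # the full application image
              command: ["bin/purge-abandoned-sessions"]          # illustrative entrypoint
```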
02 · What Happened
The schedule was set to every minute instead of every hour. With no concurrencyPolicy set, Kubernetes falls back to the default of Allow, and with no resource requests or limits on the pods, nothing stopped new runs from starting while old ones were still grinding, so cleanup jobs piled up all weekend. By Monday we found dozens of active cleanup pods competing with the API for CPU and memory, plus a mountain of completed Job objects that made the whole namespace noisy to inspect. We never lost the control plane or "the cluster," but we absolutely made the cluster a bad place to run payment traffic for a few hours.
03 · Timeline
Friday 5:40 PM - Cleanup CronJob is merged and deployed
Saturday morning - Jobs begin overlapping, but no alert fires
Sunday night - Node memory pressure and pod restarts begin to climb
Monday 9:12 AM - Deploy is stuck in Pending and the on-call starts digging
Monday 9:25 AM - We spot the runaway CronJob and suspend it
Monday 9:40 AM - Active jobs are drained, extra nodes are added, and service stabilizes
04 · The Resolution
We suspended the CronJob, deleted the active Jobs, temporarily scaled the node pool up a notch to buy breathing room, and pushed a dedicated cleanup image that started quickly and did one thing. The replacement job ran hourly, used cursor-based scans instead of a giant key sweep, set concurrencyPolicy: Forbid, and cleaned up finished Jobs automatically. After that we added namespace quotas and alerts for abnormal Job growth so the next typo pages us long before Monday standup.
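The replacement manifest looked roughly like the sketch below. The image name and the specific numbers are illustrative rather than the literal values we shipped, but the fields are the ones that mattered: an hourly schedule, concurrencyPolicy: Forbid, a TTL on finished Jobs, a deadline on stuck runs, and explicit resource requests and limits.

```yaml
# Sketch of the replacement CronJob (image name and numbers are illustrative).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: session-cleanup
  namespace: payments
spec:
  schedule: "0 * * * *"              # hourly, the schedule we meant all along
  concurrencyPolicy: Forbid          # never start a run while one is still going
  startingDeadlineSeconds: 300       # skip a missed run instead of queueing it
  successfulJobsHistoryLimit: 3      # keep the namespace readable
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600  # garbage-collect finished Jobs automatically
      activeDeadlineSeconds: 1200    # kill a wedged run instead of letting it linger
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: registry.example.com/session-cleanup:1.0   # small, dedicated image
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 500m
                  memory: 256Mi
```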
Lessons · What We Learned
01. If a CronJob can overlap with itself, set the concurrency policy explicitly instead of trusting defaults.
02. Cleanup work deserves its own small image and resource boundaries, not the full application container.
03. Cluster guardrails like quotas, TTL cleanup, and alerts are how you survive ordinary mistakes; a sketch of the ones we added follows this list.
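The guardrails from lesson 03 looked roughly like the objects below. This assumes kube-state-metrics and the Prometheus Operator are running in the cluster, and every name and threshold here is illustrative rather than the exact value we chose.

```yaml
# Namespace quota so runaway Jobs hit a ceiling instead of starving the API pods.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota            # hypothetical name
  namespace: payments
spec:
  hard:
    count/jobs.batch: "20"        # hard cap on Job objects in the namespace
    requests.cpu: "8"
    requests.memory: 16Gi
---
# Alert on abnormal Job growth. Assumes kube-state-metrics (source of
# kube_job_status_active) and the Prometheus Operator (PrometheusRule CRD).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: job-growth-alerts
  namespace: payments
spec:
  groups:
    - name: jobs
      rules:
        - alert: TooManyActiveJobs
          expr: sum(kube_job_status_active{namespace="payments"}) > 5
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "More Jobs running in payments than any CronJob should need"
```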
What I'd Do Differently
We reviewed application code carefully and treated YAML as paperwork. That was backwards. The bad schedule, missing concurrency policy, and missing resource settings were the entire incident.