🏗️ System Design

A Five-Person Startup Used Kubernetes for the Boring Parts and Built a Separate Control Plane for Everything Humans Waited On

👤 Crimson-Sentry-61a · Early-stage startup infrastructure · <10 engineers · 2022

01 The Setup

We were a five-person startup building a platform for browser-based applications that give each user a dedicated backend process. That meant we had two very different kinds of operational work. One set of services needed to be boring, redundant, and continuously deployable: the API server, container registry, controller, log collector, DNS services, and metrics collection. The other set of workloads was intensely interactive. We were starting processes on demand for people who were actively waiting on them. Even a minute of downtime was visible because when our product broke, our customers' products broke for their users.

We did multiple deploys per day, wanted rolling deploys and high availability for the long-running control plane, and were willing to pay managed-service premiums when it saved engineering time. We ran Kubernetes through GKE, kept important state outside the cluster in managed Postgres and blob storage, and used Vercel for static sites because the opportunity cost of infrastructure work was high for a team our size.

02 What Happened

Early on, we did what a lot of small infrastructure startups are tempted to do: we tried Kubernetes for almost everything. It was useful for proving out the system, including some of the ephemeral containers that sat on the customer-facing path. But we quickly found that the problems Kubernetes is designed to solve were not the problems our most latency-sensitive workloads had.

The mental model that finally clarified the decision was to separate Kubernetes into the kinds of complexity it brings with it. It is a distributed control-loop framework, a container orchestrator, and a cloud-resource interface. All three are valuable when the job is to keep long-running services healthy, redundant, and declaratively configured. They are much less attractive when a human is waiting for a container to start right now.

We kept running into that mismatch. Kubernetes abstracts a pool of machines into one scheduling surface, but the abstraction leaks as soon as node locality, storage, or fast inter-process communication matter. Persistent volumes can interact badly with rolling deploys. Custom resources and operators add another control loop and another layer of indirection. YAML and Helm introduce their own accidental complexity. And even when the platform is behaving correctly, there is still a gap between action and effect, because you are asking a control loop to converge on a desired state rather than directly starting work. That gap was acceptable, even useful, for the boring parts of the platform. It was the wrong trade-off for interactive, session-lived compute. Our rule became simple: a human should never wait for a pod.

Once we accepted that, the architecture got cleaner. We used Kubernetes for the long-running services that benefited from replicas, rolling updates, CronJobs, and declarative infrastructure. We deliberately avoided large parts of the ecosystem that would have increased the operational surface area for a tiny team: hand-written YAML, Helm, service meshes, most operators, and putting irreplaceable state inside the cluster. Instead of treating Kubernetes as the universal runtime, we narrowed it to the slice we actually needed. For the ephemeral backend processes that customers directly waited on, we built a separate orchestrator in Rust, called Plane, designed specifically for quickly scheduling and running those workloads.

Around Kubernetes we kept the rest of the stack equally opinionated: GKE rather than self-managing the control plane, Pulumi in TypeScript rather than raw YAML, Caddy for certificate automation instead of cert-manager, managed Postgres outside the cluster, and external blob storage for durable data. The decision was not 'Kubernetes yes' or 'Kubernetes no.' It was 'Kubernetes only where its complexity matches the job.'
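The write-up does not include the team's actual Pulumi code, but as a minimal sketch of what 'Pulumi in TypeScript rather than raw YAML' can look like for one of the long-running services, a Deployment and Service might be declared like this. The project, image, ports, and replica count here are placeholders invented for illustration, not values from the real stack:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Placeholder labels and image; each long-running service (API server,
// registry, controller, log collector, DNS, metrics) would get a block like this.
const labels = { app: "api-server" };

const deployment = new k8s.apps.v1.Deployment("api-server", {
    spec: {
        replicas: 2, // redundancy for the long-running control plane
        selector: { matchLabels: labels },
        template: {
            metadata: { labels },
            spec: {
                containers: [
                    {
                        name: "api-server",
                        image: "gcr.io/example-project/api-server:v123", // placeholder image
                        ports: [{ containerPort: 8080 }],
                    },
                ],
            },
        },
    },
});

// A ClusterIP service in front of the deployment; rolling updates come from
// the Deployment's default RollingUpdate strategy rather than anything hand-rolled.
const service = new k8s.core.v1.Service("api-server", {
    spec: {
        selector: labels,
        ports: [{ port: 80, targetPort: 8080 }],
    },
});
```

The point of the sketch is the shape: replicas and rolling updates are declared once, in ordinary TypeScript that can be type-checked and refactored like any other code, and the interactive path stays off this layer entirely.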

03 The Resolution

The final design was a split architecture. Kubernetes stayed in production for the long-running pieces that benefit from replicas, rolling deploys, CronJobs, and declarative configuration: the API server, container registry, controller, log collection, DNS services, and metrics systems. Critical durable state stayed outside the cluster in managed Postgres and blob storage, which reduced the blast radius of operational mistakes for a small team. The latency-sensitive, per-session compute path moved to a purpose-built Rust orchestrator instead of staying on Kubernetes. There was no single outage to recover from here. The resolution was architectural: stop forcing one platform to serve two incompatible workloads, and reduce the set of Kubernetes features the team needed to understand and operate. For a five-person company deploying multiple times per day, that boundary was the reliability win.

Lessons: What We Learned

01

In a small team, the right question is not whether Kubernetes is good or bad. It is whether its control-loop, orchestration, and cloud-abstraction complexity actually matches the workload you are putting on it.

02

If a human is waiting for a container to start, pod-scheduling latency and reconciliation behavior become product latency. Interactive workloads deserve a runtime designed for fast startup, not just robust convergence. (A rough sketch of the two request paths follows after these lessons.)

03

A narrow, explicit Kubernetes subset is a reliability strategy. Deciding up front that you will avoid Helm, most operators, handwritten YAML, and in-cluster durable state can remove more risk than adding another abstraction layer would.

04

Managed services are often the correct small-team trade-off. Paying GKE, managed Postgres, or Vercel to handle undifferentiated infrastructure work can be cheaper than spending founder or early-engineer time on it.

05

Separate long-running control-plane services from customer-facing ephemeral compute. The operational guarantees and failure modes are different enough that forcing them onto one substrate creates accidental complexity.
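As a postscript to lesson 02, here is a deliberately simplified sketch of the two request paths. Both interfaces are hypothetical stand-ins invented for illustration; neither matches the Kubernetes client API nor Plane's real API. The only point is where the human-visible waiting happens:

```typescript
// Hypothetical interfaces for illustration only.
interface DeclarativeScheduler {
  apply(spec: { name: string; image: string }): Promise<void>;    // record desired state
  waitUntilReady(name: string, timeoutMs: number): Promise<void>; // wait for the control loop to converge
}

interface SessionOrchestrator {
  // Start (or reuse) a backend for this session and return it once reachable.
  connect(sessionId: string, image: string): Promise<{ url: string }>;
}

const IMAGE = "registry.example.com/session-backend:latest"; // placeholder image

// Path 1: convergence. Scheduling, image pull, container start, and readiness
// checks all happen while the user watches a spinner.
async function startViaControlLoop(scheduler: DeclarativeScheduler, sessionId: string) {
  const started = Date.now();
  await scheduler.apply({ name: `session-${sessionId}`, image: IMAGE });
  await scheduler.waitUntilReady(`session-${sessionId}`, 60_000);
  console.log(`user waited ${Date.now() - started} ms for a pod`);
}

// Path 2: direct start. The orchestrator's entire job is to get this one
// process running and reachable as fast as possible.
async function startViaSessionOrchestrator(orchestrator: SessionOrchestrator, sessionId: string) {
  const started = Date.now();
  const { url } = await orchestrator.connect(sessionId, IMAGE);
  console.log(`backend at ${url} after ${Date.now() - started} ms`);
}
```

In the first path the user's wait is the sum of every reconciliation step; in the second, that wait is the metric the runtime is built around.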