Podo Stack

Kubelet eviction: the logic that kills the wrong pod

Ilia Gusev — Fri, 19 Jun 2026 14:01:51 GMT

The postmortem started with a question nobody could answer: why did the payment service die while the log shipper lived? The node had run out of memory, that part was clear. But the thing that got killed was the one workload we'd actually marked as important, and the thing that survived was a BestEffort DaemonSet whose entire job was to tail files and forward them. We'd spent a sprint setting requests and limits on the payment pods precisely so they'd be protected. They weren't. The log shipper had no requests at all, was using a few hundred megabytes, and sailed through the whole event untouched while the payment pod took an OOM kill from the kernel before the kubelet ever printed an eviction line.

That gap - between what we thought protected a pod and what actually decided its fate that night - is the whole subject. There are two completely separate mechanisms that can kill a pod under memory pressure, they run on different clocks, and the one that fires first is often not the one you tuned for.

Two thresholds: soft and hard

The kubelet watches the node and evicts pods before the node becomes unusable. It does this off a set of eviction signals, and you configure two families of thresholds against those signals.

Hard thresholds fire immediately. When the signal crosses the line, the kubelet picks a pod and kills it with no grace period - the pod's terminationGracePeriodSeconds is ignored, it gets effectively zero seconds to clean up. The default memory.available<100Mi is a hard threshold on most distributions. The point of hard eviction is to act fast enough that the node doesn't fall over entirely, which means it can't afford to wait for graceful shutdown.

Soft thresholds fire only after the signal has stayed over the line for a configured grace period. You set them as a pair: eviction-soft gives the threshold, eviction-soft-grace-period gives how long it has to hold before the kubelet acts. There's also eviction-max-pod-grace-period, which caps the graceful shutdown the evicted pod gets. So a soft eviction is the polite one: the signal degrades, it stays degraded for, say, ninety seconds, and only then does the kubelet start a graceful termination that itself respects a bounded grace window. The intent is to catch slow leaks before they turn into a hard-threshold emergency.

Where I've seen this go wrong is a node configured with only hard thresholds, or with a soft grace period set so long the node is already in trouble before it expires. The cluster in the postmortem ran memory.available<100Mi with no soft tier above it, so every memory event was a no-grace kill and we'd quietly given up any chance to drain a leaking pod cleanly.
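To make the two tiers concrete, here is a toy model of the decision for a single signal. It is a sketch of the behavior described above, not kubelet code: the class names, the 300Mi soft threshold and the 90-second grace period are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    hard_mi: float       # e.g. memory.available<100Mi
    soft_mi: float       # hypothetical soft tier, e.g. memory.available<300Mi
    soft_grace_s: float  # how long the soft threshold must hold before acting


class EvictionDecider:
    """Toy model of the hard/soft decision for one eviction signal."""

    def __init__(self, thresholds: Thresholds) -> None:
        self.t = thresholds
        self.soft_since: Optional[float] = None  # when the soft line was first crossed

    def observe(self, now_s: float, available_mi: float) -> str:
        if available_mi < self.t.hard_mi:
            return "hard-evict"              # immediate, no grace period
        if available_mi < self.t.soft_mi:
            if self.soft_since is None:
                self.soft_since = now_s      # start the soft clock
            if now_s - self.soft_since >= self.t.soft_grace_s:
                return "soft-evict"          # graceful, bounded termination
            return "waiting"
        self.soft_since = None               # signal recovered, reset the clock
        return "ok"


decider = EvictionDecider(Thresholds(hard_mi=100, soft_mi=300, soft_grace_s=90))
samples = [(0, 500), (10, 250), (60, 250), (100, 250), (110, 50)]
print([decider.observe(t, avail) for t, avail in samples])
# ['ok', 'waiting', 'waiting', 'soft-evict', 'hard-evict']
```

With only the hard tier configured, the middle three states disappear and every event jumps straight from ok to a zero-grace kill.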

The four signals, and the two that fool you

The kubelet watches four signals, and two of them surprise people by what they don't do.

memory.available is the one everyone knows. It's not just "free RAM" - the kubelet computes it from the cgroup's working set, deliberately excluding reclaimable page cache, so it tracks memory the kernel can't easily get back. When this drops below the threshold, the node goes MemoryPressure.
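As a rough sketch of that calculation: the Kubernetes docs describe the working set as cgroup memory usage minus inactive file cache, and memory.available as capacity minus that working set. The function name and the sample numbers below are mine, and the units are whatever you pass in.

```python
def memory_available(capacity: int, usage: int, inactive_file: int) -> int:
    """Approximate the kubelet's memory.available signal.

    working_set = usage - inactive_file (clamped at zero)
    available   = capacity - working_set
    """
    working_set = max(usage - inactive_file, 0)
    return capacity - working_set


# Hypothetical node, values in MiB: 16 GiB total, 15.5 GiB "used",
# of which 3 GiB is inactive page cache the kernel can drop.
capacity_mi, usage_mi, inactive_file_mi = 16384, 15872, 3072

print("naive free:      ", capacity_mi - usage_mi)                                   # 512
print("memory.available:", memory_available(capacity_mi, usage_mi, inactive_file_mi))  # 3584
```

The gap between the two numbers is why a node that looks nearly full in a naive "free RAM" view can still be well clear of the eviction threshold.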

nodefs.available is free space on the filesystem the kubelet uses for volumes and pod-level scratch (logs, emptyDir).

imagefs.available is free space on the filesystem the container runtime uses for images and writable layers. On many setups these are the same disk, but they can be split, and the eviction behavior differs: low nodefs evicts pods, low imagefs first triggers image garbage collection before it starts evicting. Both also have an inodes variant (nodefs.inodesFree, imagefs.inodesFree) that catches the case where you've run out of inodes long before you've run out of bytes - a pile of tiny files will do that, and it's a genuinely confusing page when df shows free space and the node is still under disk pressure.

pid.available looks like it belongs with the others but behaves differently. When the node runs low on process IDs, the kubelet sets PIDPressure on the node, which taints it so the scheduler stops placing new pods there. The kubelet can also evict on this signal, but PIDs have no requests, so per the Kubernetes docs it ranks victims by pod priority rather than by who is actually burning the PIDs. So PID pressure is mostly a scheduling brake, and a blunt eviction trigger at best. A fork bomb in one pod will mark the node unschedulable, but there's no guarantee the eviction logic picks that pod first; you're relying on the pod's own PID limit to contain it.

How it ranks the pods

Once the kubelet decides to evict, it has to choose a victim, and the order is where our payment-service postmortem went sideways.

The first sort key is QoS class. BestEffort pods - no requests, no limits - go first. Then Burstable pods, which set at least one request or limit but don't meet the Guaranteed bar. Guaranteed pods, where requests equal limits on every container, go last. That's the design: you declare a pod important by giving it requests equal to limits, and the eviction logic honors that by killing it only as a last resort.

But QoS is the first key, not the only one. Within a class, the kubelet ranks by how far a pod's usage of the pressured resource exceeds its request. A Burstable pod sitting way above its memory request is a more attractive victim than one near its request, and pod priority feeds in too. The practical consequence is that a pod with no request for the pressured resource is treated as exceeding a request of zero by its entire usage, so a BestEffort pod counts as over its request the moment it uses anything, even when it's barely using anything in absolute terms. The ranking is doing something reasonable; it just isn't "kill whoever is using the most".

Which is exactly how the log shipper survived. It was BestEffort, so by class it should have died first - but the kubelet never got to run its ranking, because the kernel got there first.
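The ordering described in this section can be sketched as a sort key. This follows the article's description rather than the kubelet source: the Pod fields and sample pods are made up, and putting priority ahead of usage-over-request is my assumption about how the two tie-breakers combine.

```python
from dataclasses import dataclass
from typing import List

# Lower rank is evicted earlier.
QOS_RANK = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


@dataclass
class Pod:
    name: str
    qos: str
    priority: int
    request_mi: int  # memory request, 0 if unset
    usage_mi: int    # current memory usage


def eviction_order(pods: List[Pod]) -> List[str]:
    """Return pod names, first victim first."""
    def sort_key(pod: Pod):
        overage = pod.usage_mi - pod.request_mi
        # QoS class, then lower priority first, then biggest overage first.
        return (QOS_RANK[pod.qos], pod.priority, -overage)

    return [p.name for p in sorted(pods, key=sort_key)]


pods = [
    Pod("payment", "Guaranteed", 0, 1024, 1000),
    Pod("worker-b", "Burstable", 0, 512, 600),
    Pod("log-shipper", "BestEffort", 0, 0, 300),
    Pod("worker-a", "Burstable", 0, 512, 900),
]
print(eviction_order(pods))
# ['log-shipper', 'worker-a', 'worker-b', 'payment']
```

By this ranking the log shipper goes first and the payment pod last, which is the outcome we had tuned for and not the one we got.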

The race the kubelet can lose

The kubelet's eviction loop is not the only thing that kills pods, and it's not even the fast one. The Linux kernel OOM killer runs in kernel space, on its own schedule, and it fires when a cgroup or the node hits a hard memory wall. The kubelet's eviction loop polls signals on an interval (the housekeeping period, on the order of seconds) and then has to select, signal, and wait. The kernel OOM killer fires in the time it takes a single allocation to fail.

So under a fast allocation spike - not a slow leak, a sudden burst - the kernel can OOM-kill a process inside a pod before the kubelet's next poll even notices memory.available dropped. That's what the dmesg timestamps showed us that night: the kernel's Killed process line landed almost a full second before the kubelet logged anything at all. When that happens, the kubelet didn't choose the victim. The kernel did, using its own oom_score, which Kubernetes biases by QoS (BestEffort gets a high, easy-to-kill oom_score_adj; Guaranteed gets a low one), but the kernel's accounting is per-cgroup and per-process, not "which pod is least important globally". A container that exceeds its own memory limit gets OOM-killed by its cgroup regardless of node-level pressure or its neighbors' importance.

This is why a Guaranteed pod can still die to OOM. Guaranteed protects you from kubelet eviction - it's last in that ranking. It does not protect you from your own limit. If a Guaranteed pod allocates past its memory limit, the cgroup OOM killer kills a process in that pod immediately, and the kubelet's nice QoS-aware ranking never enters the picture. In our incident the payment pod hit a request burst, its working set crossed its limit, and the cgroup OOM killer took it out while the kubelet was still mid-poll. The log shipper was never close to its (nonexistent) limit, so nothing touched it. The protection we'd built was real but aimed at the wrong killer.
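For the QoS bias mentioned above, the Kubernetes docs give fixed oom_score_adj values for Guaranteed and BestEffort and a formula for Burstable based on the memory request as a share of node capacity. A sketch of that formula as documented; the kubelet source may clamp the Burstable range slightly differently, and the function name is mine.

```python
def oom_score_adj(qos: str, mem_request_bytes: int, node_capacity_bytes: int) -> int:
    """oom_score_adj by QoS class, following the formula in the Kubernetes docs."""
    if qos == "Guaranteed":
        return -997   # hard for the kernel to pick
    if qos == "BestEffort":
        return 1000   # first in line for the kernel
    # Burstable: the larger the request relative to the node, the lower the score.
    adj = 1000 - (1000 * mem_request_bytes) // node_capacity_bytes
    return min(max(2, adj), 999)


GIB = 1024 ** 3
print(oom_score_adj("Guaranteed", 4 * GIB, 16 * GIB))   # -997
print(oom_score_adj("Burstable", 4 * GIB, 16 * GIB))    # 750
print(oom_score_adj("BestEffort", 0, 16 * GIB))         # 1000
```

None of this applies when a container crosses its own limit. That kill happens inside the pod's cgroup, where the score only decides which process in that cgroup dies.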

Reclaim, and why the node goes NotReady

Two more mechanics close the loop on a bad memory event.

eviction-minimum-reclaim stops the kubelet from evicting one pod, dropping just barely back over the line, and then evicting again thirty seconds later when it tips back under. For each signal you can specify how much headroom to reclaim past the threshold, so a single eviction round frees enough that the node sits comfortably above the line for a while instead of flapping at the edge. Without it, a node under steady pressure can churn through pod after pod, each eviction buying only seconds.
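A small arithmetic sketch of the difference. It assumes the kubelet keeps evicting in ranked order until the signal reaches threshold plus minimum reclaim; the function and all the numbers are illustrative.

```python
from typing import List, Tuple


def evict_until_target(available_mi: int, victims_mi: List[int],
                       threshold_mi: int, minimum_reclaim_mi: int = 0) -> Tuple[int, int]:
    """Evict pods in order until available >= threshold + minimum reclaim.

    Returns (pods evicted, memory available afterwards).
    """
    target = threshold_mi + minimum_reclaim_mi
    evicted = 0
    for usage in victims_mi:
        if available_mi >= target:
            break
        available_mi += usage  # evicting the pod frees its memory
        evicted += 1
    return evicted, available_mi


victims = [20, 200, 150]  # memory freed per pod, in eviction order

print(evict_until_target(90, victims, threshold_mi=100))
# (1, 110): back over the line by 10Mi, one small allocation from flapping
print(evict_until_target(90, victims, threshold_mi=100, minimum_reclaim_mi=200))
# (2, 310): one round, and the node sits well clear of the threshold
```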

The NotReady cascade is the failure mode that turns one hot node into a cluster event. When a node is under sustained pressure, the kubelet can get starved - it's competing for the same exhausted memory or a pegged disk - and if it can't post its status to the API server in time, the node goes NotReady. Once a node is NotReady past the eviction timeout, the control plane starts evicting (rescheduling) its pods elsewhere. Those pods land on other nodes, and if the root cause was a workload that leaks everywhere it runs, the next node starts climbing toward pressure too. One node's local problem becomes a rolling reschedule that walks across the cluster. The node-level eviction logic was trying to save one node; the control-plane reaction to NotReady can spread the load that's killing it.

Two things called "eviction"

The word "eviction" gets used for two unrelated mechanisms, and the first time that bit me it cost an afternoon of confusion.

Everything above is node-pressure eviction: the kubelet, acting locally, killing pods to save its node. It doesn't consult PodDisruptionBudgets, doesn't ask the control plane, and under hard thresholds doesn't even honor graceful termination. It's a survival reflex, and it'll violate your PDB without hesitation, because the alternative is the whole node going down.

The other one is API-initiated eviction - the Eviction API, the thing kubectl drain calls and what the control plane uses for voluntary disruptions. This one does respect PodDisruptionBudgets: if evicting a pod would take a deployment below its minAvailable, the API call is refused. It's the polite, planned path for node maintenance and autoscaler scale-down.

The expensive assumption is that a PDB protects against node pressure. It doesn't. A PDB is a contract for voluntary disruptions - drains, upgrades, scale-down. When a node is out of memory, the kubelet evicts past the PDB because there's no negotiating with a node that's about to lock up. If you need a workload to survive node pressure, the lever is QoS and limits, not a disruption budget.
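The asymmetry fits in a few lines. This is a toy model with made-up function names, not the Eviction API: it only captures the one rule that matters here, for a budget expressed as minAvailable.

```python
def api_eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """Voluntary path (drain, scale-down): refused if it would break the budget."""
    return healthy_pods - 1 >= min_available


def node_pressure_eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """Kubelet path: the budget is never consulted."""
    return True


# Three healthy replicas, PDB with minAvailable: 3
print(api_eviction_allowed(3, 3))            # False: the drain waits
print(node_pressure_eviction_allowed(3, 3))  # True: the kubelet evicts anyway
```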

Which dial to turn for which killer

Everything I'd tune here starts from which killer I'm defending against, because the two want different settings.

If a workload must not die to kubelet eviction, the move is requests equal to limits so the pod is Guaranteed, which puts it last in the kubelet's ranking. The catch we learned the hard way: that buys protection from node-pressure eviction, not from the cgroup OOM killer, so the same setting arms the other killer if you size the limit too low.

Surviving the OOM killer is a different lever entirely - the limit itself, high enough to cover real peak working set and not just the steady state, because the cgroup kills the instant you cross it. A Guaranteed pod with a too-tight limit dies more readily than a Burstable one with a generous limit. Profile the peak, don't guess it.

Then there's the node itself. Reserve resources with system-reserved and kube-reserved so the kubelet and the OS aren't fighting pods for the last megabytes. A node that hands every byte to pods is a node where the kubelet starves and goes NotReady under pressure, which is exactly how a local event turns into the cascade.

And above the hard tier, configure a soft one, so slow leaks get a graceful eviction with a real grace period before anything hits the no-mercy threshold. Hard-only configs throw away the chance to drain a pod cleanly.

The patterns that turn one hot node into an incident

These are the ones I keep running into, roughly in order of how often they catch a team:

Trusting a PDB to survive node pressure. It governs drains and voluntary disruption only; the kubelet evicts straight through it when the node is starving.
Then there's setting limits from steady-state usage instead of peak. The cgroup OOM killer fires on the peak, and a Guaranteed pod with a snug limit dies to its own cgroup while you're admiring its QoS class.
Leaving critical workloads BestEffort because "they don't use much" puts them first in line for kubelet eviction and hands the kernel a high oom_score_adj - cheap to kill on both paths.
Hard thresholds with no soft tier above them turn every memory event into a zero-grace kill, and you never get a clean drain out of a slow leak.
Forget eviction-minimum-reclaim and a pressured node flaps - evict, tip back under, evict again - churning pods for seconds of relief each.
Hand the node's entire memory to pods with nothing reserved for system and kubelet, and the kubelet starves, misses its status post, the node goes NotReady, and the control plane reschedules the leak onto the next node.
Last, assuming low imagefs and low nodefs behave the same. Low image filesystem triggers image GC first; low node filesystem goes straight to evicting pods. And both can hit on inodes while df still shows free bytes.

We left that postmortem having moved one number: the payment pod's memory limit went up by 40%, and the next request spike rode under it instead of through it. "Important" had never been a single dial. We'd set QoS thinking we'd bought protection, and we had - against the kubelet. The kernel doesn't read your QoS class the way the kubelet does, it reads your cgroup, and it acts in microseconds where the kubelet acts in seconds. Two killers, two clocks, and the disruption budget we'd also been counting on was never written to show up to either fire.

PgBouncer modes: why your pool either leaks or deadlocks

Ilia Gusev — Wed, 17 Jun 2026 14:03:15 GMT

At 02:11 the pager fired. The alert linked a dashboard that made no sense: CPU on the primary pinned at 100%, but the slow-query log was empty - not a single query over 40ms. The app dashboards showed timeouts everywhere, and pg_stat_activity had a little under five thousand rows. Almost all of them sat in idle or idle in transaction. The database wasn't slow. It was busy being five thousand processes, each holding ~10MB of private memory, fighting over the same handful of CPUs to do almost nothing. Someone had bumped the app's per-pod connection count "to handle the traffic spike", the deploy fanned out to forty pods, and forty times a generous local pool landed on a box that was happy up to about three hundred backends. We'd built a denial-of-service against ourselves with a config value.

That night is the whole argument for a connection pooler in one screenshot. Postgres doesn't degrade gracefully when you over-connect it - it falls off a cliff, and the cliff is made of operating-system processes.

Why a Postgres connection is expensive

Every connection to Postgres is a full OS process. The postmaster fork()s a new backend for each one - not a thread, not a lightweight coroutine, a process with its own address space. That fork has a fixed cost (catalog cache warm-up, prepared-statement state, per-backend memory) and a standing cost that never goes away while the connection lives. Reserve roughly 5-10MB of backend-private memory per idle connection before it has executed anything interesting, and more once work_mem allocations come into play for sorts and hashes.

The standing cost is the part that bites. A thousand idle connections aren't free just because they're idle. They're a thousand entries the scheduler considers, a thousand snapshots that GetSnapshotData has to scan when a transaction takes its snapshot, a thousand slots in shared structures. Older Postgres versions had a near-linear relationship between connection count and the cost of taking a snapshot, so adding idle connections slowed down the active ones. PG 14 reworked GetSnapshotData to scale with active rather than total connections and took a lot of the sting out of that specific path, but the per-process memory and scheduler pressure are still real on every version.

The practical ceiling is lower than people expect. A box that runs eight or sixteen vCPUs is genuinely happy somewhere around a few hundred active backends, and the useful number of connections doing CPU work at once is close to the core count. Set max_connections = 5000 and you haven't bought headroom, you've bought a loaded gun. The fix isn't a bigger number. It's stopping the app from holding a backend it isn't using, which is exactly what a pooler does: a small set of long-lived server connections, fronted by a cheap front door that thousands of clients can knock on.
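The memory side of that cliff is plain arithmetic. A back-of-envelope sketch using the 5-10MB per idle backend range from above; the flat per-backend cost is a simplification and the function name is mine.

```python
def idle_backend_memory_gib(connections: int, mb_per_backend: float) -> float:
    """Memory held by idle backends, in GiB, at a flat per-backend cost."""
    return connections * mb_per_backend / 1024


for connections in (300, 5000):
    low = idle_backend_memory_gib(connections, 5)
    high = idle_backend_memory_gib(connections, 10)
    print(f"{connections} backends: {low:.1f} to {high:.1f} GiB before any query runs")
```

At three hundred backends that is a few GiB. At five thousand it is tens of GiB of private memory held by processes that are doing nothing.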

The three pool modes, and what each one breaks

PgBouncer is a single-process, event-driven proxy. It speaks the Postgres wire protocol, holds a pool of server connections open, and hands them to clients. The whole behavior hinges on one setting, pool_mode, and the three values trade safety for efficiency in ways that are easy to get wrong.

Session pooling is the safe default and the one that saves you the least. A client connects, PgBouncer assigns it a server connection, and the client keeps that server connection until the client disconnects. Everything a raw Postgres connection supports works, because from the server's point of view there's still one client per backend for the whole session. The catch is obvious: if your app opens a connection at startup and holds it for the pod's lifetime, session pooling gives you no multiplexing at all. You've added a hop for nothing. Session mode helps when clients connect, do a burst of work, and disconnect.

Transaction pooling is the mode everyone actually wants and the one that quietly corrupts data when used wrong. A server connection is assigned to a client only for the duration of a transaction. The moment the transaction commits or rolls back, that backend goes back into the pool and the next client's transaction can land on it. This is what lets twenty thousand clients share fifty server connections. It works because most web requests are short transactions with long idle gaps between them.

What breaks in transaction mode is everything that assumes session continuity across transactions. A server-side SET you expected to persist is gone, because the next query runs on a different backend. The same dropped continuity kills LISTEN/NOTIFY, since the channel subscription lives on a backend you no longer own. Session-level advisory locks (pg_advisory_lock) are worse than gone - they're a trap, because you acquire the lock on one backend and the unlock call may land on another, leaking the lock forever. WITH HOLD cursors that outlive a transaction don't survive either. And historically the nastiest one was protocol-level prepared statements: you PREPARE on one backend and EXECUTE on another that never saw the prepare. The dangerous part is that none of these throw a loud "you can't do this in transaction mode" error. They just do the wrong thing, intermittently, depending on which backend the pooler happened to hand you.

The prepared-statement story has actually improved. PgBouncer 1.21 (late 2023) added support for protocol-level prepared statements in transaction mode through max_prepared_statements, which defaulted to 0 (off) in that release, so check what your version ships with. Set it above zero and PgBouncer tracks named prepared statements per client and re-prepares them on whatever backend it routes you to, keeping an LRU cache per server connection. It's a genuinely good feature and it removed the single most common transaction-mode footgun. Two caveats worth knowing: it only covers protocol-level prepares (libpq PQprepare and the extended query protocol), not text-level PREPARE foo AS ... SQL, which PgBouncer can't see; and it does nothing for the other session features. SET, LISTEN, advisory locks, and WITH HOLD cursors are still broken in transaction mode no matter what max_prepared_statements is.
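The session-state failure is easy to reproduce in a toy model. Nothing below talks to PgBouncer or Postgres: Backend and TransactionPooler are hypothetical classes, and the round-robin hand-out is an assumption that makes the example deterministic, where a real pooler hands you whichever backend happens to be free.

```python
import itertools


class Backend:
    def __init__(self, name: str) -> None:
        self.name = name
        self.settings = {"statement_timeout": "0"}  # server default


class TransactionPooler:
    """Toy pooler: a backend is assigned for one transaction, then released."""

    def __init__(self, size: int) -> None:
        self.backends = [Backend(f"backend-{i}") for i in range(size)]
        self._next = itertools.cycle(self.backends)

    def run_transaction(self, fn):
        backend = next(self._next)  # held only for this transaction
        return fn(backend)          # back in the pool once fn returns


def set_timeout(backend: Backend) -> str:
    backend.settings["statement_timeout"] = "5s"  # a session-level SET
    return backend.name


def read_timeout(backend: Backend):
    return backend.name, backend.settings["statement_timeout"]


pool = TransactionPooler(size=2)
print(pool.run_transaction(set_timeout))   # backend-0
print(pool.run_transaction(read_timeout))  # ('backend-1', '0'): the SET is not there
```

No error is raised anywhere. The second transaction simply runs with the default, which is the quiet kind of wrong described above.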

Statement pooling is the most aggressive and the rarest. The server connection is released back to the pool after every single statement, which means multi-statement transactions are simply forbidden - PgBouncer enforces autocommit and errors if you try to open a transaction. It exists for workloads that are genuinely one-shot per statement, and almost nobody runs it on purpose. If you find yourself in statement mode, it's usually because you copied a config and didn't read it.

pool_size, max_client_conn, and the backends in between

Three numbers, three different things, and the deadlock comes from conflating them.

The cheap one is max_client_conn: how many clients can be connected to PgBouncer at once. Each idle client connection inside PgBouncer costs a few kilobytes, not a backend, so this goes into the tens of thousands without trouble. It's the front door, and it's wide on purpose.

Where the cost actually lives is default_pool_size (or per-database pool_size), the number of server connections PgBouncer opens per user/database pair - each one a real Postgres backend. The whole point of the pooler is that this stays small, dozens not thousands. I usually start near the CPU core count plus a little and tune from there. The sum of all your pools across all PgBouncer instances has to comfortably fit under the database's max_connections, with room left for reserved_connections and your own superuser sessions when things go wrong at 2am.

Here's a minimal pgbouncer.ini that shows the shape of it:

[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

pool_mode = transaction
max_client_conn = 10000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
max_prepared_statements = 200
server_idle_timeout = 60

Ten thousand clients fan into twenty-five backends per pool. The database sees twenty-five connections doing real work, not ten thousand processes sitting idle.

The deadlock hides in pool_size when a single logical operation needs more than one server connection at the same time. Say each unit of work opens a transaction on appdb, and partway through it opens a second connection to the same pool - a side query, a different ORM session, a "let me just check this other table" call - while still holding the first. Now pool_size = 25 means twenty-five units of work can each grab their first connection, and then all twenty-five sit there waiting for a second connection that will never come, because every backend in the pool is held by a transaction that's blocked waiting for a backend. Classic resource deadlock. The pool drains, cl_waiting climbs, and every request times out at once, which from the outside looks exactly like the database died. It didn't. Your concurrency model needed two backends per unit of work and the pool could only ever hand out one each.
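The shape of that deadlock can be reproduced with a semaphore standing in for the pool. This is a simulation under assumptions, not PgBouncer behavior: the barriers force the worst-case interleaving (everyone takes a first backend before anyone asks for a second), and the timeout stands in for the client giving up.

```python
import threading


def simulate(pool_size: int, units: int, timeout_s: float = 0.2) -> int:
    """Return how many units of work got both backends they needed."""
    if units > pool_size:
        raise ValueError("this sketch needs units <= pool_size")

    pool = threading.Semaphore(pool_size)      # stands in for the server pool
    all_hold_first = threading.Barrier(units)  # forces the worst-case interleaving
    all_tried = threading.Barrier(units)
    lock = threading.Lock()
    completed = 0

    def unit_of_work() -> None:
        nonlocal completed
        pool.acquire()                       # first backend: the open transaction
        all_hold_first.wait()                # everyone now holds one backend
        if pool.acquire(timeout=timeout_s):  # second backend: the side query
            with lock:
                completed += 1
            pool.release()
        all_tried.wait()                     # hold the first until all have tried
        pool.release()

    threads = [threading.Thread(target=unit_of_work) for _ in range(units)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed


print(simulate(pool_size=25, units=25))  # 0: every unit times out waiting
print(simulate(pool_size=50, units=25))  # 25: two backends per unit fit
```

Doubling the pool makes the symptom go away in the simulation, but only because the budget happens to cover two backends per unit. The cheaper fix is usually the code path that holds one connection while asking for another.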

Where to put it

Sidecar (one PgBouncer per app pod or host). Lowest latency, the pooler shares the failure domain of the app, and each instance's pool is tiny. The trap is the multiplication you already saw in the incident: if every one of forty pods runs pool_size = 25, the database sees up to a thousand backends. Sidecar pooling means dividing your global backend budget by the pod count, and that division is the thing teams forget when they scale the deployment.

Central (a dedicated PgBouncer tier in front of the database). One place to reason about the total backend count, one place to fail over, clean math: pool_size is the global pool. The cost is an extra network hop and a new thing that can fall over, so this tier wants its own redundancy. PgBouncer is single-process and single-threaded, so a busy central tier saturates one core and you scale it by running several instances behind a load balancer, each with its own slice of the pool budget.

Per-node (a PgBouncer on every Kubernetes node, app pods connect to localhost). A middle ground that bounds the multiplication to node count instead of pod count, which is usually a much smaller and more stable number. This is where a lot of larger setups land.

There's no single right answer, but there is a right discipline: write down the worst-case total backend count across every PgBouncer instance you run, and confirm it's under max_connections. If you can't compute that number from your config, you're one scale-up away from the 02:11 page.
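That number is a multiplication, and it is worth having as something you can run. A minimal sketch: the function names are mine, and the 300 max_connections and 20 connections of headroom are assumed values chosen to match the incident, not recommendations.

```python
def worst_case_backends(instances: int, pools_per_instance: int,
                        pool_size: int, reserve_pool_size: int = 0) -> int:
    """Upper bound on Postgres backends across every PgBouncer instance.

    pools_per_instance is the number of user/database pairs each instance serves.
    """
    return instances * pools_per_instance * (pool_size + reserve_pool_size)


def per_instance_budget(max_connections: int, headroom: int,
                        instances: int, pools_per_instance: int = 1) -> int:
    """Largest pool_size + reserve_pool_size each pool can have and still fit."""
    return (max_connections - headroom) // (instances * pools_per_instance)


# Forty sidecars, one pool each, pool_size = 25
print(worst_case_backends(40, 1, 25))                       # 1000
print(worst_case_backends(40, 1, 25, reserve_pool_size=5))  # 1200

# What forty sidecars can afford against max_connections = 300 with 20 held back
print(per_instance_budget(300, 20, 40))                     # 7
```

Seven backends per pod, reserve pool included, is a very different number from twenty-five. That gap is the case for moving to a per-node or central tier once the pod count grows.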

Picking a mode without guessing

The way I land on a mode is to start from what the application actually does to a session, then take the loosest mode that doesn't break it.

For the standard web/API shape - short transactions with idle gaps, and either no session features or the few that exist routed elsewhere - transaction mode is the answer almost every time. That's where the multiplexing payoff lives, which is most of the reason to run a pooler at all.

Session mode is what I keep for the apps that genuinely lean on session continuity and can't be changed: heavy LISTEN/NOTIFY, session advisory locks held across transactions, server-side SET that has to persist, WITH HOLD cursors. That's a correctness call, not laziness. You give up multiplexing, but nothing gets silently corrupted.

There's a middle path I've used when ninety percent of traffic is transaction-mode-safe and a thin slice needs sessions: run two pools, or two PgBouncer endpoints with different pool_mode, and point the listener and advisory-lock code at the session-mode one. Mixing modes on purpose beat forcing everything into a single mode every time I tried it.

And the number that catches people - size pool_size from the database's CPU and your concurrency model, never from client count. Client count belongs to max_client_conn, which is nearly free. The expensive number tracks how many queries can usefully run at once, closer to core count than to user count. If a single request grabs two backends, effective parallelism is pool_size / 2, so you either raise pool_size with budget to back it or fix the code so one request holds one backend.

Failure modes that keep showing up

Once a pooler sits in the path, the same handful of failures keep recurring - mostly transaction-mode breakage and pool sizing. I've debugged most of these firsthand.

Session advisory locks under transaction mode. The lock gets taken on one backend and released on another, so it leaks until that backend recycles. What you see is locks that never clear and a pg_locks table slowly filling with orphans nobody can explain.
Then there's the SET that's expected to stick. You SET statement_timeout or a search_path at connect time, it passes in dev because dev only ever has one backend, and in prod it applies to a random backend you don't keep. Wrap session-scoped settings in SET LOCAL inside the transaction instead.
Pre-1.21 PgBouncer paired with an ORM that uses protocol-level prepared statements in transaction mode is its own special pain. The newer Postgres drivers prepare by default, the prepare lands on one backend, the execute on another, and you get "prepared statement does not exist" under load and never in testing. Upgrade and set max_prepared_statements, or disable prepares in the driver.
Sizing pool_size from client count is the one I see most. Someone reads ten thousand users and sets pool_size = 10000, recreating the exact problem the pooler was meant to solve. The pool is the expensive number and it stays small.
Sidecar pooling without dividing the backend budget by pod count gets people on the scale-up: every pod's pool looks reasonable in isolation, and the sum melts the primary. Multiply before you ship.
Turning server_idle_timeout off (setting it to 0) or cranking it far too high leaves the pool holding its full pool_size of backends open indefinitely, even at 3am when traffic is nothing. Idle server connections still cost backend memory; let the pool shrink when it's quiet.
And the assumption that PgBouncer is highly available because it's "just a proxy". It's a single process. If it dies, every client behind it loses the database at once. The pooler needs the same redundancy thinking as the database it fronts.

We never did see a single slow query that night. The database hadn't been doing too much real work - it was drowning in processes doing nothing, five thousand backends each holding their 10MB and their slot, fighting over the cores to be idle. The pooler we put in after made the backend count a number we picked on purpose, instead of one that fell out of replicas times per-pod connections. The worst-case total across every instance now lives in a comment at the top of the config, and it sits under the line.

Issue #022 - Valkey, one year on: you're probably running it already

Ilia Gusev — Tue, 16 Jun 2026 14:02:34 GMT

A teammate's Terraform plan for a new staging cache came back with engine = "valkey" where I'd have expected redis. I asked who'd changed it. Nobody had. The line wasn't in the PR description because it wasn't a change anyone made - it was the module default, and the module had moved upstream. Somewhere upstream, the thing you reach for when you want "a Redis" had quietly stopped being Redis, and nobody on our team had sat in a meeting and decided that. The default moved under us.

That's the part of the Redis-Valkey story the headline leaves out. The headline is true enough - a company relicensed, the community forked, all inside one frantic week in March 2024. But that week was a moment. The part I keep chewing on is the two years since, because Valkey didn't take over with announcements. It took over in the places nobody watches. My apt pulled it in without asking. The managed-service console reordered itself to show it first, at some point I never noticed. And the Helm chart our platform team owns turned out to have pointed at it for a year before that diff finally made me look. Not one of those shifts came with a press release - and they didn't stop when Redis, the company, reversed itself in 2025 and put Redis back under an open-source license.

What follows is about an ecosystem-scale shift that already finished happening while everyone argued about whether it would.

🏗️ Architectural Pattern: how a permissive fork becomes the default

What actually happened, with dates

On March 20, 2024, Redis dropped the BSD-3-Clause license it had carried since 2009 and moved to a dual source-available model: the Redis Source Available License v2 and SSPLv1. The first release under it was Redis 7.4. Neither license is OSI-approved, and that's not a technicality - it's the whole story. Source-available means you can read the code and run it, but the terms carve out exactly the thing the hyperscalers do: offer it as a managed service.

Eight days later, on March 28, the Linux Foundation announced Valkey, forked from Redis 7.2.4 - the last commit under BSD. The founding sponsors were the companies with the most to lose from a source-available Redis: AWS, Google Cloud, Oracle, Ericsson, Snap. Madelyn Olson, a longtime Redis core maintainer from AWS, became one of the project leads. One small correction to a thing I see repeated everywhere, including in my own notes from when I first filed this topic: Valkey is a Linux Foundation project, not a CNCF one. It never went into the Sandbox. I had it wrong for months.

So far this reads like every other license-flip fork - OpenSearch out of Elasticsearch, OpenTofu out of Terraform. The pattern rhymes. But the reason this fork stuck has less to do with community sentiment than with one word in the old license.

Why "permissive" was the load-bearing word

BSD is a permissive license. It lets you take the code, build a billion-dollar managed service on top, and never contribute a line back. That permissiveness is exactly what Redis Inc. was trying to end - and it's exactly what AWS, Google, and Oracle needed to keep their ElastiCache and Memorystore businesses running without paying a toll.

When those three put their weight behind Valkey, they weren't being generous. A BSD fork was the only outcome that preserved their existing business model. SSPL would have forced them to open-source their entire service stack; RSAL would have forced a commercial deal. A fork off the last BSD commit kept the permissive grant alive. The hyperscalers funded the fork because the fork was cheaper than either alternative the new license offered them.

That's why this fork had something OpenTofu and OpenSearch had to fight harder for: three of the largest infrastructure vendors on earth were structurally motivated to make it the default in their own products on day one. And the place a fork becomes a default isn't GitHub stars. It's the package index and the managed-service console.

The defaults fell like dominoes

Watch where Valkey landed, and how fast.

The Linux distributions went first, because a distro maintainer's whole job is "ship the thing with a real open-source license." Fedora 41 shipped in October 2024 with Valkey replacing Redis outright - the package literally carries Obsoletes: redis, so dnf upgrade swaps you over. Debian 13 "trixie" ships valkey-server in main. Arch replaced Redis in its [extra] repo in April 2025. Ubuntu pulled it into the archive the same autumn. If you apt install redis on a current distro, there's a real chance you're getting a compatibility shim that pulls Valkey.

The managed services went next. AWS launched ElastiCache for Valkey in October 2024 and priced it to move - the serverless tier landed about 33% cheaper than the Redis-OSS equivalent, node-based up to 20% cheaper, with a serverless floor around $6 a month. MemoryDB for Valkey came the same day at roughly 30% under the Redis option. Google's Memorystore for Valkey went GA in April 2025 and had Valkey 9.0 by March 2026. Oracle, Aiven, DigitalOcean all shipped Valkey tiers.

One thing the tidy version of this story gets wrong, and I want to be precise because I almost printed it myself: it is not all three big clouds. Azure did not switch. Azure Managed Redis runs Redis Enterprise software under a commercial arrangement with Redis Inc., and Valkey on Azure only exists if you run it yourself on AKS. So the line is "AWS and Google made Valkey the cheap default, Azure stayed with Redis." Anyone who tells you the hyperscalers unanimously defaulted to Valkey is rounding off a third of the market.

But two out of three, plus the distros, plus the price cut, is how a default moves without a decision. That Terraform module didn't pick Valkey because someone believed in open governance. It picked Valkey because the upstream provider made it the cheaper, first-listed option, and defaults are sticky in exactly the direction the vendor points them.

🆚 Showdown: Valkey vs Redis, one year of divergence

For the first six months, "Valkey vs Redis" was a non-question - they were the same codebase with different logos. That's not true anymore. Both projects shipped real, divergent engineering through 2025 and into 2026, and the gap is now wide enough that you're choosing between two products, not two brands on one binary.

Where Valkey spent its year

Valkey 8.0 landed in September 2024 with the work the AWS-heavy contributor base cared about most: I/O threading reworked to push more off the main thread, and a memory layout change in cluster mode that trimmed roughly 24 bytes per key - about a 20% reduction in per-node memory on real workloads. On a c7g.4xlarge that pushed single-node throughput to around 1.19 million requests per second, and the project made "one million RPS" the headline because for a Redis-shaped thing that number used to require sharding.
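To put the per-key number in fleet terms, here is a back-of-envelope sketch. The only input taken from above is the 24 bytes per key; the key counts are made-up examples, and real savings depend on your key and value sizes.

```python
BYTES_SAVED_PER_KEY = 24  # figure quoted above for Valkey 8.0 in cluster mode


def memory_saved_gib(key_count: int, bytes_per_key: int = BYTES_SAVED_PER_KEY) -> float:
    """Return the memory saved, in GiB, for a given number of keys."""
    return key_count * bytes_per_key / 2**30


if __name__ == "__main__":
    # Hypothetical keyspace sizes, purely for illustration.
    for keys in (10_000_000, 100_000_000, 1_000_000_000):
        print(f"{keys:>13,} keys -> {memory_saved_gib(keys):6.2f} GiB saved")
```

At a hundred million keys that is a couple of GiB per node, which is the kind of number that changes an instance size.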

Then 8.1 in March 2025 added an experimental RDMA transport and a more memory-efficient hash table. Version 9.0 in October 2025 brought atomic slot migration - resharding a cluster without the window where keys could be served by the wrong node - and hash-field TTLs (HEXPIRE), which people had wanted for a decade. The current release, 9.1 from May 2026, redesigned the threading again for another throughput bump and added per-database ACLs.

The module question matters too, because the old Redis Stack modules - JSON, search, probabilistic structures - were never BSD, so the fork couldn't take them. Valkey built replacements: valkey-search went GA in May 2025 with HNSW approximate-nearest-neighbor and exact KNN for vector workloads, valkey-bloom and valkey-json shipped around the same time, and they're bundled together now. Not as mature as Redis's decade of Stack, but no longer a blank space.

Where Redis spent its year, including a plot twist

Here's the part the "Valkey won" narrative tends to skip. On May 1, 2025, Redis 8 shipped with AGPLv3 added as a license option. AGPL is OSI-approved. Redis, the product, is open-source software again - tri-licensed now, but you can take the AGPL grant and you're fully in open-source territory. Salvatore Sanfilippo, antirez, the original author, had rejoined Redis in late 2024 and contributed Vector Sets to that release.

And Redis 8 is a genuinely strong release. It folded the formerly-separate Stack modules - JSON, time series, the query engine with vector and full-text search, the new Vector Sets - into the open-source core. If your reason for using Redis was the rich data-structure surface and the AI-adjacent feature set, Redis still has the deeper, more integrated version of that. The company kept the brand, the docs everyone Googles, the Redis Cloud business, and the most mature module ecosystem.

So who actually won what

They won different things, and pretending otherwise is how you make a bad architecture call.

Valkey owns the substrate. The distro default, the cheap managed tier on AWS and Google, the self-hosted "I just need a fast cache and I don't want to think about licensing" case. If your relationship with Redis was "it's the thing apt installs and ElastiCache runs," that thing is Valkey now, and the migration is free.

Redis owns the product surface. The integrated search and vector and JSON story, the commercial support, the brand recognition that gets it through a procurement review without a fight, the feature velocity from a funded company with the original author back in the building.

The mistake is treating it as one contested codebase where a winner takes all. It's two projects that started identical and are walking apart - same wire protocol, increasingly different ambitions. A year from now "should we use Redis or Valkey" will feel as coherent a question as "should we use MariaDB or MySQL," which is to say: it depends entirely on which lane you're standing in.

🛠️ The practical part: migrating is a non-event, the operator isn't

The good news is almost suspiciously good. Valkey forked from Redis 7.2.4, so for the overwhelming majority of deployments the migration is "change the binary, keep everything else." RESP2 and RESP3 behave identically on both engines. RDB files written by Redis 7.2.x or older load straight across. Replication from a Redis primary of that vintage to a Valkey replica works, which means you can cut over with the same zero-downtime dance you'd use for a Redis version bump: stand up Valkey as a replica, let it sync, promote it, retire the old primary.
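The replica-then-promote dance can be sketched as a short plan plus a health check. This is a minimal sketch, not a migration tool: the helper names (parse_info_replication, replica_caught_up, cutover_plan) are mine, the host and port are placeholders, and it only builds command strings and parses an INFO replication reply you have fetched yourself. REPLICAOF and INFO replication are the real commands on both engines.

```python
def parse_info_replication(info_text: str) -> dict[str, str]:
    """Parse the key:value lines of an INFO replication reply into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# Replication" section header
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_caught_up(info_text: str) -> bool:
    """True when the reply describes a replica whose link is up and whose initial sync is done."""
    fields = parse_info_replication(info_text)
    return (
        fields.get("role") in ("slave", "replica")  # accept either spelling of the role
        and fields.get("master_link_status") == "up"
        and fields.get("master_sync_in_progress") == "0"
    )


def cutover_plan(old_host: str, old_port: int) -> list[tuple[str, str]]:
    """Ordered (target, command) steps; both commands are sent to the new Valkey node."""
    return [
        ("valkey", f"REPLICAOF {old_host} {old_port}"),  # start syncing from the Redis primary
        ("valkey", "INFO replication"),                  # poll until replica_caught_up() is True
        # Stop writes to the old primary and repoint clients before this last step.
        ("valkey", "REPLICAOF NO ONE"),                  # promote: stop replicating, accept writes
    ]
```

The check is deliberately conservative. In a real cutover you would also compare replication offsets before promoting, which this sketch leaves out.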

Your client library almost certainly doesn't care either. Jedis, Lettuce, ioredis, redis-py, go-redis, StackExchange.Redis - they speak the wire protocol, and the wire protocol didn't change. I migrated a small internal service as a test before writing this, and the only diff in the application was the connection hostname. There's also Valkey GLIDE now, an officially-backed client with a Rust core and language bindings, if you want something the project itself maintains - but you are under no pressure to switch off the client you have.

So if it's that easy, where's the catch? It's one layer up, in how you run it on Kubernetes.

The operator gap is real

If you run Redis on Kubernetes through an operator, do not assume there's a clean, mature, Valkey-native equivalent waiting. There isn't, quite, yet. The official valkey-operator from the Valkey project is still pre-1.0 and labeled not-for-production, and I'd take that label seriously. The most capable Valkey-native operator today is the third-party hyperspike/valkey-operator, which is solid but also still pre-1.0 with a few hundred stars - fine for a team that reads the code, riskier as a blind dependency.

The pragmatic move, and the one I've seen hold up, is to not chase a Valkey-branded operator at all. The mature Redis operators - the ones from OT-Container-Kit and Spotahome - drive the engine over the same wire protocol, so they run Valkey perfectly well even though they say "redis" on the tin. Point the existing operator at a Valkey image and it works. You get the maturity of a battle-tested operator and the engine you want, and you wait for the Valkey-native operators to grow up before betting a fleet on them.

One more sharp edge from this year: Bitnami changed its container-image and chart terms in August 2025, which broke a lot of "just use the Bitnami chart" muscle memory across both Redis and Valkey. The Valkey project responded with an official Helm chart in January 2026. If your platform inherited a Bitnami Valkey chart and it started behaving strangely late last year, that's why - move to the official chart.

The decision, compressed

New deployment, you control the stack, you want a cache or a data structure server without the licensing question: Valkey, default, no real argument. New deployment that leans hard on Redis's integrated search, vector, or JSON-and-query surface and you'd otherwise be wiring those up by hand: Redis 8 earns its place. Existing Redis older than 7.4 that you're happy with: there's no fire, but the next time you'd do a major version bump anyway, that's your free, low-risk moment to land on Valkey instead - because the managed tier and the distro package are already heading there without you.

🔥 Hot Take: Redis is open source again and it changed nothing

When Redis 8 added AGPL in May 2025, a reasonable person could have called the whole thing over. The original grievance was "Redis stopped being open source." Redis is open source again. Case closed, everyone go home, undo the fork.

That's not what happened, and the reason it didn't is worth sitting with, because it's the actual lesson of the last two years.

First, the technical one: AGPL is copyleft, BSD is permissive, and for the parties that funded Valkey those are not interchangeable. AGPL's network clause is precisely the obligation a hyperscaler building a closed managed service wants to avoid - it's a milder cousin of the SSPL that started the fight. Redis going AGPL gave individual developers their OSI checkbox back, but it gave AWS and Google nothing they could build a business on the way BSD did. The people who made Valkey the default didn't get their problem solved by Redis 8. So they kept their fork, and the fork they'd already wired into ElastiCache and Memorystore kept being the default.

Second, the one that doesn't show up in a license comparison table: trust doesn't round-trip. A project that relicensed once, against the wishes of much of its contributor base, to capture revenue, has demonstrated it will do that. Adding an open license back doesn't restore the thing that broke - it just proves the license is a lever the company is willing to pull. Once a community has watched the rug move, "we put it back" is not the same as "the rug was never moveable." The Linux Foundation's pitch for Valkey is governance you can't relicense on a whim, and that pitch got stronger when Redis demonstrated relicensing-on-a-whim is a thing that happens.

I'll add a caution in the other direction, because the pro-Valkey camp overclaims too. You've probably seen the stat that "83% of large enterprises have adopted or are exploring Valkey." Don't repeat it as a migration rate. It's a vendor survey, it predates Redis going AGPL, and "adopted or exploring" bundles a production cutover with someone running it once in a sandbox. The honest, boring truth is the one I opened with: the defaults moved, so adoption is happening through inertia more than conviction, in the lanes where the upstream made Valkey the cheap first option.

The reframe I'd offer: this stopped being a fight and became a fork in the road, in the literal sense. Two roads, both paved, going to different places. Redis is a well-funded company building a rich data platform with the original author back at the helm. Valkey is an infrastructure commodity governed so it can't be enclosed again, riding inside the clouds and distros that fund it. The "war" framing wants a loser. There isn't one. There's a substrate and there's a product, and most of us are quietly running the substrate without having chosen it - which, when you think about what infrastructure is supposed to be, might be the most complete kind of winning there is.

Until next week

The thing that stuck with me writing this: the most consequential infrastructure decision of the last two years, for a huge number of teams, was made by nobody on those teams. It got made in a module default, a distro Obsoletes line, a managed-service console that listed the cheaper option first. That's worth a paranoid afternoon - go check what your caches actually run right now, because the answer may have changed without a ticket.

Next Tuesday we stay in the land of costs you didn't sign off on: namespaces. They look free. At scale they are extremely not, and one team found 7 TiB of memory hiding in the gap. See you then.

- Ilia

Postgres jsonb: when documents beat columns, and when they don't

Ilia Gusev — Fri, 12 Jun 2026 14:00:36 GMT

Jsonb feels free. You throw the shape-shifting payload into a column, query it later, ship the feature, move on. Most teams pay for that feeling around month 18. The cost shows up in three places at once - write amplification where every partial update rewrites the entire document, GIN indexes that are nowhere near as cheap as they look on paper, and schema drift you can't query against because the shape only exists in the heads of whoever last touched the writer. The decision isn't "columns vs jsonb". It's "what does the access pattern look like, and which trade-off do you want to pay?"

The free-feeling column that isn't

The pitch is irresistible. Variable shape, no migrations, the planner handles it. A startup with five product directions a quarter, an event-ingest table where every source has different fields, an admin metadata column that grows a new field every release - all reach for jsonb and feel like they got away with something. Two years later, the same three failure modes show up on every one of them.

First, write amplification on partial updates. Postgres has no concept of updating a field inside a jsonb document. jsonb_set of one nested key is the same code path as overwriting the whole thing - row read, new version materialised in memory, brand-new tuple written. On a 5 KB document where you changed one boolean, you wrote 5 KB of new tuple, marked the old one dead, handed 5 KB of bloat to autovacuum. Multiply by a million updates a day and you're back in the disk-graph nightmare Evergreen #4 covered, except now the cause is your data model.

Second, query plan disasters when nobody indexed for the access pattern. WHERE doc @> '{"status":"active"}' against ten million rows without a GIN index is a sequential scan that parses every jsonb document on every page. The query took 4 ms on staging and takes 14 seconds in production. The team's first instinct is to "add an index" - but jsonb indexing has three flavours for three different operator families, and the wrong one is one the planner won't use.

Third, schema drift you can't query against. Six months in, half the rows have status, the other half have state, values are sometimes strings, sometimes booleans, sometimes the string "true". A check constraint would have caught it on day one. The jsonb column silently accepted everything. When analytics asks "how many active users", the honest answer is "we don't know, the field is spelled three ways". This isn't a Postgres problem - it's the absence of one. Jsonb doesn't push back on you, and that's exactly the failure.
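If you suspect you're already there, a drift audit is cheap to write. The sketch below assumes you have pulled a sample of documents into Python as dicts. The function name audit_keys and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def audit_keys(docs, candidates):
    """For each candidate key, count how often it appears and with which Python type."""
    report = defaultdict(Counter)
    for doc in docs:
        for key in candidates:
            if key in doc:
                report[key][type(doc[key]).__name__] += 1
    return {key: dict(counts) for key, counts in report.items()}


if __name__ == "__main__":
    # Made-up sample showing the three spellings described above.
    sample = [
        {"status": "active"},
        {"state": True},
        {"status": "true"},
        {"status": True},
    ]
    print(audit_keys(sample, ["status", "state"]))
```

Anything that comes back with more than one key for the same concept, or more than one type under the same key, is a field that needed a constraint.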

What jsonb actually stores

jsonb isn't the JSON text. It's a binary tree representation of the parsed document, in a Postgres-specific format. On insert, Postgres parses the JSON, validates it, writes the tree. On read it walks the tree directly - no re-parsing. That's why jsonb is faster to query than json (the other type, which keeps the original text and re-parses on every access) and why storage is slightly higher. You pay parse time once at write, get cheap reads forever. For anything you'll query, jsonb is the right type.

Then there's TOAST. Postgres rows live in 8 KB pages, and once a row grows past roughly 2 KB its large values get compressed and, if still too big, pushed out-of-line into the TOAST table for that relation, then reassembled on read. A 5 MB jsonb document doesn't live in your events table - it lives in pg_toast_<oid>, a side table named after the parent relation's OID, with the main row holding a pointer. SELECT doc FROM events WHERE id = 42 follows the pointer, reads the chunks, decompresses, hands you back the document. A lot of I/O for one row.

TOAST is what makes the atomic-write story so brutal. A partial update can't update part of the document - the document is a single immutable value stored as TOAST chunks. Postgres reads the whole thing into memory, applies your change to the new version, writes a fresh set of chunks, updates the main-row pointer, marks the old tuple dead. A 200-byte field flipped inside a 5 MB document means 5 MB of new TOAST writes plus WAL traffic plus future vacuum work. The statement looks like one line and behaves like a full-document rewrite, every time.
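The arithmetic is worth doing once with your own numbers. This sketch uses decimal units and the figures from this piece; it ignores TOAST compression, WAL overhead and index maintenance, so treat the results as a floor on the order of magnitude. The function names are mine.

```python
def write_amplification(doc_bytes: int, changed_bytes: int) -> float:
    """How many bytes of new tuple get written per byte the application meant to change."""
    return doc_bytes / changed_bytes


def daily_dead_bytes(doc_bytes: int, updates_per_day: int) -> int:
    """Bytes of dead tuples handed to autovacuum per day (each update kills one old version)."""
    return doc_bytes * updates_per_day


if __name__ == "__main__":
    print(write_amplification(5_000_000, 200))             # 5 MB document, 200-byte field
    print(daily_dead_bytes(5_000, 1_000_000) / 1e9, "GB")  # 5 KB document, a million updates a day
```

A 200-byte flip in a 5 MB document comes out at 25,000x, and the modest 5 KB case still produces 5 GB of dead tuples a day.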

One subtle detail: inserting a document and reading it back doesn't give you exactly what you sent. Keys get reordered, whitespace disappears, duplicate keys are silently de-duplicated keeping the last, numeric values are normalised. If your app round-trips jsonb expecting byte-equivalence, it won't get it.
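You can get a feel for this without a database. The sketch below is an approximation in plain Python, not Postgres itself: it mimics the whitespace stripping, the last-duplicate-wins rule, and the key ordering I've observed from jsonb (shorter keys first, then bytewise). It makes no attempt to reproduce numeric normalisation, so the example sticks to integers and strings.

```python
import json


def approx_jsonb_roundtrip(text: str) -> str:
    """Approximate what comes back after storing JSON text in a jsonb column."""

    def normalise(value):
        if isinstance(value, dict):
            # Assumed ordering: key length first, then raw bytes.
            ordered = sorted(
                value.items(),
                key=lambda kv: (len(kv[0].encode("utf-8")), kv[0].encode("utf-8")),
            )
            return {key: normalise(child) for key, child in ordered}
        if isinstance(value, list):
            return [normalise(child) for child in value]  # array order is preserved
        return value

    parsed = json.loads(text)  # for duplicate keys, the last occurrence wins
    return json.dumps(normalise(parsed), separators=(", ", ": "), ensure_ascii=False)


if __name__ == "__main__":
    print(approx_jsonb_roundtrip('{ "bb": 1,   "a": 2, "a": 3 }'))  # {"a": 3, "bb": 1}
```

If anything downstream hashes or diffs the raw text, it needs to compare parsed values instead.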

Indexing jsonb: jsonb_ops vs jsonb_path_ops

GIN is the workhorse for jsonb. It's a generalised inverted index - builds a posting list for each indexable item and lets the planner intersect those lists at query time. The two flavours differ on what counts as "an indexable item".

jsonb_ops is the default. It indexes keys and values separately - every key and every value gets its own posting list. A wide set of operators work against it: containment @>, key existence ?, any-of and all-of key existence ?| and ?&, plus JSON path operators on newer Postgres. The cost is index size - on documents with many distinct keys and values, the index can grow larger than the table itself.

jsonb_path_ops is the slimmer cousin. It hashes the entire path from root to each leaf value into a single token and indexes only the hashes. The index is typically half the size. The trade-off is that key-existence operators stop working - it answers @> containment but not "does this document have a top-level key called status". For most production workloads the only operator that matters is @>, which is exactly what jsonb_path_ops is optimised for.
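A toy model makes the difference concrete. This is not how GIN is implemented and the hash is not the one Postgres uses; it only mirrors the idea described above: one operator class emits keys and values as separate items, the other emits one hash per root-to-leaf path. All function names are mine.

```python
import hashlib


def ops_items(value) -> set:
    """jsonb_ops-style: every key and every scalar value is its own index item."""
    items = set()
    if isinstance(value, dict):
        for key, child in value.items():
            items.add(("key", key))
            items |= ops_items(child)
    elif isinstance(value, list):
        for child in value:
            items |= ops_items(child)
    else:
        items.add(("value", repr(value)))
    return items


def path_ops_items(value, prefix=()) -> set:
    """jsonb_path_ops-style: one hash per path from the root down to each leaf value."""
    items = set()
    if isinstance(value, dict):
        for key, child in value.items():
            items |= path_ops_items(child, prefix + (key,))
    elif isinstance(value, list):
        for child in value:
            items |= path_ops_items(child, prefix)  # array position is not part of the path
    else:
        token = "/".join(prefix) + "=" + repr(value)
        items.add(hashlib.sha1(token.encode("utf-8")).hexdigest()[:8])
    return items


def may_contain(doc_items: set, query_items: set) -> bool:
    """Containment candidate check: every item of the query must be present for the row."""
    return query_items <= doc_items


if __name__ == "__main__":
    doc = {"action": "login", "tenant": "acme", "meta": {"ip": "10.0.0.1"}}
    query = {"action": "login", "tenant": "acme"}
    print(len(ops_items(doc)), len(path_ops_items(doc)))
    print(may_contain(path_ops_items(doc), path_ops_items(query)))
```

The path-hash set has nothing in it that represents a bare key, which is why a key-existence question can't be answered from it, while containment works on both.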

EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM events
WHERE doc @> '{"action":"login","tenant":"acme"}';

 Seq Scan on events  (cost=0.00..184320.00 rows=120 width=8)
                    (actual time=0.041..3924.882 rows=118 loops=1)
   Filter: (doc @> '{"action": "login", "tenant": "acme"}'::jsonb)
   Rows Removed by Filter: 9999882
   Buffers: shared hit=21 read=141204
 Execution Time: 3924.940 ms

Same query after a GIN index with jsonb_path_ops:

CREATE INDEX ix_events_doc ON events USING gin (doc jsonb_path_ops);

EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM events
WHERE doc @> '{"action":"login","tenant":"acme"}';

 Bitmap Heap Scan on events  (cost=24.50..425.18 rows=120 width=8)
                            (actual time=0.612..1.118 rows=118 loops=1)
   Recheck Cond: (doc @> '{"action": "login", "tenant": "acme"}'::jsonb)
   ->  Bitmap Index Scan on ix_events_doc  (cost=0.00..24.47 rows=120 width=0)
   Heap Blocks: exact=104
   Buffers: shared hit=18 read=86
 Execution Time: 1.184 ms

Read your plans before deciding. Evergreen #6 covered EXPLAIN ANALYZE - that's the tool that tells you whether the index you built is the index the planner actually picked. Building a jsonb_path_ops index for a query that filters with the ? key-existence operator, then watching it sit untouched on disk eating write throughput, is a real-life mistake.

GIN writes are not free. Every insert and update touches every posting list the new document hits. On a wide document with dozens of fields, a single insert touches dozens of lists. The fastupdate option batches these into a pending list merged later, smoothing per-row cost but trading it for occasional vacuum spikes. For very write-heavy tables, dropping the index during a bulk load and rebuilding after is often faster.

The third indexing path - the one most teams reach for too late - is the expression index. If you query one field a lot (WHERE doc->>'status' = 'active'), GIN over the whole document is overkill. A targeted btree on the expression is smaller, faster, and lets the planner use it for sorts and range queries:

CREATE INDEX ix_events_status ON events ((doc->>'status'));

Now the planner uses it like any other column, and the GIN can be dropped if no other query needs it. Most production jsonb workloads end up with one expression index on the single hot field and either no GIN or a small jsonb_path_ops for the rare containment queries.

The decision framework

The question isn't "columns or jsonb". It's "what does the access pattern look like". Walk through it honestly.

Jsonb wins when:

Shape varies per row and you genuinely don't know it. Event payloads from heterogeneous sources, per-tenant custom fields, plug-in metadata - every row has a different set of keys, and forcing a schema means a sparse table with hundreds of mostly-null columns.
You read documents wholesale. One SELECT returns the whole thing and the application picks it apart. No WHERE doc->>'field' = ... predicate on twenty different fields.
Writes are mostly full-document replacements. You're rewriting the document or appending rows, not doing partial updates. No write amplification because every write was going to be the full document anyway.
Index size on a few hot fields is fine. You need GIN on one or two fields, not twenty. The index stays bounded.

Columns win when:

Shape is stable and you know it. A user has an email, a created_at, a tenant_id. These fields don't disappear or rename themselves. Make them columns.
You query by field with predicates and joins. WHERE created_at > $1 AND status = $2 AND tenant_id = $3 against a jsonb document with three expression indexes is the same query you'd write against three columns, and slower at every step.
Writes are field-level partial updates. Updating one field should write the row, not 5 MB of TOAST chunks. Columns get this right for free.
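You need foreign keys, simple check constraints, or column-level statistics. A foreign key can't point at a key inside a jsonb document, and a check constraint has to be hand-written as an expression over the document (jsonb_typeof and friends) instead of falling out of the column type. Without an expression index or extended statistics the planner has no histogram for doc->>'status', so its row estimates for that predicate are guesses. Statistics come from columns.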

The hybrid is the most common production answer. A few stable columns - id, tenant_id, created_at, kind, status - plus a metadata jsonb for the long tail. The hot path queries against columns, the cold path digs into metadata when needed. You keep the planner's statistics, the option to add constraints, and the flexibility for fields that don't have a stable shape. Almost every production table that lasts three years ends up looking like this.

Migration patterns

You'll often want to extract a column from a jsonb document once the field is stable enough to deserve one. The clean pattern is a generated column - Postgres maintains it from the underlying jsonb, and you index it like any other column:

ALTER TABLE events
  ADD COLUMN status TEXT
  GENERATED ALWAYS AS (doc->>'status') STORED;

CREATE INDEX ix_events_status ON events (status);

Stored generated columns cost disk space but zero CPU on read. Virtual ones (Postgres 18+) cost no disk but recompute on access. For a hot filter field, stored is the right answer.

Going the other way - collapsing a wide table into a metadata jsonb column - is rarer and usually a sign the shape was never stable. Add the column, backfill via jsonb_build_object, drop the source columns. The hard part is updating every writer and reader: do it behind a feature flag, in stages, with the jsonb column dual-written first.

Common mistakes

A few patterns that come up over and over when teams hit the wall:

Using jsonb as a substitute for schema design. "We'll figure out the shape later" means "we'll have three spellings of the same field by Q3". Sketch the shape on paper before the column hits the migration.
No GIN index when the workload runs containment queries. Sequential scans over millions of documents in the hot path. Read the plan, see the seq scan, build the right index.
jsonb_ops everywhere when jsonb_path_ops would be half the size. If the only operator touching the column is @>, jsonb_path_ops is the answer - smaller index, faster writes, same query speed.
Treating jsonb as searchable text. LIKE '%foo%' can't use any GIN you've built. For full-text search, use a tsvector column.
Reading large jsonb documents to extract one field. SELECT doc->>'status' FROM events on 5 MB documents detoasts every row. An expression index or stored generated column avoids the detoast entirely.
Forgetting that jsonb doesn't enforce types. The same key holds a string in one row, a number in another, a null in a third. Check constraints with jsonb_typeof are the only thing standing between you and a downstream parsing bug.
Partial updates on huge documents. Every jsonb_set rewrites the whole thing. 5 MB document, 50-byte field, you're paying 100,000x. Split the document or move the volatile field into a column.

Autovacuum was storage. Explain analyze was the planner. Isolation was concurrency. Jsonb is data model. Four sides of one box - and most production correctness and performance problems live in exactly one of them. The Postgres-fundamentals arc closes here, but the box is the thing to remember. Every regression you'll chase for the next year sits on one of these faces, and you now know which face to start with.

Transaction isolation: when read committed quietly skips your row

Ilia Gusev — Wed, 10 Jun 2026 14:01:33 GMT

Postgres ships with READ COMMITTED as its default isolation level, and neither Django nor Rails will tell you. You read "ACID" on the marketing page, assume your update_balance() function is safe under concurrency, and ship. It isn't. READ COMMITTED allows lost updates, non-repeatable reads, phantoms, and write skew - and a banking app with two concurrent transfers into the same account can silently lose one of them if the code does the natural read-then-write pattern, with Postgres committing both without a single error in the log.

The Django update that lost rows

Picture the simplest function in any system that touches money. Read balance, add deposit, write back, commit.

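from django.db import transaction

def credit(account_id, amount):
    # Racy on purpose: atomic() opens a transaction but takes no row lock.
    with transaction.atomic():
        acct = Account.objects.get(id=account_id)  # plain SELECT, nothing is locked
        acct.balance += amount                     # new total computed in Python from that read
        acct.save()                                # UPDATE writes the total, stale or not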

Two requests arrive simultaneously, each crediting 100. A reads balance = 500. B starts a millisecond later, reads balance = 500 too - A hasn't committed yet, so B can't see A's work. A computes 600, writes, commits. B computes 600, writes, commits. Final balance: 600. Customer credited 200, account moved by 100, one deposit is gone, no error anywhere.

This is the lost update anomaly, the most common production correctness bug in OLTP systems. Not "common in 1995 textbooks" - common right now, in Django apps shipping this week, because the ORM gives you a transaction and you assume the transaction did the locking. It didn't. Under READ COMMITTED, each statement sees the latest committed data, but two transactions can both read the same row, both compute new values from it, and both write back. The database serializes the writes - last writer wins. The arithmetic loses.
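The interleaving is easier to believe when you step through it by hand. This is a deterministic replay in plain Python with a dict standing in for the committed row. There is no database or ORM involved, just the order of operations described above.

```python
def run_interleaved() -> int:
    """Replay the A/B schedule: both read before either writes."""
    db = {"balance": 500}          # committed state
    a_read = db["balance"]         # A: SELECT -> 500
    b_read = db["balance"]         # B: SELECT -> 500 (A has not committed yet)
    db["balance"] = a_read + 100   # A: UPDATE, COMMIT -> 600
    db["balance"] = b_read + 100   # B: UPDATE, COMMIT -> 600, overwriting A
    return db["balance"]


def run_sequential() -> int:
    """Same two credits, but B reads only after A has committed."""
    db = {"balance": 500}
    db["balance"] = db["balance"] + 100   # A: read, write, commit
    db["balance"] = db["balance"] + 100   # B: read, write, commit
    return db["balance"]


if __name__ == "__main__":
    print(run_interleaved())  # 600
    print(run_sequential())   # 700
```

Both schedules commit cleanly. The only difference is when B's read happened.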

This bug is durable because it doesn't show up in load tests with one client, and it doesn't show up in unit tests at all. It only fires when two requests collide on the same row, which on most apps means it fires occasionally in production, looks like a "weird discrepancy", gets blamed on a flaky integration, and lives in the codebase for years. Evergreen #4 showed how the visibility horizon holds vacuum back; the same horizon is what makes snapshot isolation possible.

What the four ANSI isolation levels mean

The 1992 SQL standard defines four levels in terms of which anomalies they allow. The anomalies were the framework, the levels were the rungs.

The first is dirty read - reading data from a transaction that hasn't committed yet. If A writes balance = 0 and hasn't committed, B should not see balance = 0. If A rolls back, B made a decision on a value that never existed.

The second is non-repeatable read - reading the same row twice in one transaction and getting different values, because another transaction committed an UPDATE in between. Read balance, do some work, read balance again, it changed underneath you.

The third is phantom read - running the same range query twice and getting different rows back. SELECT * FROM orders WHERE user_id = 4711 returns 3 rows the first time, 4 rows the second time, because another transaction inserted a new order that matches your predicate.

The fourth, which ANSI missed and Berenson et al. pointed out in 1995, is write skew. Two transactions read overlapping data, each decides based on what the other can't see, both commit writes consistent with their own snapshot but together violate a constraint nobody enforced. Two doctors on call. Both transactions read "at least one doctor on call", both decide "I can go off-call because the other one's still there", both UPDATE themselves to off-call, both commit. Zero doctors on call.
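Here is the on-call scenario as a small simulation. It is a sketch in plain Python where each transaction gets its own copy of the committed state as its snapshot. The doctor names and the go_off_call helper are made up for the example.

```python
def go_off_call(snapshot: dict, doctor: str) -> dict:
    """Decide from this transaction's snapshot alone; return the write it would make."""
    others_on_call = [name for name, on_call in snapshot.items() if on_call and name != doctor]
    if others_on_call:
        return {doctor: False}   # "someone else is still on call, I can leave"
    return {}                    # nobody else on call, so no write


def run_write_skew() -> dict:
    committed = {"alice": True, "bob": True}
    snapshot_a = dict(committed)               # both transactions start before either commits
    snapshot_b = dict(committed)
    write_a = go_off_call(snapshot_a, "alice")
    write_b = go_off_call(snapshot_b, "bob")
    committed.update(write_a)                  # A commits: it only touched alice's row
    committed.update(write_b)                  # B commits: it only touched bob's row, no conflict
    return committed


if __name__ == "__main__":
    print(run_write_skew())  # {'alice': False, 'bob': False}
```

The two writes never touch the same row, so there is nothing for a row-level conflict check to catch.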

The standard maps anomalies to levels mechanically:

READ UNCOMMITTED: dirty reads allowed, plus everything below.
READ COMMITTED: no dirty reads. Non-repeatable reads, phantoms, and write skew all allowed.
REPEATABLE READ: no dirty reads, no non-repeatable reads. Phantoms allowed (in ANSI). Write skew allowed.
SERIALIZABLE: no anomalies. Transactions behave as if they ran one after the other.

The catch is that "behave as if they ran one after the other" is a strong promise. To deliver it cheaply, real databases either lock aggressively or let transactions proceed optimistically and abort some at commit time when a conflict shows up. Postgres makes one bet, MySQL InnoDB makes another, Oracle makes a third. The level name is the same, the behavior isn't.

That's where most production bugs live - in the gap between "I asked for REPEATABLE READ" and "this database's definition of REPEATABLE READ". You have to know what your engine actually does.

How Postgres actually implements these

Postgres has its own opinions, and they're worth memorising once instead of guessing every time.

READ UNCOMMITTED doesn't really exist - if you ask for it, you get READ COMMITTED. Dirty reads aren't possible on MVCC the way Postgres does it, because every row version carries the xmin of the transaction that wrote it, and readers skip versions whose xmin hasn't committed.

READ COMMITTED is the default. Every statement gets a fresh snapshot. Two statements in the same transaction can see different data because another transaction committed in between. Fast, and where lost updates and write skew quietly happen.

REPEATABLE READ in Postgres is actually snapshot isolation. The whole transaction sees one snapshot taken at the start, regardless of what commits later. This is stronger than ANSI requires - phantoms are allowed at this level by the standard, Postgres prevents them because the snapshot excludes newly-inserted rows from other transactions. You can still get write skew, because two transactions working from their own snapshots can each make a decision based on rows the other is changing, and neither snapshot shows the other's write.
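The difference between "fresh snapshot per statement" and "one snapshot per transaction" fits in a toy store. This is a heavily simplified model: a single commit counter stands in for the xmin bookkeeping, and ToyStore and two_reads are names I made up for the sketch.

```python
class ToyStore:
    """One row with a list of committed versions, each stamped with a commit sequence number."""

    def __init__(self):
        self.versions = []   # list of (commit_seq, value), oldest first
        self.commit_seq = 0

    def commit(self, value):
        self.commit_seq += 1
        self.versions.append((self.commit_seq, value))

    def snapshot(self) -> int:
        """A snapshot is simply 'everything committed up to now'."""
        return self.commit_seq

    def read(self, snapshot: int):
        """Return the newest version that was committed at or before the snapshot."""
        visible = [value for seq, value in self.versions if seq <= snapshot]
        return visible[-1]


def two_reads(store: ToyStore, level: str, concurrent_commit) -> tuple:
    """Read the row twice in one transaction, with another transaction committing in between."""
    txn_snapshot = store.snapshot()   # taken once, at transaction start

    def current_snapshot():
        # READ COMMITTED takes a new snapshot per statement; REPEATABLE READ reuses the first.
        return store.snapshot() if level == "READ COMMITTED" else txn_snapshot

    first = store.read(current_snapshot())
    concurrent_commit()
    second = store.read(current_snapshot())
    return first, second


if __name__ == "__main__":
    for level in ("READ COMMITTED", "REPEATABLE READ"):
        store = ToyStore()
        store.commit(500)
        print(level, two_reads(store, level, lambda: store.commit(600)))
```

Under READ COMMITTED the second read sees 600. Under REPEATABLE READ it still sees 500, because the version stamped after the transaction's snapshot is invisible to it.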

SERIALIZABLE is implemented as SSI - Serializable Snapshot Isolation. Snapshot isolation plus a runtime dependency tracker that watches for dangerous patterns of read/write conflicts between concurrent transactions. When it spots one, it aborts one of the transactions involved with SQLSTATE 40001, "could not serialize access due to read/write dependencies among transactions". SSI is cheap in the sense that its predicate locks never block anyone - no extra waiting - but it shifts the cost to commit-time aborts.

BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM doctors WHERE on_call = true;
-- application logic checks count >= 1
UPDATE doctors SET on_call = false WHERE id = 1;
COMMIT;
-- ERROR:  40001: could not serialize access due to read/write dependencies
-- among transactions

The 40001 SQLSTATE is the contract. Any code that opens a SERIALIZABLE transaction must catch it and replay the whole transaction from the start - not just the failed statement, since the snapshot was taken at BEGIN. New BEGIN, fresh snapshot, redo the whole thing. If you don't have retry logic, you don't have SERIALIZABLE - you have a database that occasionally fails the request and tells your user something went wrong.
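
A minimal sketch of that retry contract, not tied to any particular driver. The names run_serializable and SerializationFailure are hypothetical, and so is the assumption that the error object exposes the code as a sqlstate attribute; real drivers surface SQLSTATE under their own attribute names. The one thing the sketch is meant to show is that the retry wraps the whole transaction function, not a single statement.

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""
    sqlstate = "40001"


def run_serializable(txn, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Run txn() from the top, retrying only on SQLSTATE 40001.

    txn is expected to open its own transaction, do every read and write,
    and commit, so each retry starts over with a fresh snapshot.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except Exception as exc:
            if getattr(exc, "sqlstate", None) != "40001":
                raise  # not a serialization failure: do not mask it
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Jittered exponential backoff so colliding transactions spread out.
            sleep(base_delay * (2 ** (attempt - 1)) * random.random())
```

The cap on attempts matters on hot rows: without it, a transaction that keeps losing the race retries forever instead of failing the request.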

When you actually need serializable

Most apps don't. READ COMMITTED with proper SELECT FOR UPDATE row locks covers maybe 95% of real-world cases. The natural pattern is: lock the rows you're about to modify, then read-modify-write, then commit. As long as you lock everything you read for-decision, lost updates and write skew are blocked because the second transaction blocks on the lock until the first commits, then re-reads, then sees the fresh value.

SERIALIZABLE matters when row-level locks aren't enough. Three signals:

You can't enumerate the rows to lock ahead of time. The doctors-on-call example is canonical - the constraint is "at least one doctor on call", which depends on the count of rows matching a predicate, not on a specific row. Row locks can't protect a predicate. SSI's dependency tracker has predicate locks built in.

The access pattern is complicated enough that getting locking right by hand is error-prone. Six tables, three indirections, a constraint that spans them. Either reason through every lock-acquisition order yourself, or ask the database to figure it out and abort conflicting transactions. The second is easier to get right.

Correctness matters more than throughput. Banking, financial reconciliation, inventory with reservations - anything where a wrong commit is worse than a slow one. Trade-off is real: every transaction can fail with a serialization error, every retry costs latency, and on hot rows retry rate can spike. If you can't make your transactions retry-safe, you can't use SERIALIZABLE. That alone disqualifies a lot of codebases.

The select for update pattern done right

SELECT FOR UPDATE is the workhorse. It acquires a row-level write lock on every row the query returns, holds it until the transaction commits or rolls back, and blocks any other transaction that tries to lock or update those rows. The order-of-operations rule is: lock before you read for decision, not after.

The lost-update bug from earlier fixes with one line:

BEGIN;
SELECT balance FROM accounts WHERE id = $1 FOR UPDATE;
-- now this row is locked. application computes new balance.
UPDATE accounts SET balance = $2 WHERE id = $1;
COMMIT;

Transaction B's SELECT ... FOR UPDATE blocks behind A's lock until A commits, then sees the fresh balance and computes correctly. Two writes, both correct, no anomaly. The cost is that B waits.

FOR UPDATE also conflicts with the FOR KEY SHARE locks that foreign-key checks take on the referenced row, so inserts into child tables that point at a locked row wait too, which surprises you on schemas with a lot of FK fanout. If you're just modifying a non-key column, FOR NO KEY UPDATE is the lighter variant - same protection against concurrent writers, but it doesn't block those FK checks.

Two modifiers worth knowing. SKIP LOCKED makes the lock attempt skip rows another transaction already locked, instead of waiting. This is the queue-consumer pattern: ten workers each grab the next available job without stepping on each other.

SELECT id FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;

Each worker quietly skips rows other workers are processing. No coordination, no Redis, no Kafka. Just Postgres.

NOWAIT makes the lock attempt fail immediately if the row is already locked. Useful for fail-fast paths where blocking would be worse than reporting "busy, try again".

The trap is locking after you've already read. If you do SELECT balance ... ; UPDATE ... without FOR UPDATE, you've gone back to the lost-update case. The lock has to be acquired during the read, not afterwards. pg_locks joined against pg_stat_activity shows you what's locked and who's holding it when contention bites.

Common mistakes

A few patterns that come up over and over in concurrency tickets:

Assuming "ACID" means serializable. ACID is a marketing umbrella. The "I" is whatever isolation level you actually configured, and the default is the weakest useful one.
Doing read-then-update without FOR UPDATE. The classic lost-update path. If you read a value to compute a new one, lock the row during the read.
Wrapping logic in BEGIN; ... COMMIT; and assuming Postgres will serialize for you. It won't. Transactions don't lock rows just because they're transactions.
Catching SQLSTATE 40001 and logging it instead of retrying. The whole point of SERIALIZABLE is that the database tells you when to retry - if you don't, you've taken the cost and gotten none of the benefit.
Mixing isolation levels in one connection pool. A pool that hands out READ COMMITTED and SERIALIZABLE sessions interchangeably is a debugging nightmare. Set the default per-database or per-role, not per-session.
Holding row locks across application logic - HTTP calls, queue publishes, long computations. Locks held that long cause deadlocks under load and turn into incidents. Lock late, commit fast.
Using LOCK TABLE because SELECT FOR UPDATE felt insufficient. Table locks block every reader and writer. Almost never the right answer; an advisory lock (pg_advisory_lock) for application-level mutual exclusion is usually what you actually wanted.

Storage was Evergreen #4, planner #5, concurrency today. Next: JSONB - when the document column is the right call, when it's a documented mistake, and how the same MVCC machinery that powers snapshot isolation makes JSONB updates rewrite the whole document every time you touch one key. The isolation level you picked is the top of the iceberg. What it costs in dead tuples and bloat is the next layer down.

Issue #021 - Talos Linux: the Kubernetes-only OS that removed SSH entirely

Ilia Gusev — Tue, 09 Jun 2026 14:01:27 GMT

Two years ago, on a cluster that wasn't even mine to fix, I tracked a scheduling failure down to a hand-rolled iptables rule dated 2022 - owner long gone, comment in the rule unhelpful, traffic on the new CNI's port quietly dropped. Two of the other thirteen nodes had the same rule. Eleven didn't. Nobody knew when the cluster had turned into fourteen subtly different operating systems, but it had, one 4am fix at a time.

Talos's answer is to make that story impossible. You can't SSH into a Talos node - there's nothing listening on port 22. No sshd, no shell, no package manager, no /etc you can hand-edit. The node accepts one thing: an authenticated gRPC call from talosctl. Everything else - manual patching, ad-hoc rules, drift - gets removed at the source by removing the surfaces that enable it.

This issue is the closer for the 4-week cycle. Issue #18 moved cluster state out of Git into OCI. Issue #19 looked at the silent failure of expiring tokens. Issue #20 made image pulls disappear from cold starts. This one is about the node itself becoming an artifact you replace rather than a server you log into.

🏗️ Architectural Pattern: OS as immutable image

What Talos actually is

Strip a Linux distribution down to "the things a kubelet needs to run, and not one binary more," then refuse to let anyone add anything else. That's Talos. The whole OS lives in a single compressed image, around 80 MB. The rootfs is squashfs, mounted read-only at boot. There's no /usr/bin/bash because there's no bash. No apt or dnf because there's nothing to install. No /etc/passwd to edit because there are no users to log in as. The init system isn't even systemd - it's a Go binary called machined that's also the API server for the node.

When you boot a Talos node, three things happen in order. The kernel loads, machined starts, and machined reads a single file called machineconfig.yaml. That file is the entire configuration: which cluster to join, what control-plane endpoints exist, what the disks look like, which CNI to use, what NTP servers to trust, which kernel modules to load. One file, declarative, applied at boot. No cloud-init, no Ansible playbook, no role assignment over SSH after the fact.

The shape of that config matters. Here's the minimum needed to register a worker:

version: v1alpha1
machine:
  type: worker
  token: 
  ca:
    crt: 
  certSANs:
    - 10.0.0.10
  kubelet:
    extraArgs:
      rotate-server-certificates: "true"
  network:
    hostname: worker-01
    interfaces:
      - interface: eth0
        dhcp: true
cluster:
  controlPlane:
    endpoint: https://10.0.0.10:6443
  network:
    cni:
      name: cilium

That's the whole machine. No layer of templated cloud-init over the top, no role-based provisioning that fills in different bits depending on whether this node ended up in the gpu pool or the data pool. The config is the contract. If two nodes have the same config, they are the same node, byte for byte, after they boot.

The COSI resource model

There's a piece of Talos that doesn't get mentioned enough, and it's the part that makes the rest hang together. Talos exposes everything on the node as resources in a Kubernetes-style model called COSI (Common Operating System Interface). Network interfaces, mounted disks, kubelet status, running services - all the things you'd normally inspect with five different CLI tools - show up as one queryable resource tree. You read it with talosctl get, the same way you'd kubectl get a pod.

$ talosctl get nodeaddresses
NODE         NAMESPACE   TYPE            ID              VERSION   ADDRESSES
10.0.0.21    network     NodeAddress     default         3         ["10.0.0.21/24"]

$ talosctl get services
NODE        NAMESPACE   TYPE      ID         VERSION   RUNNING   HEALTHY
10.0.0.21   runtime     Service   apid       2         true      true
10.0.0.21   runtime     Service   kubelet    3         true      true
10.0.0.21   runtime     Service   etcd       2         true      true

It's the same architectural move Kubernetes itself made for workloads - everything is a resource, everything is observable through a uniform API, no special tooling per subsystem. Linux's whole "everything is a file" pitch is the historical version of this idea, but in practice you ended up with iproute2 for one thing and systemctl for another and cat /proc/whatever for a third. COSI puts it all under one query interface, and that interface happens to be the only way to look at the node at all.

Upgrades that don't drift because they can't

Traditional node lifecycle looks like this. Ubuntu 22.04 LTS as the base, apt update && apt upgrade on a weekly cron, kubelet from one Kubernetes repo, container runtime from another, kernel patches that mean rolling reboots one node at a time. Six months in, two of your fifty nodes end up on a slightly different containerd because some repo cached weirdly during one rollout. You don't know about it. You find out during an incident.

Talos doesn't have that loop because there's nothing to update incrementally. An upgrade is a new image. Period.

$ talosctl upgrade --nodes 10.0.0.21 \
    --image ghcr.io/siderolabs/installer:v1.8.2

What happens under the hood is straightforward: Talos has two root partitions on disk, A and B. The running OS is on A. The upgrade command writes the new image to B, flips the bootloader entry, and reboots. If the new image fails to come up healthy within a timeout, the bootloader falls back to A on the next reboot. The on-disk state for the cluster - etcd data, mounted volumes - lives on its own partition that survives the swap. The OS itself is a sealed artifact that gets replaced atomically.

This is the same A/B partition pattern ChromeOS pioneered for laptops and that Android adopted years later. Talos brings it to Kubernetes nodes. The bet is identical: if the OS is small, sealed, and replaced as one unit, there is no surface for drift. There's no concept of "this node has been patched 47 times and the next one has been patched 49 times." There's the version of the installer image you booted from. That's the node's identity.
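
As a toy model only, the slot flip and fallback described above can be sketched in a few lines. This is not Talos's implementation: the Node class, the starting version string, and the boots_healthy callback are all hypothetical stand-ins for the bootloader and health check.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy two-slot (A/B) node. Illustrative only, not Talos's real code."""
    slots: dict = field(default_factory=lambda: {"A": "v1.8.1", "B": None})
    active: str = "A"

    def upgrade(self, image, boots_healthy):
        """Write image to the inactive slot, flip to it, fall back if unhealthy."""
        previous = self.active
        target = "B" if previous == "A" else "A"
        self.slots[target] = image   # the running slot is never modified
        self.active = target         # flip the boot entry, then "reboot"
        if not boots_healthy(image):
            self.active = previous   # fallback: boot the old slot again
        return self.active
```

The property worth noticing is that a failed upgrade never touches the slot the node was running from, which is why falling back is just a pointer flip.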

Compare that to the Ubuntu-or-RHEL alternative. The system was designed to be modified after install - that's the whole point of a general-purpose distro. Package managers exist to add things. systemd-resolved gets a config flag. cron gets an entry. A junior on-call adds a tc qdisc to "fix" a latency problem at 4am. None of it is recorded anywhere except in the running kernel's state, and none of it survives a reinstall, so the reinstall is the thing you're terrified of doing.

Talos inverts that fear. The reinstall is the cheap operation. The hand-edit is the expensive one, because there is no hand-edit. You change the config and reapply; the node converges. Same config, same node. Different config, different node. The state isn't hidden in seventeen places.

This connects directly back to Issue #18. GitOps without Git wasn't really about Git - it was about taking what used to be in-band (manifests fetched from a Git remote, rendered on the cluster) and moving it to a content-addressed artifact in a registry. Talos is the same rearrangement at the node layer. What used to be in-band (configuration changes made over SSH against a long-lived server) becomes a content-addressed artifact: an installer image and a config file. The cluster pulls both, the cluster applies both. The arrow shortens.

🔥 Hot Take: SSH is a fleet-scale anti-pattern

The honest version of "for debugging"

Every team that runs Kubernetes at any scale ends up with a Slack message somewhere that reads "can you SSH into node-04 and check if /var/lib is full." It's so normal the question feels harmless, and it isn't.

What happened on node-04 between login and logout isn't recorded anywhere. The time it bit me, someone had tweaked /etc/sysctl.conf months earlier to chase a TIME_WAIT problem off a 2019 Stack Overflow answer, logged out, and never wrote it down; I lost a week to it. The audit log gives you two timestamps - in at 03:14, out at 03:41 - and nothing in between. Stretch that across forty nodes and two years of on-call and the cluster on paper quietly stops being the cluster that booted. The third time that drift was the root cause of a Sev-2, I quit calling it a flaw. It is the model, not a bug in it.

So Talos removes the entry point. Not made-it-opt-in. Not feature-flagged-off. The sshd binary isn't shipped. There's nothing to disable because there's nothing to enable, and you can't add it back without rebuilding the installer image from source. The bet underneath: no debugging case is important enough to justify owning that drift forever.

What replaces it

The objection writes itself: "Sometimes you need to look inside the node." Sure. Here's what Talos gives you instead.

Four talosctl verbs cover most of what people used SSH for. dashboard opens a curses-style live view of CPU, memory, interfaces, services, kernel messages - a read-only window onto the COSI tree, and it covers about 80% of normal-day investigations. logs streams the journal for kubelet, containerd, etcd, or machined itself, over a gRPC channel authenticated with mTLS. read opens files from a sandboxed allowlist (/proc//status, parts of /sys/class/net/*, and so on); /etc/shadow is on the deny side, but Talos doesn't ship one anyway.

talosctl pcap is the one worth pausing on. It runs a tcpdump-equivalent against any interface and streams the capture file back to your laptop. Cluster-wide packet capture without any node having a shell - the kind of thing that traditionally forced an SSH session, and now happens through the same API as everything else.

And if you really need an interactive shell on the host, kubectl debug node/... from upstream Kubernetes creates an ephemeral pod with nsenter privileges into the host namespace. That session shows up in the K8s audit log under a real identity, stays scoped to the pod, and disappears the moment you exit. An auditor gets a paper trail that raw SSH never left.

The 10% that pushes you somewhere better

What's left is a short list: a kernel panic on a node so dead the API server can't reach it, a failing disk you want to read sensors off, the case where apid itself has crashed.

Two of those still happen inside the model. A panicked node boots to a serial console in "maintenance mode" with the same API, no cluster joined, enough to hand-recover it; failing-disk sensors come through COSI block-device and SMART resources, queried like any other service. The crashed-apid case is the one with no trick to it - you reboot the node, and that's the whole recovery procedure. No Houdini act on a half-dead box, which is the price of every other node looking exactly like its config says it should.

Then there's what Talos does to a team's debugging habits. The crew that lived in SSH had a hundred small workflows leaning on it - a one-liner that listed iptables rules across the fleet, a shell script that rotated logs in some bespoke directory, an overnight cron that snapshotted the whole config tree to "catch drift." None of that survives the move, because the surfaces those scripts touched don't exist anymore, so each one has to be rebuilt around whatever it was really solving. Drift-catching turns into an assertion that machined enforces continuously, the running config measured against the declared one with nothing left to babysit. Logging stops being a cron job and becomes real observability. The host-shell reflex ends up as a distroless debug container the team checks into the repo like any other tool. Each rewrite hands the cluster something it owns outright, instead of something that only lived in the team's heads.

The migration is real friction. But the friction is bounded - you do it once, you write the right tools, and then you have a fleet where every node is provably the same as the others. The traditional setup has unbounded friction: every incident teaches the team a new way to make the cluster slightly different from itself, and the bill comes due in some 4am that nobody can fully reconstruct afterwards.

This is the same dynamic Issue #19 was about, just at a different layer. Bound SA tokens fail silently because the legacy assumption (a token lives forever once minted) was already the failure mode - the cluster had been quietly compensating for a broken expectation for years, and one upgrade exposed it. Drift is the silent-failure version of that for nodes. The cluster looks fine, the workloads run, until one day they don't and you find out node-04 has been running a different kernel for eighteen months.

🆚 Showdown: Talos vs Kairos vs Flatcar

Three immutable-OS projects, three different bets, three different ideas about how much of the traditional Linux you're willing to throw away.

Talos: K8s-only, no userland, API-driven

Talos is the most opinionated of the three. The userland is gone. Not minimized - gone. The OS is a kernel, an init binary, the kubelet, containerd, etcd if it's a control-plane node, and a small handful of supporting services. There is no shell of any kind on the running system. The interface to the node is talosctl, full stop.

Configuration is one YAML file. Upgrades are image swaps. Networking can use KubeSpan, Talos's built-in WireGuard mesh that gives every node-to-node link a wire-encrypted tunnel without you wiring up the mesh yourself. The control plane runs etcd directly on the host with sane defaults, no need to babysit it as a separate concern.

The clusters where this pays off are the ones where the team owns the whole stack and wants production to match its config byte for byte - managed fleets, edge boxes that are a four-hour drive away when they go AWOL, platform teams whose contract with their users is "the cluster works the same way every Tuesday."

Where it fails is more specific, and I've hit it more than once. A team had a legacy operator that mounted /var/log from the host and shelled out to rotate something; on Talos that whole assumption evaporates, and there's no flag to bring it back. Custom kernel modules and non-containerized compliance agents are the same problem one layer down - the escape hatches you were relying on are simply gone. For the teams Talos fits, that absence is the whole point; the ones it doesn't fit usually find out on day two.

Kairos: meta-distribution with a userland

Kairos starts from the opposite direction. You give it a container image of any base Linux distro - Ubuntu, openSUSE, Alpine, Rocky, whatever your team already knows - and it wraps that into an immutable OS using the same A/B partition pattern, with K3s or full K8s baked in. The userland of your starting distro comes along for the ride, which is the entire reason to pick it.

The first time you SSH into a Kairos box it feels like a normal Linux system, and you edit /etc the way you always did. Then you reboot and the edits are gone, because the rootfs is immutable and your changes lived in an overlay that the next image upgrade wipes. That's the whole bargain in one gesture: immutability guarantees on the storage layer, the familiar shape of a Linux box on the operational layer. Configuration is a YAML file here too, but Kairos runs cloud-init under the hood, and upgrades are container-image pulls unpacked to the inactive partition the same way Talos does it.

Migration is where I'd actually reach for it. A team moving off a traditional distro that wants to keep its runbooks, muscle memory, and SSH habits intact for a while can do exactly that. The same forgiveness covers edge boxes where one hardware-diagnostic SSH session a year is genuinely useful, and mixed workloads with a pod that talks to host-level userland nobody wants to containerize - A/B atomic upgrades without signing up for the full Talos paradigm shift.

That forgiveness comes with a bill. Kairos is more flexible than Talos, and flexibility cuts both ways: the door is open, so eventually somebody walks through it, and the drift surface is smaller than Ubuntu's but a long way from zero. The image is your base distro plus Kairos's overlay, bigger and more complex than Talos's 80 MB. I've watched people pick Kairos expecting it to tighten into Talos over time and end up frustrated - the off-ramp is the feature, not a stepping stone. Immutability you can ease into is the right call if you signed up for migration, and the wrong one if you wanted the strict regime from day one.

Flatcar: the CoreOS lineage, with auto-update

Flatcar is the most familiar of the three to anyone who ran CoreOS Container Linux back in the day - because it's the same thing, forked when Red Hat sunsetted CoreOS, kept alive by Kinvolk and now Microsoft.

It looks like a minimized traditional node, and that's deliberate. The /usr partition is read-only and there's no package manager, but /etc and /var are small and writable, SSH is right there (gated on systemd, configured through Ignition, the declarative provisioning tool Flatcar inherited from CoreOS), and a container runtime ships in the box. The userland sits between the two extremes - leaner than Ubuntu, nowhere near Talos's nothing.

What sets it apart is auto-update. Flatcar nodes phone home to a public update server (or your own mirror) on a schedule, stage new versions into the inactive partition in the background, and reboot when Locksmith or FLUO say it's time. You're not running apt update on a cron; the OS does the equivalent on its own clock, with A/B safety the whole way.

Reach for it when you want immutable-ish without giving up SSH or systemd - the "I just want a CoreOS that's still maintained" case, or a cluster run as managed cattle rather than locked down, where you trust the team not to drift the nodes and also trust them to know what to do when one breaks. It's production K8s with sensible immutability and zero appetite for a paradigm shift.

The cost is that it's still shaped like a traditional node. The drift surfaces are smaller but they're there: anyone with SSH can hand-roll an iptables rule, and Flatcar's writable paths mean that rule rides through the next update. Auto-update carries a tail of its own - every so often a release breaks something on your specific hardware, you pin a version to recover, and the moment you do you've reintroduced the "is everyone on the same version?" question that Talos's strict regime had eliminated.

The trade-off axis

Pick any axis you like - drift surface, learning curve, debuggability, operational risk, how much your old runbooks still apply. They all map onto the same gradient.

Talos is at one end: maximum paradigm shift, minimum familiarity, smallest possible attack and drift surface, requires you to rebuild your operational tooling, gives you the strongest guarantees in return. Kairos is in the middle: immutability with an off-ramp, you keep your userland, you give up some of the strictness, the tradeoff is "easier migration, slightly worse guarantees." Flatcar is at the familiar end: immutable rootfs, but the shape is still a Linux box you can SSH into, the migration is cheap, the drift surface is small but real.

There's no objectively right answer. The honest question is which side of that axis your team's incidents come from. Drift from someone fixing things over SSH at 3am? Talos's strict regime is the cure. Worst outages mostly operational unfamiliarity, runbooks that won't survive a paradigm change? Flatcar lets the runbooks live. And if it's somewhere in between with the door deliberately left open, Kairos is the answer.

For Podo Stack readers running platform teams of any size, I'd bias toward Talos for new clusters and Kairos for migrations. Flatcar is the right call when "immutable-ish CoreOS replacement" is literally what you set out to find. All three are CNCF-relevant and production-tested at scale - this isn't a "pick one carefully or you'll regret it" choice, it's "pick the one that matches the cultural change you're willing to make."

What this cycle was about

Four issues - Gitless GitOps, silently-expiring SA tokens, image preload, immutable OS. Different layers, but the same argument kept surfacing underneath each one.

It comes down to a single question: what does the cluster actually trust, and where does that trust come from? Gitless GitOps moved the answer off a Git branch anyone could rewrite and onto an OCI artifact pinned by digest, signed by a CI workflow you can verify. The bound-token postmortem was the same lesson dragged out of a real incident - the only credential worth anything is the one the kubelet rotated 30 seconds ago, and the copy your operator cached back in 2024 is just lying there on the floor. Then preload pushed the question one hop further: trust that the bytes are already on disk before you schedule, not the registry's tail latency under load. Talos is where it bottoms out. The node is the bytes in its installer image, full stop - not whatever some on-call typed into a shell at 3am.

I didn't plan for the four issues to rhyme. It only clicked around the preload draft, when I caught myself making the same move a third time - pushing the thing the cluster trusts one step closer to the thing it actually runs. By this issue there's no gap left to close, since the node is its own installer image. Funny how a month of separate topics turns out to have been one topic wearing four hats.

Next cycle picks up a different thread. See you Tuesday.

- Ilia

Postgres explain analyze: how to read what the planner picked

Ilia Gusev — Fri, 05 Jun 2026 14:01:18 GMT

It was a Tuesday when a dashboard query that had run in 80 ms all quarter started taking three seconds. Nothing in the code had changed. I pulled EXPLAIN ANALYZE and the plan had quietly rearranged itself: where it used to build one hash join, it was now running a Nested Loop and probing an index 14,821 times. The planner hadn't broken - it had been handed a bad number and reasoned correctly from it. Everything below is the walk back through that plan, one column at a time, because reading those columns side by side is how we found the single stale statistic behind the whole thing.

When seq scan wins

The first thing I had to rule out that morning was the scan sitting at the bottom of the plan, because a Seq Scan on a big table always looks like the obvious culprit. Sequential scan reads every page in the table from start to finish. Index scan walks the B-tree, then for each match jumps back to the heap to fetch the row. That second jump is random I/O, expensive enough that the planner keeps a knob for it - random_page_cost, default 4.0, against seq_page_cost = 1.0, four times the price per page. Once a query touches a big enough fraction of the rows, reading them in physical order wins outright.

On a small table it isn't even close. A 10,000-row table fits in a handful of pages, and reading all of them is cheaper than even one index lookup with its round-trip overhead. The planner picks Seq Scan here every time, and the index we added just sits there, never chosen.

The fights are usually about low selectivity. We had this exact query flagged as a missing index more than once:

EXPLAIN ANALYZE
SELECT id, email FROM users WHERE active = true;

                                  QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..1834.00 rows=80000 width=36)
                    (actual time=0.012..18.421 rows=80127 loops=1)
   Filter: active
   Rows Removed by Filter: 19873
 Planning Time: 0.142 ms
 Execution Time: 22.110 ms

active = true matches 80% of rows. An index scan there would make roughly 80,000 random heap fetches when one sequential pass reads every page in order, so the planner does the math and stays with Seq Scan. We tried forcing the index once and got a slower plan; the rows actually worth indexing were the rare active = false ones, and a partial index on those is what paid off.
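
To put rough numbers on that trade-off, here is a deliberately crude cost sketch. It is not the planner's real formula: it ignores index page reads, caching, and physical correlation, and it pessimistically charges one random heap page per matching row. The three constants are Postgres defaults; the function names and the 834-page table size are assumptions, with the page count picked so the sequential figure lines up with the 1834.00 in the plan above.

```python
SEQ_PAGE_COST = 1.0      # Postgres default
RANDOM_PAGE_COST = 4.0   # Postgres default
CPU_TUPLE_COST = 0.01    # Postgres default


def seq_scan_cost(pages, rows):
    """Read every page in order and process every row."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST


def naive_index_scan_cost(matching_rows):
    """Pessimistic model: one random heap page fetch per matching row."""
    return matching_rows * (RANDOM_PAGE_COST + CPU_TUPLE_COST)


def break_even_rows(pages, rows):
    """Matching-row count above which the sequential pass is cheaper here."""
    return seq_scan_cost(pages, rows) / (RANDOM_PAGE_COST + CPU_TUPLE_COST)
```

With the assumed 834 pages and 100,000 rows, the sequential pass comes to 1834 cost units against roughly 320,800 for the naive index path at 80,000 matches, and the crossover in this toy model sits near 457 matching rows. The real planner's crossover is more forgiving because of caching and correlation, but the direction of the result is the same.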

The one that fooled me on that Tuesday was stale statistics, since nothing in the query itself had changed. Postgres keeps row-count estimates in pg_class and value distributions in pg_stats, and when those drift the cost model drifts with them - the plan looks stupid until you run ANALYZE and it sharpens up again. We covered in Evergreen #4 why vacuum holds the line on disk; ANALYZE is the same daemon's other job, keeping the planner honest, and it's the quiet reason a plan regresses right after a bulk load.

What explain actually reports

Reading my way back to that bad number meant knowing what each node was actually saying, and every node in an EXPLAIN output carries the same shape:

Seq Scan on orders  (cost=0.00..18334.00 rows=1000000 width=72)
                    (actual time=0.014..421.337 rows=998421 loops=1)

cost=A..B is two numbers. A is the startup cost - how much work happens before the first row pops out. For Seq Scan it's basically zero. For a Sort it's the cost of the entire sort, because a sort can't return a row until it has seen all of them. B is the total cost to produce every row the node emits. The unit is arbitrary, calibrated so that reading one sequential page from disk costs 1.0, and everything else gets priced against that single anchor, from per-tuple CPU up to random page reads. I spent a while early on trying to read cost as milliseconds and it never lined up; the wall-clock numbers live in the actual time columns, and cost is only a currency for comparing plans.

rows=N is the planner's estimate of how many rows the node will produce. width=M is the estimated average row size in bytes, which feeds into memory-sizing decisions for sorts and hashes.

The actual time=X..Y rows=N loops=L part is what EXPLAIN ANALYZE adds. It runs the query (yes, really, it executes it) and records what actually happened. X is the time to first row, Y is total time, both in milliseconds. rows is the truth.

loops is the multiplier you apply by hand to get real numbers. If a node sits inside a Nested Loop and runs once per outer row, loops=5000 means the displayed actual time and rows are per-loop averages. Total wall-clock for that node is actual time × loops, not the number printed. The first time this bit me the plan read 0.02 ms and I believed it - the node had run five thousand times, so the real cost was closer to 100 ms.
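
The multiplication is trivial, but it's the step I used to skip, so here it is as a tiny sketch. The function names are mine, and the inputs are the 0.02 ms and loops=5000 from the story above.

```python
def node_total_ms(per_loop_ms: float, loops: int) -> float:
    """EXPLAIN ANALYZE prints per-loop averages; multiply by loops for the total."""
    return per_loop_ms * loops

def node_total_rows(per_loop_rows: float, loops: int) -> float:
    """Same correction for the rows column."""
    return per_loop_rows * loops

total_ms = node_total_ms(0.02, 5000)
print(f"{total_ms:.1f} ms")
```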

Add BUFFERS and you see the I/O directly:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 4711;

                                       QUERY PLAN
------------------------------------------------------------------------------------------
 Index Scan using ix_orders_user on orders  (cost=0.42..28.45 rows=12 width=128)
                                            (actual time=0.041..0.082 rows=11 loops=1)
   Index Cond: (user_id = 4711)
   Buffers: shared hit=14
 Planning Time: 0.231 ms
 Execution Time: 0.118 ms

shared hit=14 means 14 buffer-pool pages, all in cache. shared read=14 would mean 14 pages pulled off disk. Cold-cache reads run hundreds of times slower than warm hits, which is why a query that takes 2 ms on your laptop can spike to 600 ms in production right after a restart - same plan, completely different buffer state. The day of the incident, BUFFERS is what told me the slow node was hitting cache, not disk, so I could stop chasing an I/O ghost and look at the row counts instead.
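
When I'm scanning a lot of captured plans, I find it easier to turn the Buffers line into numbers. This is a minimal sketch that assumes text-format output shaped like the line above; the function names are hypothetical and it only looks at the shared hit and read counters.

```python
import re

def parse_shared_buffers(line: str) -> dict:
    """Pull the shared hit/read counts out of a text-format 'Buffers:' line."""
    counts = {"hit": 0, "read": 0}
    match = re.search(r"shared((?:\s+\w+=\d+)+)", line)
    if match:
        for key, value in re.findall(r"(\w+)=(\d+)", match.group(1)):
            if key in counts:
                counts[key] = int(value)
    return counts

def cache_hit_ratio(counts: dict):
    """Fraction of shared-buffer pages served from cache, or None if none were touched."""
    total = counts["hit"] + counts["read"]
    return counts["hit"] / total if total else None

warm = parse_shared_buffers("   Buffers: shared hit=14")
cold = parse_shared_buffers("   Buffers: shared hit=2 read=12")
print(cache_hit_ratio(warm), cache_hit_ratio(cold))
```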

Two more things worth flipping on. VERBOSE shows output columns per node, which matters when you're chasing why a join is carrying a column it doesn't need. SETTINGS dumps any non-default planner GUCs in effect for the query, which catches the case where someone left enable_seqscan = off on a session.

Reading a plan tree

By this point I was reading the plan as a tree, printed depth-first with the root at the top. Each indented child produces rows that its parent consumes, but execution runs the other way: the deepest leaves go first, hand their rows up, and the root finishes last. That backwards order is why I read plans from the bottom, and on that Tuesday the bottom is exactly where the three seconds were hiding.

Hash Join  (cost=312.50..2840.12 rows=1200 width=96)
           (actual time=8.420..142.881 rows=1187 loops=1)
  Hash Cond: (o.user_id = u.id)
  ->  Seq Scan on orders o  (cost=0.00..2210.00 rows=50000 width=72)
                            (actual time=0.011..38.420 rows=49998 loops=1)
  ->  Hash  (cost=200.00..200.00 rows=9000 width=24)
            (actual time=8.120..8.120 rows=9012 loops=1)
        Buckets: 16384  Batches: 1  Memory Usage: 512kB
        ->  Seq Scan on users u  (cost=0.00..200.00 rows=9000 width=24)
                                 (actual time=0.008..3.140 rows=9012 loops=1)

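The hash side goes first: users is scanned and hashed into memory. Then orders is streamed past the hash and matched. The join node sits on top at about 143 ms total, and that figure includes its children. The orders scan accounts for roughly 38 ms and the hash build for about 8 ms, so subtract those before blaming a node: most of what's left is the join's own probe work, not the scans.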

The join at the top of my plan was the part that had flipped overnight, and there are only three shapes it could have taken. The one I'd been handed was a Nested Loop: it walks one side and probes the other once per outer row, which is brilliant when the outer is tiny - a dozen rows out, ten pulled by index from the inside, a dozen cheap probes and done. The trap is that the planner commits to it on nothing more than its guess about the outer's size. Guess small when the outer is really huge and those dozen probes become a hundred thousand, which is precisely the shape that ate my Tuesday.

What it should have stayed as was a Hash Join - build a hash table on the smaller side, stream the bigger side through it, one pass and out, as long as that smaller side fits in work_mem and the join is on equality. The hash node even tells you whether it fit: Batches of 1 and it stayed in memory, anything higher and it spilled to disk because work_mem was too tight.
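
The build-then-probe shape is easier to see in code than in a plan. This is a toy sketch of an equality hash join over lists of dicts, not how Postgres implements it: it ignores work_mem and batching entirely, and the sample rows are made up.

```python
from collections import defaultdict

def hash_join(small, big, small_key, big_key):
    """Equality join: build a hash table on the small side, probe with the big side."""
    table = defaultdict(list)
    for row in small:                      # build phase: one pass over the small input
        table[row[small_key]].append(row)
    for row in big:                        # probe phase: one pass over the big input
        for match in table.get(row[big_key], []):
            yield {**match, **row}

# Made-up sample data for illustration.
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bo"}]
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 1},
          {"order_id": 12, "user_id": 3}]
joined = list(hash_join(users, orders, "id", "user_id"))
print(joined)
```

Each input is read exactly once, which is the whole appeal. A nested loop over the same data would go back to the inner side once per outer row.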

That left one shape I almost never saw in our plans. Merge Join zips two already-sorted inputs together, no hash to build and no random I/O, but it only earns its keep when the sort comes for free - usually because both sides arrive straight off index scans on the join keys.

So I went back to the deepest leaf, the way I always do when the costs stop making sense. Joins amplify: that leaf was returning a thousand times the rows the planner expected, and every node above it had inherited the error until the whole plan derailed. Get the bottom node honest and the ones above it usually fall back into line.

The estimate-vs-actual gap

This is the column that broke my Tuesday, and the one I read first ever since: rows. The planner's estimate against the actual count is the most diagnostic number on the page - a 10x gap makes me look twice, and a 100x gap means the plan was built for a fantasy while the real query runs a completely different shape.

Seq Scan on events  (cost=0.00..18334.00 rows=12 width=72)
                    (actual time=0.014..2840.337 rows=14821 loops=1)
   Filter: ((tenant_id = 4711) AND (kind = 'login'))

Planner thought 12 rows. Reality is 14,821. That gap will pick Nested Loop everywhere because the planner thinks the outer is microscopic, and then probe an index 14,821 times instead of building one hash. Three seconds where it should have been 80 ms.

Stale statistics were the cause we hit most. An overnight batch job had loaded a few million rows into events, and pg_stats still reflected the pre-load distribution - that's where the estimate of 12 came from. ANALYZE events; snapped it back to something honest, the planner went straight back to a hash join, and the three seconds dropped to 80 ms. Bulk loads that skip that follow-up ANALYZE are behind most of the "fast yesterday, slow today" tickets we field.

Correlated columns get us more often than I'd like to admit. The planner assumes columns vary independently and multiplies their selectivities, so WHERE country='US' AND state='CA' comes out as 5% × 2% = 0.1%, even though every CA row is already a US row and the honest number is 2%. A 20x under-estimate like that is plenty to drop a Nested Loop on the wrong side; CREATE STATISTICS with dependencies or ndistinct is what teaches the planner that the two columns travel together.
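
The arithmetic behind that under-estimate, as a small sketch. The two selectivities are the ones from the paragraph above; the function name is hypothetical and this only mimics the independence assumption, not the planner's actual code.

```python
def independent_selectivity(*selectivities: float) -> float:
    """What the planner assumes without extended statistics: multiply everything."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result

country_us = 0.05   # 5% of rows, figure from the text
state_ca = 0.02     # 2% of rows, figure from the text

planner_guess = independent_selectivity(country_us, state_ca)
true_selectivity = state_ca          # every CA row is already a US row
underestimate = true_selectivity / planner_guess
print(f"guess={planner_guess:.4f} true={true_selectivity:.4f} off by {underestimate:.0f}x")
```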

The one that burned a multi-tenant database I worked on was skew. Default stats only keep the top 100 most-common values per column, so a tenant sitting just outside that top 100, far busier than the long tail behind it, looks like average frequency to the planner. Raising the statistics target for that column with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS, or raising default_statistics_target to 500-1000 globally, widens the most-common-values list and the histogram until they catch that long tail.

Other times the value the planner needs is hidden behind a function. WHERE lower(email) = 'foo@bar.com' can't touch the stats on email at all, since there's no model for what lower() does to the distribution; you index the expression itself or add extended statistics on it, and the same blind spot turns up with date functions, jsonb extractors, anything that wraps a column.

I look for the lowest node where the rows= estimate misses actual by more than 10x. That's the one feeding bad numbers up the tree, so its statistics get fixed first, then re-plan and see what's still ugly.
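
That search is mechanical enough to script. Below is a minimal sketch that walks a plan shaped like EXPLAIN (ANALYZE, FORMAT JSON) output, using the "Plan Rows", "Actual Rows", and "Plans" keys. The sample plan is hypothetical, assembled from the events numbers above, and the sketch ignores loops for brevity.

```python
def misestimate_ratio(node: dict) -> float:
    """How far apart estimate and actual are, as a factor >= 1."""
    est = max(node["Plan Rows"], 1)      # clamp to 1 to avoid dividing by zero
    act = max(node["Actual Rows"], 1)
    return max(est / act, act / est)

def deepest_misestimate(node: dict, threshold: float = 10.0, depth: int = 0):
    """Return (depth, node) for the lowest node that is off by more than threshold."""
    best = None
    if misestimate_ratio(node) > threshold:
        best = (depth, node)
    for child in node.get("Plans", []):
        found = deepest_misestimate(child, threshold, depth + 1)
        if found and (best is None or found[0] > best[0]):
            best = found
    return best

# Hypothetical plan built from the numbers in the text.
plan = {
    "Node Type": "Nested Loop", "Plan Rows": 12, "Actual Rows": 14821,
    "Plans": [
        {"Node Type": "Seq Scan", "Relation Name": "events",
         "Plan Rows": 12, "Actual Rows": 14821},
        {"Node Type": "Index Scan", "Relation Name": "users",
         "Plan Rows": 1, "Actual Rows": 1},
    ],
}
depth, culprit = deepest_misestimate(plan)
print(depth, culprit["Node Type"], culprit["Relation Name"])
```

The root is just as wrong as the leaf here, but it inherited the error. The leaf is the one whose statistics need fixing.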

Tools and shortcuts

After that Tuesday we stopped trusting that we'd be at the keyboard when the next plan flipped. The queries that hurt misbehave when nobody's watching, and you want the plan captured at the moment it goes wrong, not reconstructed from memory the next morning.

auto_explain is the built-in contrib module that logs plans for any query slower than a threshold. Drop this in postgresql.conf:

shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '500ms'
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_nested_statements = on
auto_explain.log_format = 'json'

Any query past 500 ms gets its full EXPLAIN (ANALYZE, BUFFERS) written to the Postgres log, and JSON is the format the visualisers want. The one caveat: log_analyze = on adds timing overhead per query, usually a few percent. We shipped it anyway - the first cold-cache mystery it explained had already cost us more than the overhead ever would.

Alongside it, pg_stat_statements keeps a running tally per query fingerprint - total time, mean time, rows, buffer hits. The query I reach for first is SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;, because it ranks offenders by total impact rather than by the slowest single call, and that ranking is what tells me which queries are even worth an EXPLAIN. One more line in the config, log_min_duration_statement = 1000, logs the text of anything slower than a second, and paired with auto_explain you get the query and its plan in the same log entry.

When it comes to actually reading a captured plan, I paste it into explain.dalibo.com or explain.depesz.com. Both colour-code the tree by timing, and Dalibo paints the estimate-vs-actual gap red, which is the column I end up staring at first anyway.

Common mistakes

Most of the misreads I still catch come down to trusting a label over the numbers. An index scan isn't automatically faster than a sequential one - on a small table or a wide filter the Seq Scan really is the right call, and forcing the index only slows it down. The cost figures are just the planner's guess, so they tell you less than the gap between estimated and actual rows ever will. And EXPLAIN ANALYZE on a DELETE or UPDATE runs the statement for real, which is why those go inside BEGIN; ... ROLLBACK; for me, or lose the ANALYZE entirely when all I want is the plan.

The rest are habits I had to build the hard way. I keep BUFFERS on, because without it there's no telling warm-cache fast from cold-cache slow, and that's how a buffer-pool problem gets blamed on the query instead. I run ANALYZE after every bulk load now, since fresh data on stale stats is exactly what cost me that Tuesday. A loops count above 1 gets multiplied out before I say anything about a node's time, because that's cardinality and not a bug. And I never fully trust a bare EXPLAIN: the plan it prints is only an intention, and the real numbers can still contradict it once EXPLAIN ANALYZE actually runs the query.

Storage was Evergreen #4, the planner was today. Next up is transaction isolation - what READ COMMITTED and REPEATABLE READ actually buy you when two writers collide - and after that JSONB, where it earns its place as a column type and where it quietly becomes a mistake.

OTel collector: the observability gateway nobody scales right

Ilia Gusev — Wed, 03 Jun 2026 14:01:08 GMT

The first time our OTel collector OOMed at 4 AM, I spent twenty minutes blaming the network. The pager said dropped spans during peak. The collector's own memory gauge read fine right up until the pod died with a 137. We had one replica funneling every span, metric, and log from about fifty services to three backends, and it had been doing that for months without complaint. That was the problem. It worked on day one with five services, the performance got normalized, and nobody scaled it after we onboarded the rest of the platform.

We treated it like nginx. Deploy the chart, bump replicas if CPU gets hot, move on. The collector is not nginx, and I learned that the hard way over the next two weeks. It's a streaming pipeline carrying three different memory pressures in the same process, sharing one goroutine pool, fanning out to backends that each push back differently. You don't scale it by adding pods. We tried. It bought us four days.

This is what we found when we finally stopped restarting it and read how the thing actually moves data.

The week before the OOM

The symptoms had been there for weeks, and we'd been reading them as noise. The processor's dropped-span counter ticked up at every traffic peak. We logged that as backpressure we could live with. The send-failed counter on the exporter was noisier, but it only climbed when our tracing backend had a bad minute, so that one went on the backend's tab too. The memory graph was the reassuring one - it sawtoothed up each day and back down each night, the way a healthy process should, right until the Tuesday it didn't. The pod died, a new replica booted, and our APM graphs went blank for three minutes while it did.

What kept me restarting the pod instead of fixing it was the limiter. I checked twice that night that we had one configured, because I'd assumed a missing limiter was the whole bug. It was there. It just sat at a healthy number while the pod died around it. The heap profile I pulled at 4:40 showed why: the process had climbed 200 MB inside a single one-second check interval and OOMed before the limiter ever sampled. The gauge wasn't lying so much as I'd misread what it measured.

We'd inherited the default otelcol-contrib chart. One replica. The values file has a comment that says "scale as needed," and we never did. The collector quietly buffered more and more under load until it fell over, and the only outward sign was three-minute holes in dashboards that read like a flaky backend, not a dying collector.

The pipeline is a graph, not a config file

Once I started picturing the collector as a graph instead of a block of YAML, the failures stopped looking random. A pipeline is a tuple - some receivers, an ordered chain of processors, some exporters - and you run several of them in one process, usually one per signal type. We had exactly one, and that turned out to be the second thing wrong.

Each pipeline gets its own goroutine set for the processor chain. Receivers run their own goroutines feeding in, exporters drain from the end, and the connecting tissue is a series of in-memory queues. The night it fell over, our tracing backend slowed down for ninety seconds. The exporter queue backed up, which backed up the processor, which backed up the receiver - and because we'd shoved traces, metrics, and logs through one shared pipeline, the slow trace backend took our metrics down with it. We lost CPU dashboards during an incident because a tracing vendor had a bad minute. I still think about that one.

Memory was the part I got most wrong. The Go heap is shared across every pipeline in the process, so a spike in trace volume pushes out metric pipelines that were behaving fine. The limiter pushes back, but it does it by rejecting input at the receivers, not by isolating anything. When it says no, the receiver returns an error to the client, and that error is what the application sees as a dropped span. The limiter is a circuit breaker, not a memory bound - and I had been treating it as a bound for a year.

Here's the mechanism that bit us. The limiter checks heap on an interval, compares it against the soft limit (limit_mib minus spike_limit_mib), and refuses new data when it's over. Between checks, memory grows unobserved. Our interval was the default 1 second. Our batch processor could accumulate 200 MB of spans in about 800 milliseconds at peak. The limiter never caught it in time, the collector OOMed anyway, and the counters stayed green until the kill. The fix there was eventually to drop the interval to 250ms, but that came later.
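
The blind spot is easy to put a number on. This is back-of-envelope arithmetic only, assuming heap grows linearly at the peak rate from the story (200 MB in about 0.8 seconds); the function name is mine.

```python
def unobserved_growth_mb(growth_mb: float, over_seconds: float, check_interval_s: float) -> float:
    """Worst-case heap growth the limiter cannot see between two checks."""
    rate_mb_per_s = growth_mb / over_seconds
    return rate_mb_per_s * check_interval_s

at_1s = unobserved_growth_mb(200, 0.8, 1.0)      # the default interval we started with
at_250ms = unobserved_growth_mb(200, 0.8, 0.25)  # the interval we moved to
print(f"{at_1s:.1f} MB blind at 1s, {at_250ms:.1f} MB blind at 250ms")
```

Whatever that blind-spot figure is, the pod needs at least that much headroom above the point where the limiter starts refusing.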

The one-line bug we'd shipped to production

The batch processor exists for a good reason - most backends would rather receive 100 spans in one request than 100 requests of one span each. It collects until a size or a timeout and then ships. The cost is memory, and an unbounded batch buffer is the single most common way a collector grows itself into an OOM. Ours wasn't unbounded, but the order was wrong, and the order is load-bearing.

This is what we'd written on day one, and it had survived three reviews:

processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, memory_limiter]
      exporters: [otlp]

The batch processor buffers spans before the limiter ever sees them. The collector OOMs while the limiter's gauge reads fine - which is exactly the symptom I'd spent twenty minutes blaming the network for. We swapped the order of two processors and redeployed:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]

After that, the limiter caught the data at the front of the chain instead of behind the buffer. Heap crossed the soft limit, the receiver started refusing, the batch buffer stopped growing because nothing fed it anymore, and we watched the nightly sawtooth flatten within an hour of the deploy. I've since seen this exact inversion in three other teams' configs, and it ships past review every time because both orders parse and both work fine at five services.
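
Since review keeps missing it, a check in CI is cheap. Here is a minimal sketch of such a lint, assuming the collector config has already been loaded into a plain dict and that processors use the bare names memory_limiter and batch rather than named instances. The function name is hypothetical.

```python
def limiter_order_problems(config: dict) -> list:
    """List pipelines where memory_limiter is missing or not the first processor."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for name, pipeline in pipelines.items():
        processors = pipeline.get("processors", [])
        if "memory_limiter" not in processors:
            problems.append(f"{name}: no memory_limiter")
        elif processors[0] != "memory_limiter":
            problems.append(f"{name}: memory_limiter is not first")
    return problems

# The two processor orders from the story, as already-parsed dicts.
day_one = {"service": {"pipelines": {"traces": {
    "receivers": ["otlp"], "processors": ["batch", "memory_limiter"],
    "exporters": ["otlp"]}}}}
fixed = {"service": {"pipelines": {"traces": {
    "receivers": ["otlp"], "processors": ["memory_limiter", "batch"],
    "exporters": ["otlp"]}}}}

print(limiter_order_problems(day_one))
print(limiter_order_problems(fixed))
```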

A word on what "drop" means here, because it cost us a day of confusion. When the limiter rejects, the receiver returns a RESOURCE_EXHAUSTED gRPC error to the client, and the OTel SDK is supposed to retry with backoff. Some SDKs do this well. Ours, on one polyglot service, did not, and we lost data silently while the dashboards looked merely thin. If your tracing backend shows a sudden cliff at peak, check the receiver's refused-spans counter before you blame the network, which is the advice I wish someone had given me at 4 AM.

Rebuilding it as a gateway

Reordering processors stopped the nightly OOM, but one replica was still one replica, and tail-based sampling didn't work at all because no single instance saw a whole trace. So we rebuilt the topology. There are three patterns that show up in real deployments, and we'd been sitting on the one that doesn't scale.

The first is agent-only - one collector per node as a DaemonSet, applications send to localhost, and it batches and ships. This is the simplest model and it genuinely works well at small scale, say under 30 services and 500K spans per minute per node. We'd outgrown it without noticing. Its failure mode is that every agent talks to every backend, so you get N agents times M backends of connections, and some backends hate that. Tail sampling also can't work here, because no single collector sees the full trace.

The second is a gateway tier, which is where we landed. Agents on each node forward to a central pool, fronted by a load-balancing exporter that routes by trace ID so the same gateway instance sees every span of a trace. That last part is the whole reason tail sampling works:

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway.observability.svc.cluster.local
        port: 4317

The agents stay light, the gateway scales horizontally, and we paid for it with one more network hop and one more thing to run. The third pattern is a hybrid - agents do the cheap local work, batching and resource detection, while the gateway handles sampling and fan-out and anything needing a full trace view. Most setups past a few hundred services drift here, and we expect to as well.
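
Routing by trace ID is simple to illustrate. The sketch below is a plain hash-and-modulo mapping, which is my simplification and not the load-balancing exporter's actual algorithm; the gateway names and the trace ID are made up. It only shows the property that matters for tail sampling: every span of one trace lands on the same instance.

```python
import hashlib

def pick_gateway(trace_id: str, gateways: list) -> str:
    """Map a trace ID to one gateway so every span of the trace lands together."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return gateways[index]

gateways = ["otel-gateway-0", "otel-gateway-1", "otel-gateway-2"]  # hypothetical pod names
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"                      # example trace ID
spans = [(trace_id, f"span-{i}") for i in range(5)]
targets = {pick_gateway(t, gateways) for t, _ in spans}
print(targets)
```

Plain modulo reshuffles most traces whenever the pool size changes, so treat this as the idea only and not as something to deploy.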

Picking between them came down to two questions for us: did we need tail sampling, and how many backends were we fanning out to. We needed sampling and we had three backends, so the gateway was the obvious move. If you have one backend and make no full-trace decisions, agent-only is genuinely fine and the gateway is just overhead you'll resent operating.

What we watch now

The collector ran clean for the next quarter, and the changes that bought us that were boring. We pinned the limiter first and batch second in every pipeline. We split traces, metrics, and logs into separate pipelines so a slow tracing backend can never again take out metrics the way it did that Tuesday. We gave every exporter a sending queue and exponential backoff, because the default queue is small and a 30-second backend hiccup used to mean we dropped spans until someone restarted the pod. And we sized the gateway pool to span rate - we started at roughly one pod per 20K sustained spans per second and then load-tested, because the collector behaves nothing alike at 10K and 100K spans per second and synthetic load is cheaper than learning that at 4 AM.
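
The starting point for that sizing is one line of arithmetic. This sketch uses the 20K spans per second per pod from above as a default; it's only where we began before load testing, not a capacity guarantee, and the function name is hypothetical.

```python
import math

def starting_gateway_pods(spans_per_second: float, per_pod_capacity: float = 20_000) -> int:
    """First-guess replica count before load testing adjusts it."""
    return math.ceil(spans_per_second / per_pod_capacity)

for rate in (10_000, 45_000, 100_000):
    print(rate, starting_gateway_pods(rate))
```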

We also started scraping the collector's own telemetry, which I wish we'd done a year earlier. It exports its internal metrics through the obsreport hooks, and running it blind for that long is how the OOM crept up on us in the first place. We alert on the refused-spans counter and the exporter queue depth now, the same way we watch any other production service. The Prometheus WAL evergreen goes deeper on the same idea, an observability component that itself needs observing.

I went back through that month of dashboards afterward, and what got me was how quiet all of it had been. The memory pressure had built for weeks without tripping one alert, and the drops only ever reached us as thin graphs instead of pages. Every fix we shipped that quarter turned out to be a reordered processor or a split pipeline, never the extra replica I'd kept reaching for at 4 AM - the same config had run fine at five services and fallen over at fifty. The collector hasn't paged me since.

Issue #020 - Image Preload Operator: zero-second cold start, even for 8GB images

Ilia Gusev — Tue, 02 Jun 2026 14:01:16 GMT

Your inference pod schedules onto a fresh node. The image is 8GB. The pod sits in ContainerCreating for ninety seconds while the kubelet pulls it. Issue #15 was about why those ninety seconds are the way they are. This issue is about how to skip them.

This is the closer for the image-distribution series. Issue #3 looked at Stargz, which made cold start fast by being lazy - read what the container actually touches, ignore the other 94%. Issue #1 looked at Spegel, which turned every node into a peer and let the cluster share layers over its own network instead of hammering the registry. Both bet on a different shape of the same problem. Image Preload Operator makes a third bet, the most boring and the most effective one: have the bytes already on the node before any pod that needs them gets scheduled.

💎 Hidden Gem: Image Preload Operator

A DaemonSet that pulls images you haven't asked for yet

If you've ever run kubectl describe pod on a stuck inference workload and watched the Pulling image event sit there for over a minute, you already know the shape of the problem. The kubelet's pull is sequential, the registry is far, and your pod's startup latency is whatever number sits between the request and the first byte of the container being usable on disk.

The trick the operator pattern uses is not clever in the technical sense. It runs a DaemonSet on every node (or on a labeled subset), the DaemonSet calls into the container runtime - containerd, CRI-O, or Docker, whichever the cluster runs - and asks it to pull a configured list of images. The runtime stores those images in its local image cache, the same cache the kubelet would use anyway. When a pod for one of those images lands on the node later, the kubelet finds the image already present, sees imagePullPolicy: IfNotPresent, and skips the pull entirely. The container starts in whatever time it takes to set up cgroups and namespaces. For most workloads that's under a second.

The most popular implementation of this pattern is kube-fledged, which exposes the warm cache as a Kubernetes-native CRD called ImageCache. You write an ImageCache resource, the operator reconciles it into a Job that runs against the right nodes, the Job pulls the images, the operator tracks per-node status, and a kubectl get imagecache tells you whether every node in the pool has the bytes. There are a handful of other implementations - the OpenKruise project ships a similar primitive called NodeImage, and several teams just roll their own DaemonSet around a one-line crictl pull loop. They all have the same shape underneath.

What it's not

People keep filing it next to things it only resembles. A registry mirror is Spegel's job - the operator never sits in the pull path or proxies anything, it just kicks the kubelet's runtime into pulling early.

Stargz is the lazy-filesystem one, and the operator isn't that either. It doesn't touch how the image gets unpacked or read; the bytes land on disk exactly as they always would, and the only thing that shifts is the timing.

A baked AMI buries the image inside the node image itself, so the node has to be rebuilt whenever the image changes. The operator pulls dynamically instead. Push a new tag at noon and the next reconcile cycle lands it on every node, node image untouched.

Why the bet pays

The bet pays when image pulls are predictable. AI/ML inference is the textbook case. You run the same model server image on dozens of GPU nodes, the image is 6-12GB, and the cold start delta between "pull and run" and "already there, just run" is the difference between an autoscaler that responds in two minutes and one that responds in five seconds. The same logic holds for Spark executors and CI runner pools, or stateful databases that share a base image - anywhere the image set stays small and you know it ahead of time.

The bet doesn't pay when the image set is large and unpredictable. A multi-tenant cluster with five hundred different application images per node pool will not benefit from preloading - you'd burn the disk and most of the cached images would never be used. That's the Stargz case. Or the Spegel case, if you've got enough nodes that one will already have the image when another needs it.

The thing nobody mentions

The operator hands you a side benefit that sounds boring until you've needed it: a programmatic way to ask whether a given node has a given image. Once ImageCache.status carries per-node state, an admission policy can refuse to schedule a workload onto a node that's missing its declared image. Pre-flight checks before scaleup get easy the same way, and the Grafana panel that screams when warm-pool drift turns real basically writes itself.

Without the operator, that question lives in ssh-into-the-node-and-grep-crictl-images territory. The operator turns it into a kubectl get. Boring on a normal day - but I've reached for it at 3am more than once.

🔬 Trace: how the warm cache actually fills

The first ImageCache we wrote

Ours came out of a model rollout that kept missing its autoscaling target: new inference pods sat in ContainerCreating long enough that the request queue backed up before any of them were ready to serve. The fix was an ImageCache, and this is close to the one we started with:

apiVersion: kubefledged.io/v1alpha2
kind: ImageCache
metadata:
  name: ml-inference-models
  namespace: kube-fledged
spec:
  cacheSpec:
    - images:
        - registry.example.com/inference/llama-3-70b:v2.4.1
        - registry.example.com/inference/mistral-large:v1.8.0
      nodeSelector:
        node-role.kubernetes.io/gpu: "true"
    - images:
        - registry.example.com/runtime/triton:24.05-py3
      nodeSelector:
        node-role.kubernetes.io/gpu: "true"
  imagePullSecrets:
    - name: registry-creds

Writing it, the only real decision was the cacheSpec list, which maps image sets to node selectors. We pointed the model images at the GPU pool and kept them off everything else, so no node would burn disk on an image it was never going to run. Auth I'd braced for and it turned out to be nothing: imagePullSecrets is the same field the pod specs already used, so the private registry just worked.

Then we applied it and watched what happened. The controller picked up the new ImageCache, spun one Job per node-image pair pinned to its node, and each Job reached into that node's CRI socket and asked the runtime to pull. The part I hadn't expected was the bookkeeping. Every node wrote its result back into .status.nodes[], so checking whether the pool was warm became one query instead of an ssh-and-grep tour of the whole fleet.

Inside the Job

When I went digging into how the Job actually pulled, there was less to it than I'd assumed. The socket is the containerd one under /run/containerd on most of our nodes, or the CRI-O equivalent on the rest, and the Job's pod mounts it as a hostPath volume and shells out to crictl pull. It's the exact code path the kubelet itself takes when a pod creates demand for an image, same socket and same content store. The only thing that changes is when it runs.

The question I kept circling back to was why a Job at all, instead of an init container in the workload pod. We tried the init-container version first. It pulled at pod-creation time, which was the one moment we were trying to get ahead of - we wanted the bytes on disk before the pod existed - and every replica ended up pulling on its own, with nowhere to look to see who was warm or to hold a scaleup until the cache caught up. The operator pulls before any pod exists and keeps every node's state in a single object, and that was the whole reason we moved off the init container.

The `kubectl describe` view, with and without

Without the operator, on a fresh node:

Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Normal   Scheduled  92s   default-scheduler  Successfully assigned ml/inference-7c4 to node-gpu-04
  Normal   Pulling    91s   kubelet  Pulling image "registry.example.com/inference/llama-3-70b:v2.4.1"
  Normal   Pulled     14s   kubelet  Successfully pulled image "registry.example.com/inference/llama-3-70b:v2.4.1" in 1m17s (1m17s including waiting)
  Normal   Created    13s   kubelet  Created container inference
  Normal   Started    12s   kubelet  Started container inference

77 of those seconds sat inside Pulling. The rest of the events are microseconds next to it.

The same pod, on a node the operator had already warmed:

Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Normal   Scheduled  3s    default-scheduler  Successfully assigned ml/inference-7c4 to node-gpu-04
  Normal   Pulled     2s    kubelet  Container image "registry.example.com/inference/llama-3-70b:v2.4.1" already present on machine
  Normal   Created    2s    kubelet  Created container inference
  Normal   Started    1s    kubelet  Started container inference

No Pulling event at all. The kubelet asks the runtime, the runtime says "already there," and the pod moves on. The 60-90 seconds Issue #15 spent dissecting are simply gone.

Where this falls over

Sounds clean. It is, right up until it isn't.

Image GC. We lost a 12GB Llama image to this on a Friday afternoon, and it took an embarrassing while to work out why. The kubelet runs its own garbage collector against the runtime's image store, governed by imageGCHighThresholdPercent (default 85%) and imageGCLowThresholdPercent (default 80%). Once disk on a node crosses the high watermark, the kubelet evicts unused images until usage drops back under the low one, and "unused" here means "not referenced by any running container." A freshly preloaded image, before any pod has landed on it, is exactly that: referenced by nothing. The GC was built to reap it. Tight disk plus a preloaded image, and the bytes you just paid for are gone before the workload that needed them ever schedules.
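
The watermark logic is worth seeing as numbers. This is a simplified sketch of the two-threshold rule described above, using the default 85 and 80 percent; the 500 GiB disk is a hypothetical node, and the real kubelet also decides which images go first, which this ignores.

```python
def image_gc_bytes_to_free(capacity: int, used: int, high_pct: int = 85, low_pct: int = 80) -> int:
    """Bytes the image GC will try to reclaim; 0 if usage is under the high threshold."""
    usage_pct = used * 100 / capacity
    if usage_pct < high_pct:
        return 0
    target_used = capacity * low_pct // 100
    return max(0, used - target_used)

GIB = 1024 ** 3
disk = 500 * GIB                                   # hypothetical node disk
to_free = image_gc_bytes_to_free(disk, 455 * GIB)  # node sitting at 91% usage
print(to_free // GIB, "GiB to reclaim")
```

On those assumed numbers the GC wants tens of gigabytes back, and an unreferenced 12GB model image is an obvious candidate.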

There's no clean fix upstream. The pragmatic move is a sentinel pause-container: a tiny pause pod per cached image so the GC counts it as in-use. kube-fledged ships this out of the box, and after I watched it save a node sitting at 91% disk that would otherwise have reaped its model image, I stopped thinking about GC thresholds at all. We still lower thresholds and oversize disks, but that's insurance against the wrong contract, not the fix.

Tag mutability. A CVE patch of ours quietly never reached production for two days, and preload was the reason. Push a new image under a tag that's already cached - rebuild nginx:1.25 overnight, say - and the nodes keep serving yesterday's bits. The kubelet sees the tag already present and asks no further questions, so the "rollout" becomes a no-op nobody thought to verify. Ours reported "all nodes cached" the entire time, while every node ran the vulnerable version. Preload by digest where you can, or wire a periodic re-pull on a cadence the security team owns; the :latest-is-evil argument from Issue #1 only gets sharper here, because preload makes the staleness sticky.
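One cheap guard is to lint the preload list for references that can move underneath you. A minimal sketch, assuming the list is available as plain image reference strings; the helper names are hypothetical and the check only looks at the shape of the reference, not at the registry.

```python
import re

# A digest reference ends in "@sha256:" followed by 64 hex characters.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_digest_pinned(image_ref: str) -> bool:
    """True if the reference names immutable content rather than a movable tag."""
    return bool(_DIGEST.search(image_ref))

def unpinned(image_refs):
    """Return the references a preload list should not trust to stay fresh."""
    return [ref for ref in image_refs if not is_digest_pinned(ref)]

refs = ["nginx:1.25", "nginx@sha256:" + "a" * 64]   # placeholder digest
print(unpinned(refs))
```

Anything this returns is a candidate for either a digest pin or the periodic re-pull.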

Pull storms on rollout. The first time you apply a large ImageCache, every node pulls every listed image at roughly the same moment. A hundred nodes and a 10GB image means a one-terabyte burst landing on your registry at once. Staging is the cheap mitigation: roll the ImageCache out to a subset of nodes, watch the registry breathe, then widen it. The better one is to pair the operator with Spegel, so the first node pulls from upstream and every other node grabs the layer from a peer over the cluster network.
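The burst arithmetic and a staged rollout are easy to sketch. The wave sizes below (start at 5 nodes, double each time) are an assumption for illustration, not something the operator does for you.

```python
def burst_gb(nodes: int, image_gb: float) -> float:
    """Data the registry serves if every node pulls the image itself."""
    return nodes * image_gb

def staged_batches(nodes, first=5, factor=2):
    """Split a node list into growing waves (hypothetical sizes: 5, 10, 20, ...)."""
    batches, start, size = [], 0, first
    while start < len(nodes):
        batches.append(nodes[start:start + size])
        start += size
        size *= factor
    return batches

nodes = [f"node-{i:03d}" for i in range(100)]
print(burst_gb(len(nodes), 10))                  # 1000 GB, the one-terabyte burst
print([len(b) for b in staged_batches(nodes)])   # wave sizes for 100 nodes
```

The first wave puts 50 GB on the registry instead of 1000, which is enough to see whether it copes before the larger waves land.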

The "warm cache + P2P mirror" pattern is the hybrid most teams who run this seriously end up at. We'll come back to it in the showdown.

🆚 Showdown: Stargz vs Spegel vs Preload

Three bets, one problem

Cold start latency is one problem with three philosophically different bets pointed at it.

Stargz (Issue #3) bets on laziness: don't pull what the container never reads. The image mounts as a lazy filesystem and bytes arrive from the registry on demand, so a container starts in a second or two even on a cold node, even at multi-gigabyte sizes. The price is FUSE in the I/O path and a standing dependency on the registry for chunks you haven't fetched yet.

Spegel (Issue #1) goes after locality instead. If the registry is the bottleneck, turn the nodes into the registry: each one serves layers it already has to its peers over the cluster network. The first pull anywhere still hits upstream, and everything after that runs at LAN speed.

Preload (this issue) is the boring one. The pull still happens, it just happens before a pod wants the image - by the time the scheduler picks a node, the bytes are already sitting there. The bill is disk and the ongoing chore of keeping ImageCache honest against what's actually deployed, and in exchange the whole latency tail disappears.

Where each wins, sharply

We've leaned on all three in production at one point or another, and the dividing lines turned out sharper than the project READMEs let on.

Stargz is the right call when images are small and land on many different nodes for short jobs, the CI-runner and serverless-backend end of the spectrum. The image set there is wide and shallow, you can't predict what to preload, and lazy loading is the only thing that keeps up.

Spegel earns its place on large clusters where the same image set rotates across hundreds of nodes: multi-tenant platforms and big SaaS fleets, where you're already paying for inter-node bandwidth and the registry has quietly become the bottleneck. Once one peer has a layer, the marginal cost of the next node pulling it falls to almost nothing.

Preload, the one this issue is about, pays off in the predictable case: the same big images going to the same nodes over and over. That's our AI inference fleet and the GPU training pools, plus Spark jobs and the stateful databases we keep on dedicated nodes. The set is narrow and stable, a fast cold start is worth real money, and the pre-pull cost can run off-hours when nobody's watching.

The hybrid that actually ships

Most teams that run this at scale don't pick one. They pair Preload with Spegel.

The first time the operator pulls an image, one node in the cluster talks to the upstream registry, pulls the bytes, and caches them. Spegel indexes that node's layers and announces them. When the operator's DaemonSet on every other node starts its pull, Spegel intercepts the request, sees that a peer already has the layer, and serves it over the cluster network instead. The registry sees one pull instead of a hundred. The cluster gets warm everywhere in the time it takes to copy bytes between two nodes over a 10Gbps NIC.
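For a sense of scale, here is the back-of-envelope version in Python. It assumes an ideal link with no protocol overhead and reuses the 10GB image and hundred-node fleet from earlier in this issue; both function names are ours.

```python
def transfer_seconds(image_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Ideal time to move image_gb gigabytes over a link_gbps link."""
    gigabits = image_gb * 8
    return gigabits / (link_gbps * efficiency)

def upstream_gb(nodes: int, image_gb: float, peer_mirror: bool) -> float:
    """Data the upstream registry serves for one image across the fleet."""
    return image_gb if peer_mirror else nodes * image_gb

print(transfer_seconds(10, 10))              # seconds per peer copy at line rate
print(upstream_gb(100, 10, peer_mirror=True))
print(upstream_gb(100, 10, peer_mirror=False))
```

At line rate a 10GB image crosses a 10Gbps link in 8 seconds; real copies will be slower, which is what the efficiency parameter is there to model.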

Stargz fits in as a third layer for the long tail. Workloads that don't fit your ImageCache declaration - because they're new, or one-off, or some tenant pushed something the platform didn't know about - still start fast because Stargz makes the cold pull lazy. You've spent zero extra operational effort and you've turned the cold-start tail latency from a multi-minute outlier into a sub-second curve.

That's where the series lands. Issue #3 was the smartest single technique, Issue #1 the smartest distribution model, and this one is just the bluntest instrument in the drawer: have the bytes there already. Put all three together and the "first 60 seconds" problem from Issue #15 stops being a problem at all.

Links

Stargz Snapshotter
Spegel: cluster-local OCI registry mirror
Podo Stack: Issue #3 - Lazy Pull, Smart Scale (Stargz)
Podo Stack: Issue #1 - Spegel, Pixie, and why :latest is evil
Podo Stack: Issue #15 - a pod's first 60 seconds

Issue #21 picks up a parallel thread. If you can preload the bytes onto the node, the next question is what happens when the node itself is the image. Talos and the immutable-OS school of thought treat the whole host as an artifact you replace rather than configure. The arrow keeps shortening, and the cold start keeps moving upstream.

- Ilia

Postgres autovacuum: why your 200GB table won't release space

Ilia Gusev — Fri, 29 May 2026 14:01:21 GMT

The first time we got paged on a 200 GB table that wouldn't shrink, it was an events table sitting at around 400 million rows. Disk dashboard was yellowing, autovacuum was running clean - pg_stat_user_tables showed worker passes completing, log lines free of errors, last_autovacuum updated an hour ago. Somebody from the DBA chat asked why we didn't just run a VACUUM. We did. Nothing changed.

Every signal looked healthy. The only thing wrong was the outcome. Autovacuum was doing exactly what it's supposed to, and the table was still 200 GB - both true at the same time. Once we saw why, we stopped chasing the wrong fix. Issue #17 covered Postgres on Kubernetes at the cluster level. This is what happened inside one instance under write load, before any of that mattered.

The query that misled us first

We started with the obvious one and immediately got pointed in the wrong direction by it.

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'events';

n_dead_tup came back at 80 million on a 400-million-row table. Our first read was that vacuum was broken. It wasn't. That number isn't a measure of bloat; it's an estimate of the dead tuples still in the table, and right after a clean vacuum pass that means tuples vacuum decided it couldn't remove. The interesting question turned out to be why it couldn't.

We learned the hard way that the disk-not-shrinking part has two layered causes. One is a misconception about what VACUUM actually does to disk. The other is the visibility horizon - the thing silently holding vacuum back without showing up as an error anywhere. We peeled them in that order on the second outage and got to the real fix in an afternoon instead of a week.

Why MVCC leaves corpses behind

Postgres uses MVCC - multi-version concurrency control - which means writes never overwrite a row in place. An UPDATE writes a new physical tuple and marks the old one as superseded. A DELETE marks the existing tuple as deleted rather than removing it. The old tuple keeps sitting in the same page on disk, taking up the bytes it took yesterday.

Two hidden system columns track this on every tuple: xmin, the transaction ID that created the tuple, and xmax, the transaction ID that invalidated it via UPDATE or DELETE. A tuple with xmax = 0 is still live; once xmax is populated the tuple is dead, but only from the perspective of transactions that started after that xmax committed. When we peeked at one of our hot rows with the pageinspect extension, we saw exactly what the docs describe: the new version with one xmin and the old version still sitting there with a populated xmax, waiting on cleanup.

The reason for the design is write throughput. An UPDATE doesn't rewrite the row in place or worry about concurrent readers seeing a half-written tuple - it appends a new tuple, flips a header bit on the old one, commits, and moves on. Readers in older transactions see the old version; readers in newer transactions follow the chain to the new one. Everyone gets a consistent view at the moment their transaction started, no lock blocking a high-traffic row.

The price is that nothing has actually been freed. On a table where we were updating the same row a hundred times an hour, we'd accumulate a hundred dead tuples per row per hour, all living on the same page until vacuum got around to them. The 200 GB we'd been paged on was mostly that gap - dead tuples vacuum hadn't been allowed to remove yet.
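A toy model makes the bookkeeping concrete. This is a deliberately simplified Python sketch, not how Postgres stores anything: it treats every transaction as committed, treats XIDs as a plain ordered counter, and tracks the versions of a single logical row. The class and method names are ours.

```python
from dataclasses import dataclass

@dataclass
class TupleVersion:
    value: str
    xmin: int        # transaction that created this version
    xmax: int = 0    # transaction that superseded or deleted it; 0 = none

class ToyTable:
    def __init__(self):
        self.versions = []   # versions of one logical row, oldest first

    def insert(self, xid, value):
        self.versions.append(TupleVersion(value, xmin=xid))

    def update(self, xid, value):
        self.versions[-1].xmax = xid                          # old version superseded
        self.versions.append(TupleVersion(value, xmin=xid))   # append, never overwrite

    def visible_to(self, snapshot_xid):
        """Value seen by a snapshot that sees every transaction <= snapshot_xid."""
        for v in self.versions:
            if v.xmin <= snapshot_xid and (v.xmax == 0 or v.xmax > snapshot_xid):
                return v.value
        return None

    def dead_count(self):
        return sum(1 for v in self.versions if v.xmax != 0)

t = ToyTable()
t.insert(100, "v1")
t.update(105, "v2")
t.update(110, "v3")
print(t.visible_to(102), t.visible_to(107), t.visible_to(120), t.dead_count())
```

Two updates leave two dead versions behind, and an old snapshot can still read each of them, which is exactly why nothing gets freed on the spot.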

The discovery that VACUUM doesn't free disk

I'll admit we ran VACUUM by hand three times before anyone opened the docs - each run finished clean in the logs, and each time the disk graph sat flat. The reason turned out to be mundane: VACUUM does not hand pages back to the OS. It marks the space the dead tuples held as free, and that free space is only ever reused by later inserts and updates into the same table.

So autovacuum walks the table, removes the dead tuples it is allowed to, and records how much room each page now has in the free space map. The next INSERT grabs one of those slots instead of growing the file. When the write rate matches the rate dead tuples pile up, the table parks at one size and sits there. Ours hadn't gotten there by accident - the events table took a one-time backfill six months before the page, then dropped to a fraction of that write rate, so it was frozen at the backfill high-water mark with no fresh writes to refill the holes.

This is the right tradeoff. Truncating a file requires the dead space to be at the end of the file, and in a heavily-updated table dead tuples are scattered across every page. Actually shrinking the file means rewriting the table, which locks it. VACUUM does the cheap thing and leaves the rewrite as a separate operation you have to ask for - either VACUUM FULL, which takes an ACCESS EXCLUSIVE lock and ran for hours on our 200 GB table, or pg_repack, which does an online shadow-table swap but needs roughly 2x disk to run.

When someone in chat says "autovacuum isn't reclaiming disk", the technically-correct answer is that it was never going to. The right question is whether the table sits at a sensible steady-state size for the write pattern. For us the answer was no - the table was bloated way past steady state because vacuum hadn't been able to mark enough dead tuples reusable. Which led us to the horizon.

The visibility horizon: where the real bug was

Vacuum can only remove a dead tuple if no active transaction could still need to see it. That sounds obvious until you trace what "active transaction could still need to see it" actually means in a busy Postgres instance, which is what we spent a Tuesday afternoon doing.

Every transaction starts with a snapshot of the database at the moment it began. As long as that transaction is open, its snapshot pins the visibility of every dead tuple whose xmax is newer than the snapshot's view. Vacuum sees those tuples, checks the oldest active snapshot, and skips them - removing them would corrupt the view of a transaction still running.
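The rule can be sketched in a few lines of Python. This is a simplification under stated assumptions: XIDs are plain integers with no wraparound, each dead tuple is represented only by its xmax, and the function name is ours.

```python
def removable_dead_tuples(dead_xmaxes, active_snapshot_xmins, next_xid):
    """Split dead tuples into (removable, held_back) given the open snapshots."""
    # The horizon is the oldest xmin any open snapshot still needs.
    # With nothing open, everything up to the next XID is fair game.
    horizon = min(active_snapshot_xmins, default=next_xid)
    removable = [x for x in dead_xmaxes if x < horizon]
    held_back = [x for x in dead_xmaxes if x >= horizon]
    return removable, held_back

dead = [1000, 2000, 3000]
print(removable_dead_tuples(dead, active_snapshot_xmins=[1500, 2900], next_xid=3100))
print(removable_dead_tuples(dead, active_snapshot_xmins=[], next_xid=3100))
```

One snapshot at 1500 is enough to hold back two of the three tuples, no matter how recent every other session is. Only the oldest one counts.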

The horizon - the oldest snapshot held by anything - turned out to be the single most important thing to look at when vacuum looks fine but disk doesn't shrink:

SELECT max(age(backend_xid)) AS oldest_xid_age,
       max(age(backend_xmin)) AS oldest_xmin_age
FROM pg_stat_activity
WHERE state <> 'idle';

age() measures how many transactions have happened since the given XID. Our oldest_xmin_age came back at 47 million. Something was holding a very old snapshot, and that something was the silent culprit. We later found the same shape - oldest_xmin_age in the millions - in roughly half the autovacuum-bloat tickets our SRE team had filed that year.

The actual hold turned out to be one of four sources, and we've since seen each at least twice in production. The most obvious case is a long-running transaction left open - a reporting query running for two hours, or a migration script that opened a transaction and forgot it. Sneakier are idle-in-transaction sessions, where a connection started a transaction, ran some queries, then the app went off to do something else and never committed:

SELECT pid, state, query, now() - xact_start AS age
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY age DESC;

The transaction is open. The snapshot is pinned. Vacuum can't touch tuples newer than it. We've watched idle in transaction sessions with an age of hours in apps that aren't careful about pooling and transaction boundaries.

Replication slots got us once and were the most operationally tricky of the four. A logical or physical slot tells Postgres not to discard WAL and not to advance the horizon past the consumer's position. A replica we'd torn down months earlier had left its slot behind, and the primary had been holding WAL and pinning the horizon for it ever since:

SELECT slot_name, slot_type, active, restart_lsn, confirmed_flush_lsn,
       age(xmin) AS xmin_age
FROM pg_replication_slots;

A slot with active = false and xmin_age in the tens of millions is a dead replica quietly killing your vacuum. The fix was pg_drop_replication_slot after confirming the replica really was gone.

Prepared transactions are the rare but devastating one. Two-phase commit transactions can be prepared but not committed - they sit in pg_prepared_xacts until someone explicitly commits or rolls them back. A forgotten prepared transaction holds the horizon indefinitely. If pg_prepared_xacts has anything in it and nobody knows why, that's your answer.

None of these show up in the autovacuum logs. Vacuum runs, vacuum logs success, vacuum quietly skips tuples it can't remove, and the only visible symptom is n_dead_tup climbing.

What the fix actually looked like

Order matters here. We had to diagnose the horizon first, then decide on cleanup tactics. A team across the hall had done it the other way the previous quarter - VACUUM FULL in a maintenance window without checking the horizon - and their table was bloated again in a week because the hold-back was still in place.

Our first concrete action was the horizon check from the previous section. We killed two idle-in-transaction sessions older than our typical query time, dropped the stale slot from the torn-down replica, and confirmed pg_prepared_xacts was empty. An hour after the next autovacuum pass, n_dead_tup finally started dropping for the first time in weeks.

Once the horizon was clean, autovacuum finally had permission to do its job - but it still wasn't keeping up with how fast that table took writes. The parameter we reached for was autovacuum_vacuum_scale_factor. It ships at 0.2, so a table doesn't even become eligible for autovacuum until dead tuples cross 20% of the live count. Do that math on 400 million rows and vacuum sits on its hands until 80 million tuples are already dead. For a table that hot, we wanted it twitchier:

ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.02);

At 2% the table became eligible roughly ten times sooner. Each pass had less to clean, so instead of n_dead_tup sawtoothing up into the tens of millions and back, it stayed in a narrow band we could actually reason about. On a payments-log table we own, we've since gone down to 0.01 and added autovacuum_vacuum_threshold as a fixed-row trigger on top.
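The eligibility math is worth writing down once. The documented trigger is autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor times the table's tuple count, with the threshold defaulting to 50. The Python below is just that arithmetic; the function name is ours, and it takes the row count as given rather than reading the planner's estimate.

```python
def vacuum_trigger(reltuples: int, scale_factor: float = 0.2, threshold: int = 50) -> int:
    """Dead tuples needed before autovacuum considers the table."""
    return round(threshold + scale_factor * reltuples)

rows = 400_000_000
for sf in (0.2, 0.02, 0.01):
    print(sf, vacuum_trigger(rows, sf))
```

On 400 million rows the fixed 50-row threshold is noise. The scale factor is the whole story, which is why it is the knob that moved the needle for us.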

The rewrite came at the end, only because steady-state wasn't the size we wanted. Our events table should have been around 80 GB given the new write pattern, but it was at 200 because of the historical bloat - reusable space the table would never consume again. pg_repack was the right tool there: online, transparent, left the table at its actual minimum size after a weekend run. VACUUM FULL gets you the same outcome with a long exclusive lock - the right choice when you have a maintenance window and don't want to install an extension.

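While the plain vacuum passes were running we watched progress directly. This view covers VACUUM and autovacuum only; VACUUM FULL reports through pg_stat_progress_cluster instead, and on PostgreSQL 17 and later the last two columns below go by different names: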

SELECT pid, phase, heap_blks_scanned, heap_blks_total,
       num_dead_tuples, max_dead_tuples
FROM pg_stat_progress_vacuum;

On a different table we owned, vacuum had been stuck in vacuuming indexes for hours - a huge fragmented index, and the vacuum pass was mostly index work. That was a separate problem with its own fix: REINDEX CONCURRENTLY on the affected indexes, run during a low-traffic window.

What we keep getting wrong

The same patterns kept showing up in the autovacuum-bloat-200-GB tickets after that first incident, and most of them came back to the same misreads:

Running VACUUM FULL in a maintenance window without checking the horizon. The table shrinks for a day, then bloats again because the idle-in-transaction session keeps leaking from the same broken app. We did exactly this on a different table six months later, before we'd internalized the order.
Reading n_dead_tup as a measure of bloat. It's a measure of what vacuum can't remove right now. Real bloat estimation needs the pgstattuple extension - we now run a pgstattuple sweep weekly across the top ten tables by size.
Ignoring idle in transaction sessions because they aren't running queries. They aren't idle. They're holding a snapshot.
Leaving stale replication slots from torn-down read-replicas. Every team we've talked to has at least one. Cheapest fix for the most expensive symptom we've seen.
Never tuning autovacuum_vacuum_scale_factor for large hot tables. The 20% default is fine for tables with a few thousand rows and absurd for tables with hundreds of millions.
Tuning autovacuum_naptime instead of scale factor. Naptime controls how often the launcher wakes up. Scale factor controls when a specific table becomes eligible. Most "vacuum doesn't run often enough" complaints we've debugged turned out to be scale-factor problems in disguise.
Assuming autovacuum behaves the same on a primary and a hot standby. With hot_standby_feedback = on, long queries on the replica hold the primary's horizon back too - a separate failure mode that only appeared in our setup once we started routing read traffic to the replica.

Links

PostgreSQL docs: pgstattuple - the extension that gives you actual bloat numbers instead of n_dead_tup.
PostgreSQL docs: hot_standby_feedback - the replica-side parameter that can stall the primary's horizon.
PostgreSQL wiki: VACUUM FULL vs CLUSTER vs pg_repack - tradeoffs across the three full-rewrite options.

This is the first in a Postgres-fundamentals mini-arc. Storage was today. Next in the queue: how the planner picks the plans it picks, what transaction isolation actually buys you under concurrent writes, and when JSONB is the right column type versus a documented mistake. The 200 GB table is just where the iceberg pokes above the water.

Pod probes: the liveness check that restarts healthy apps

Ilia Gusev — Wed, 27 May 2026 14:02:13 GMT

The pod with restart count 47 was running fine. It was a payment-edge service we'd been on call for since the rewrite, and the dashboards said latency was healthy, error rate was at the usual weekday floor, throughput was on the seasonal curve.

The only thing wrong with the pod was the kubelet's view of it: by Wednesday lunch the kubelet had killed and restarted that pod 47 times in three days, and we'd missed every single restart because the next pod was up in eleven seconds and our alerts were tuned to "down for >60s" because of an unrelated noise problem the year before.

That number is why I now spend more time on probe configs than on any other YAML in our clusters. The pod was not broken. The liveness probe was. (Issue #15 covered the first 60 seconds of a pod's life; the W1 Friday evergreen covered the last 30. This is the liveness loop that runs in between.)

The first time we caught it

We caught the GC-pause version first, on the JVM payment service. The events on kubectl describe pod after the 47th restart came back with the line we'd seen a hundred times and never read closely:

Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  2m (x14 over 5m)     kubelet  Liveness probe failed:
    Get "http://10.0.2.7:8080/healthz": context deadline exceeded
  Normal   Killing    2m                   kubelet  Container app failed
    liveness probe, will be restarted

context deadline exceeded means the probe timed out before the app answered - not that the app was dead, just that it didn't say "I'm here" inside one second. Our config was the one our Helm chart had been shipping since 2022:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3

That's a thirty-second budget before the kubelet acts: three consecutive failures, ten seconds apart, one-second timeout each. Thirty seconds sounds generous, and we'd thought of it as generous for years. It isn't, when worst-case GC on the JVM in question is a two-or-three-second stop-the-world.

The G1 collector that Wednesday was doing a mixed collection on a heap that had been growing toward its target since the previous deploy. Pauses landed on three consecutive probes inside the same thirty-second window, all three timed out, the kubelet sent SIGTERM, the pod restarted, the new pod's heap immediately started growing toward the same target, and the same GC pattern lined up against the same probe schedule a few hours later. Forty-seven times across three days.
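A toy simulation shows how little it takes. The Python below is a sketch under simplifying assumptions: a probe that arrives during a stop-the-world pause is answered only when the pause ends, and the pause times are made up to line up with a ten-second probe schedule. Nothing here is kubelet code.

```python
def failed_probes(probe_times, pauses, timeout):
    """Mark each probe True (failed) if a pause blocks it for longer than timeout."""
    results = []
    for t in probe_times:
        blocked = 0.0
        for start, end in pauses:
            if start <= t < end:
                blocked = end - t     # the handler answers once the pause ends
        results.append(blocked > timeout)
    return results

def restarts(results, failure_threshold):
    """True if failure_threshold consecutive probes failed."""
    streak = 0
    for failed in results:
        streak = streak + 1 if failed else 0
        if streak >= failure_threshold:
            return True
    return False

probes = [10, 20, 30]                                  # periodSeconds: 10
pauses = [(9.0, 12.0), (19.5, 22.0), (29.0, 31.5)]     # hypothetical GC pauses
print(restarts(failed_probes(probes, pauses, timeout=1), failure_threshold=3))
print(restarts(failed_probes(probes, pauses, timeout=5), failure_threshold=3))
```

With a one-second timeout the three pauses are a restart; with five seconds the same pauses are invisible to the kubelet.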

What we were throwing away each time was a process with a warm JIT-compiled hot path - the replacement was always going to GC worse than the one we'd just killed.

The fix that day was two lines: timeoutSeconds: 5, periodSeconds: 15. The deeper fix took a quarter and a different incident before we got around to it.

The second time was worse - it killed the deployment

The second incident hit a Postgres-backed checkout service. Someone had added a SELECT 1 to the /healthz handler during a different incident the year before - a "connection check" that nobody had revisited. Postgres started a routine autovacuum on a 200GB table, query latency climbed to four seconds, every /healthz request waited four seconds, every probe timed out, every pod in a thirty-pod deployment failed liveness inside the same thirty-second window.

The kubelet has no coordination between nodes. Each kubelet, on its own pod, independently decided that liveness had failed three times and the container had to restart. Inside about a minute the entire deployment was being restarted in parallel. What came back online was thirty cold processes reconnecting to a Postgres that was already under vacuum pressure, which extended the incident by another six minutes after we'd figured out the trigger.

The fix was four lines: drop the Postgres check from /healthz entirely, move it to a separate /ready endpoint that the readiness probe (not the liveness probe) was already pointing at. Readiness failing doesn't restart anything - it just pulls the pod's IP out of Service endpoints for as long as the check is failing. A pod whose readiness is red for ninety seconds and then recovers is, from the kubelet's perspective, fine - the container was never touched.

That distinction - readiness pulls traffic, liveness restarts - is something I'd been able to recite for years before I learned what it actually meant in production.
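In handler terms the split looks roughly like this. It is a minimal Python sketch using the standard library's http.server, not our actual service code; check_database is a placeholder for whatever connection check a real service would run, and the function names are ours.

```python
from http.server import BaseHTTPRequestHandler

def liveness_status() -> int:
    """In-process only: if this code runs at all, the process is alive."""
    return 200

def readiness_status(db_ping) -> int:
    """May consult a dependency; failing here only pulls the pod from rotation."""
    try:
        return 200 if db_ping() else 503
    except Exception:
        return 503

def check_database() -> bool:
    """Placeholder for a real connection check such as SELECT 1 (hypothetical)."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            status = liveness_status()
        elif self.path == "/ready":
            status = readiness_status(check_database)
        else:
            status = 404
        self.send_response(status)
        self.end_headers()
```

Serving it takes one more line, HTTPServer(("", 8080), HealthHandler).serve_forever(), with HTTPServer imported from the same module. The point of the shape is that nothing a slow database does can change what /healthz returns.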

What we'd been telling the kubelet to do all along

After the second incident we sat down and traced what each probe was actually doing on the kubelet's side. Three probes in the API, all configured the same way in YAML, all of which we'd been treating as interchangeable health checks. They are not.

What we'd missed in our mental model was that a failed liveness probe goes through the container's full shutdown contract. The kubelet runs the preStop hook if there is one, sends SIGTERM, waits up to terminationGracePeriodSeconds, then SIGKILL. Restart count goes up by one. The pod stays on the same node and the same volumes.

If liveness keeps failing for the same reason, the kubelet keeps restarting - there's no exponential backoff for "this pod is in a probe-induced loop", just the regular crash-loop backoff after the kubelet has tried a few times. We'd been imagining the kubelet as smarter about probe loops than it actually is.

Readiness has nothing to do with the container at all, which is the part I had wrong for years. When readiness fails, the endpoint controller (the one that maintains EndpointSlice objects behind Services) removes the pod's IP from the slice. Traffic stops being routed to it. The container itself is not touched - no SIGTERM, no restart. The only trace in the pod's history is an Unhealthy warning event for the failed readiness probe.

When readiness succeeds again, the IP goes back in. A pod in our cluster can flap ready/not-ready for hours and kubectl get pods will keep saying it's Running, because it is.

Startup is the one we'd never set on any of our services until after the third incident, which I'll get to in a minute. While startup is running, the kubelet polls it on its own schedule and ignores liveness and readiness completely. The first time startup returns 200, the kubelet stops polling it forever, marks startup as done, and the other two probes take over.

If startup never succeeds within its budget (failureThreshold × periodSeconds), the container restarts. It's the kubelet's way of saying "this app boots slowly and that's allowed, but it doesn't get to boot forever".

The thing that surprised me when I finally read it carefully was that there's no built-in "is this app healthy" probe. Liveness is "does the process respond to a probe at all". Readiness is "should we route traffic to it right now". Neither answers the question most teams think they're asking, which is "is the app fine".

The third incident we never had: slow boots

The third one we caught before it shipped. We were rolling out a new Spring Boot service whose schema migrations on startup were taking close to four minutes when the database was busy. Default liveness probe, no startup probe, initialDelaySeconds: 60 because that's what the Helm chart shipped.

The pod would come up, sit out the 60-second initial delay, fail three liveness probes over the next thirty seconds (because the app wasn't answering yet), the kubelet would restart it, and the new pod would fail at the same point. Restart loop before the first deploy was even fully rolled out.

Before startup probes existed (added in 1.16), the answer was to keep cranking initialDelaySeconds higher. The shape of the problem is that boot time isn't a constant - it depends on what the node has cached, whether the image was already pulled, how busy the database is, whether JIT warmup has happened yet, whether sidecars are ready. Pick a number too low and you restart-loop on slow days. Pick a number too high and every deploy wastes the difference.

Startup probes are the proper fix. Here's what we shipped that day:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 60
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

The startup probe gets a five-minute budget (60 polls × 5 seconds) of slow polling. The moment /healthz returns 200 once, startup is done and liveness takes over with its tighter thirty-second budget. A pod that boots in 20 seconds isn't waiting an extra 40 for initialDelaySeconds to expire; a pod that boots in 250 seconds doesn't get killed mid-migration.

Startup and liveness point at the same endpoint on purpose - two endpoints means two diverging definitions of "alive" after the next refactor, and the divergence will not be caught in code review.
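The budget arithmetic, sketched in Python. The model is simplified: it assumes the first poll lands one period after the container starts and that the app answers every poll once it has booted. Function names are ours.

```python
def outcome_with_startup_probe(boot_s, period_s=5, failure_threshold=60):
    """Return ('ready', seconds) or ('killed', seconds) for a given boot time."""
    budget = period_s * failure_threshold
    for attempt in range(1, failure_threshold + 1):
        t = attempt * period_s
        if t >= boot_s:
            return ("ready", t)    # first successful poll hands over to liveness
    return ("killed", budget)

def wasted_with_initial_delay(boot_s, initial_delay_s=60):
    """Idle seconds a fast-booting pod spends waiting out a fixed delay."""
    return max(0, initial_delay_s - boot_s)

for boot in (20, 250, 320):
    print(boot, outcome_with_startup_probe(boot), wasted_with_initial_delay(boot))
```

A 20-second boot is ready at 20 seconds instead of 60, a 250-second boot survives, and only a boot past the 300-second budget gets killed.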

That config has been our default on JVM services ever since. The non-JVM ones get a shorter startup budget tuned to whatever the actual cold-boot p99 is.

When we tried exec and went back to httpGet

We used exec probes on the Redis-backed services for about two years. redis-cli ping inside the container as the readiness check, on a 5-second period, because the Redis client we were using didn't expose connection state in a way HTTP could query cheaply.

It worked fine until we landed on a node packing 90 pods of various kinds, where the kubelet's fork-exec cost from probe churn became visible in node CPU graphs - not the dominant cost, but a measurable few percent that hadn't been there before.

We switched to an in-process Redis health endpoint exposed over HTTP, with the probe doing a regular httpGet. CPU on the busy nodes dropped by about a third, which is more than the probe arithmetic alone would have predicted (the cliff was sharper than the slope of incremental pods, because at some point the kernel runs out of headroom and every fork pays for it).

We've kept exec only for the small set of cases where the check has to read something inside the container's filesystem that the app itself doesn't expose - a flag file, a CLI tool that already shipped in the image and exits 0 when healthy.

The tcpSocket handler is the one we use the least. It opens a TCP connection and closes it - if the listener accepts, the probe passes. Cheap and uninformative: I've seen a process deadlocked in a CPU-burning loop while the kernel cheerfully held the listening socket open and tcpSocket happily passed.

We use tcpSocket only on startup probes for things that don't speak HTTP yet at boot (a worker that needs to come up on its TCP port before it begins doing useful work), and even there I push the team to add a real httpGet for liveness and readiness as soon as the HTTP layer is up.

The order we run when someone hands us a restart loop

When a new team brings us a pod that's restart-looping, the order goes the same way every time. We start at kubectl describe pod and read the Events section at the bottom - the kubelet leaves a Warning event for every probe failure with the actual response or error attached, which is the difference between guessing and not guessing.

What's in that message is most of the answer. The one we've seen most often by far is context deadline exceeded, where the probe got out to the pod but the handler didn't answer inside timeoutSeconds - almost always a slow handler under load (often the same /healthz doing too much), not an actually dead process. In the last quarter we hit this five times: three turned out to be Postgres queries hiding inside health endpoints, the other two were JVM GC spikes during peak.

The other strings we sometimes see are HTTP 500, when the handler answered and chose to fail (usually because it's checking a downstream dep it shouldn't be), and connection refused, when the listener isn't up yet (usually no startup probe, app being polled before it's ready).

When the per-pod events don't make it obvious, we run kubectl get events --field-selector reason=Unhealthy -A and look at the cluster-wide picture. A deployment where every pod is failing the same way at the same time points at a shared dependency the pods are talking to, not at any one pod's process - that was the Postgres-vacuum incident I described above, and we've watched the same shape play out at least three more times since.

When the kubelet's events aren't conclusive, we exec into one of the failing pods and curl the probe endpoint from inside with our own timeout. If our curl returns 200 in 200ms while the kubelet's probe was timing out, the issue is either timeoutSeconds set too tight or something on the path between kubelet and pod that doesn't show up in app metrics - we've debugged a conntrack table fill that looked exactly like this.

If our curl reproduces the slowness, the handler is the problem and we go read the handler code.

What our review process started rejecting

The patterns we now reject in probe-config review, drawn from the incidents we've actually had:

Anything liveness-related that touches a network dependency outside the process. The second incident I described above was exactly this shape - our /healthz was talking to Postgres, Postgres got slow, the whole deployment restarted. Liveness has to live inside the process's address space. Readiness can check downstream deps if we want to gate traffic on them, because readiness failing is recoverable without a restart.
Same endpoint for liveness and readiness. They're answering different questions and serving them off the same path means we'll get restarts caused by downstream issues we never intended to restart for. The split is two extra YAML lines per Deployment and we've never regretted it.
JVM service without a startup probe. The default initialDelaySeconds was tuned for a much faster era of Java; the Spring Boot apps we ship boot in tens of seconds on a good day and minutes on a bad one, so we've standardised on startup probe with failureThreshold: 30, periodSeconds: 5 across the JVM fleet.
exec probes when httpGet would do. We don't reject these on principle, but the reviewer has to be convinced the check can't reasonably be exposed over HTTP from inside the same process. The 90-pods-per-node episode burned us once and we'd rather not repeat it.
timeoutSeconds: 1 on a handler whose p99 is above 700ms. The default is one second and we've found the default is wrong for most real services - it leaves no headroom for the kinds of slow days the probe is supposed to tolerate. We set the timeout to p99 plus a comfortable buffer, not a round number that looked nice in the original copy-paste.
Background workers with no probes at all. A queue consumer that wedges on a poisoned message stays wedged until someone notices it from the consumer-lag dashboard, which has historically been a customer complaint several hours late. A liveness probe pointed at a process-internal "am I still making progress" counter catches it inside one probe cycle, and that's what we now require on anything queue-driven.
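
Most of the list above is mechanical enough to check in code. The sketch below is one way to express it, under stated assumptions: ProbeConfig is a hypothetical flattened struct rather than a Kubernetes type, and the 300ms headroom is simply what the 1s-timeout, 700ms-p99 rule implies. It also shows the arithmetic behind the JVM startup probe: failureThreshold: 30 with periodSeconds: 5 gives the app 150 seconds to boot.

```go
package main

import "fmt"

// ProbeConfig is a hypothetical, flattened view of what a reviewer reads off
// a Deployment. The field names are illustrative, not a Kubernetes API type.
type ProbeConfig struct {
	LivenessPath    string // empty means no liveness probe at all
	ReadinessPath   string
	TimeoutSeconds  int
	HandlerP99Ms    int // measured p99 of the probe handler, in milliseconds
	IsJVM           bool
	HasStartupProbe bool
}

// minHeadroomMs is an assumed buffer between handler p99 and the probe
// timeout, derived from the 1s timeout / 700ms p99 rule (1000 - 700).
const minHeadroomMs = 300

// startupBudgetSeconds is how long a startup probe lets the app boot:
// failureThreshold attempts, spaced periodSeconds apart.
func startupBudgetSeconds(failureThreshold, periodSeconds int) int {
	return failureThreshold * periodSeconds
}

// review returns one finding per rejected pattern it can detect.
func review(c ProbeConfig) []string {
	var findings []string
	if c.LivenessPath == "" {
		findings = append(findings, "no liveness probe")
	}
	if c.LivenessPath != "" && c.LivenessPath == c.ReadinessPath {
		findings = append(findings, "liveness and readiness share an endpoint")
	}
	if c.IsJVM && !c.HasStartupProbe {
		findings = append(findings, "JVM service without a startup probe")
	}
	if c.LivenessPath != "" && c.TimeoutSeconds*1000-c.HandlerP99Ms < minHeadroomMs {
		findings = append(findings, "timeoutSeconds too close to handler p99")
	}
	return findings
}

func main() {
	fmt.Println("JVM startup budget:", startupBudgetSeconds(30, 5), "seconds")
	cfg := ProbeConfig{
		LivenessPath: "/healthz", ReadinessPath: "/healthz",
		TimeoutSeconds: 1, HandlerP99Ms: 800, IsJVM: true,
	}
	for _, f := range review(cfg) {
		fmt.Println("reject:", f)
	}
}
```

The network-dependency and exec-versus-httpGet rules are left out on purpose: those need a human reading the handler, not a struct comparison.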

Pod start was the quietest performance bug. Pod shutdown was the quietest correctness bug. The probe loop between them is both - quietly restarting healthy apps, silently failing readiness on unhealthy ones, until someone sits down with kubectl describe pod and reads what the kubelet has been saying out loud the entire time.

The sprint we spent on probe configs across the top ten deployments paid back the next quarter in fewer pages, smaller 5xx bands during incidents, and the deletion of three "auto-remediation" runbooks that turned out to be unnecessary once the probes were doing what we'd intended them to do.

- Ilia

Issue #019 - Service account tokens: the expiry that breaks your CI on weekends

Ilia Gusev — Tue, 26 May 2026 14:01:21 GMT

Pager goes off at 3:07 on a Saturday morning. The alert is CIBuildFailureRateAbove50pct. It's been firing for nine minutes by the time anyone looks at it, because the only on-call awake enough to read Slack is in Berlin and the rest of the team is in two American time zones that are still asleep. Every CI build that started in the last hour has died with 401 Unauthorized from the Kubernetes API. Nothing in the cluster has been deployed since Thursday. Nobody pushed a config change. The cluster, by every dashboard, is green.

This is the story of the next four hours, what the on-call kept missing, and why a token nobody had touched in two years finally expired.

Debug Story: the Saturday the cache won

03:07 - the page

The first thing the on-call did was the thing everyone does. kubectl get nodes. All Ready. kubectl get pods -A | grep -v Running returned nothing scary. The control plane was healthy. Etcd metrics were boring. The alerting rule was right that CI was broken, but the cluster wasn't.

Second instinct was the CI namespace. The build pods were crash-looping on a roughly twelve-second cadence and the kubelet was already past its second restart-backoff bump on most of them. Logs from the failed pods all ended the same way:

error: failed to retrieve secrets from kubernetes api:
  Unauthorized

So the failure wasn't scheduling or container start. Something inside the build pod was talking to the Kubernetes API and getting told no.

By 03:14 the on-call had a working theory: the API server was rejecting their auth, probably an RBAC change someone pushed late on Friday. They paged the SRE lead. Half-asleep, the lead asked them to dump events sorted by lastTimestamp. What came back was a wall of restart warnings about the CI pods and, aside from that, nothing useful - no RBAC denial, no admission webhook complaint pointing the on-call at anything to actually go fix.

03:42 - the false lead

By 03:42 they were in the audit logs. The cluster had API audit logging on, dumped into Loki, and the on-call typed in a query for GET requests against secrets coming from any service account in the ci namespace:

{cluster="prod-1"} | json
  | verb = "get"
  | objectRef_resource = "secrets"
  | user_username =~ "system:serviceaccount:ci:.+"

The hits came back as 401s with the message "Unauthorized: token expired".

That should have been the moment. It wasn't — and the reason it wasn't is the kind of thing you only see in hindsight.

The on-call read token expired and went to look at the Secret that holds the CI service account's token. They ran kubectl get sa builder -n ci -o yaml and saw a secrets: field pointing at a Secret called builder-token-xxxxx. They ran kubectl get secret builder-token-xxxxx -n ci -o yaml, base64-decoded the token, pasted it into jwt.io, and saw no exp claim at all. The Secret dated from 2024, so the token was eighteen months old and, with no expiry on it, it wasn't going to expire.

So the cluster was rejecting a token that, on its face, was still valid. That sent the investigation in a different wrong direction for about forty minutes. Was someone rotating the cluster's signing key mid-incident? Was a mutating webhook eating the auth header? Both were dead ends.

04:31 - the operator nobody owned

Around 04:31 the SRE lead joined the call and asked a question the on-call hadn't asked: which container inside the build pod was actually making the API call? The build job itself didn't talk to the K8s API directly. It called a helper service in the same namespace, an internal operator called ci-secret-resolver that fetched secrets from various places (Vault, AWS Secrets Manager, K8s Secrets) and exposed them to the build through a unix socket.

The on-call had never thought about ci-secret-resolver because it was old. It had been deployed by someone who left in 2023, it was a single Deployment with one replica, it didn't have an owner team, and it never broke. The Helm chart that managed it pinned the image to a digest from two years ago.

They kubectl exec-ed into the resolver pod and ran curl -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://kubernetes.default/api/v1/namespaces/ci/secrets. It came back with the secret list. The token in the file worked.

Then they looked at the resolver's own logs. The resolver was logging the Authorization header it was sending (poor hygiene, but lucky for this debug). The token it was sending was different from the one in the file. Different iat. Different exp. The one in the file had been written ten minutes ago. The one the resolver was sending had been issued at 04:11 Friday morning, expiring at 05:11 Friday morning.

The resolver had read the token file once, at pod start, a little over twenty-four hours earlier. It had cached the bytes in memory and was still sending them. The token's exp had passed at 05:11 Friday. From that point on, every call the resolver made to the API server returned 401. CI didn't notice until the next batch of builds started Saturday morning, because the only Friday builds had been before 05:11 and the cluster was quiet over Friday night.

04:47 - the fix and the realization

Killing the resolver pod resolved everything in about ninety seconds. The new pod read the current token from the file, started using it, CI started passing. The fix was that small.

The realization was bigger. The resolver had been deployed against a Kubernetes 1.21 cluster that was still handing it a legacy Secret-based token, issued without expiry as a long-lived bearer token. The cluster had been upgraded to 1.24 in 2024, and from then on the resolver was silently running on BoundServiceAccountTokenVolume: projected tokens with a one-hour TTL. The resolver kept working for a year and a half because every time the resolver Pod restarted (deploys, evictions, node rotations), it picked up a fresh token. Nothing had restarted the resolver in twenty-four hours, which was the first time that had ever happened. Stability had become a bug.

The on-call wrote it up. Three things in the postmortem:

The resolver, and probably others like it, was reading the SA token once at startup. That contract was wrong on any modern Kubernetes.
The cluster had no detection for "pods using a token older than its own expiry." That should be a dashboard.
There was no inventory of which operators in the cluster were old enough to predate bound tokens.

The third one was the scary one. They didn't know how many other resolvers were out there. Item three was the actual root cause. Items one and two were symptoms of nobody asking the question.

Trace: how kubelet, the API server, and the operator each see the token

What KEP-1205 changed

The mechanics that bit the resolver come from KEP-1205, Bound Service Account Tokens, which went beta and on by default in 1.21 and GA in 1.22. Before KEP-1205, every ServiceAccount had a Secret of type kubernetes.io/service-account-token with a JWT inside that had no exp claim. The token was eternal. The kubelet mounted that Secret into pods at /var/run/secrets/kubernetes.io/serviceaccount/token. Anyone who exfiltrated that token kept it forever.

KEP-1205 replaced that mount with a projected volume containing a serviceAccountToken source. The projected volume isn't a Secret. It's a virtual mount that the kubelet writes to directly, with a token requested from the API server's TokenRequest endpoint. The token has an aud (audience), an exp (default one hour), and an iat. The kubelet refreshes the file before expiry. The pod sees the same path it always did, but the bytes change.
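
The iat and exp claims can be read without pasting a live token into a website. The sketch below decodes the payload segment of a JWT and reports its lifetime. It deliberately does not verify the signature, so it is a debugging aid and nothing more. The claims struct, the helper names, and the fake unsigned token are all assumptions made for the example.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// claims holds the two timestamps that matter here. A zero Exp means the
// claim is absent, which is what a legacy Secret-based token looks like.
type claims struct {
	Iat int64 `json:"iat"`
	Exp int64 `json:"exp"`
}

// decodeClaims parses the middle segment of a JWT. It does NOT verify the
// signature; use it for inspection only.
func decodeClaims(token string) (claims, error) {
	var c claims
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return c, errors.New("not a three-part JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return c, fmt.Errorf("decode payload: %w", err)
	}
	if err := json.Unmarshal(payload, &c); err != nil {
		return c, fmt.Errorf("parse payload: %w", err)
	}
	return c, nil
}

// expiredAt reports whether the token is past its exp at the given Unix time.
// Tokens without an exp claim never expire by this check.
func (c claims) expiredAt(nowUnix int64) bool {
	return c.Exp != 0 && nowUnix >= c.Exp
}

// fakeToken builds an unsigned token for the demo, so no real credential is needed.
func fakeToken(iat, exp int64) string {
	enc := base64.RawURLEncoding.EncodeToString
	payload, _ := json.Marshal(claims{Iat: iat, Exp: exp})
	return enc([]byte(`{"alg":"none"}`)) + "." + enc(payload) + ".sig"
}

func main() {
	now := time.Now().Unix()
	// Issued two hours ago with a one-hour lifetime: the cached-token case.
	c, err := decodeClaims(fakeToken(now-7200, now-3600))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("lifetime=%ds expired=%v\n", c.Exp-c.Iat, c.expiredAt(now))
}
```

Run against the token in the mounted file and then against whatever the client is actually sending, a difference in iat is the same tell the resolver logs gave away.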

A modern pod spec, even one you didn't write, has this hidden in it. Run kubectl get pod some-pod -o yaml and look under spec.volumes:

- name: kube-api-access-abc12
  projected:
    defaultMode: 420
    sources:
    - serviceAccountToken:
        expirationSeconds: 3607
        path: token
    - configMap:
        items:
        - key: ca.crt
          path: ca.crt
        name: kube-root-ca.crt
    - downwardAPI:
        items:
        - fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
          path: namespace

The ServiceAccount admission controller adds this projection automatically to every pod that has a ServiceAccount, which is every pod. The expirationSeconds: 3607 is hardcoded by the controller (3600 plus a fixed seven seconds, not a per-pod jitter). You don't set it. The pod author doesn't see it. It's just there.

What the kubelet actually does

The kubelet has a goroutine per projected token that watches the token's expiry and refreshes when the token has 20% of its TTL left, or whenever the kubelet itself restarts. The refresh path is straightforward: kubelet calls TokenRequest against the API server, gets back a new JWT with a fresh exp, atomically rewrites the file in the projected volume. The path the pod mounts at startup keeps pointing at the same file, but the bytes inside that file rotate roughly hourly - anything that reads the file once and caches the bytes will quietly fall behind.

With --v=4 on the kubelet, the refresh leaves these breadcrumbs:

I0517 04:11:24.317894 1 reconciler.go:268]
  operationExecutor.MountVolume started for volume "kube-api-access-abc12"
  (UniqueName: "kubernetes.io/projected/pod-uid-1234-kube-api-access-abc12")
  pod "resolver-7c9d-xyz12" (UID: "pod-uid-1234")
I0517 05:01:12.401829 1 projected.go:241]
  ServiceAccountToken refreshed for pod resolver-7c9d-xyz12,
  new expiration 2026-05-17 06:01:12 +0000 UTC

Refresh fired on schedule at 05:01, and the file on disk was good through 06:01. The resolver, sitting in user space inside the container, hadn't reopened that file since pod start. Its in-memory copy was still the 04:11-issued token that had already expired at 05:11. From then on every call it made to the API server came back 401 - against a current token sitting unread two file descriptors away.
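
The arithmetic behind those timestamps is short enough to write down. This is a minimal sketch assuming the one-hour TTL and the "20% of TTL left" refresh rule described above. The function names and the clock helper are made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// tokenTTL is the one-hour lifetime used throughout this story.
const tokenTTL = time.Hour

// refreshDue returns when the kubelet is expected to rewrite the token file,
// using the "20% of TTL left" rule: 80% of the lifetime after issuance.
func refreshDue(issued time.Time, ttl time.Duration) time.Time {
	return issued.Add(ttl * 8 / 10)
}

// cachedCopyValid reports whether bytes read once at issuance are still
// inside the token's lifetime at now.
func cachedCopyValid(issued time.Time, ttl time.Duration, now time.Time) bool {
	return now.Before(issued.Add(ttl))
}

// clock parses an HH:MM wall-clock time on an arbitrary fixed day.
func clock(hhmm string) time.Time {
	t, err := time.Parse("15:04", hhmm)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	issued := clock("04:11") // pod start, when the resolver read the file
	fmt.Println("refresh due:            ", refreshDue(issued, tokenTTL).Format("15:04"))
	fmt.Println("cached copy ok at 05:00:", cachedCopyValid(issued, tokenTTL, clock("05:00")))
	fmt.Println("cached copy ok at 05:12:", cachedCopyValid(issued, tokenTTL, clock("05:12")))
}
```

The refresh threshold lands at 04:59, a couple of minutes before the 05:01 line in the kubelet log, and the cached copy stops being valid at 05:11 no matter how many times the file on disk has been rewritten since.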

How the API server sees it

When the resolver made a call to kubernetes.default, the API server ran the TokenAuthenticator chain - bootstrap tokens first, service-account tokens second. The ServiceAccountToken authenticator parses the JWT, checks the signature against the cluster's signing keys, and then validates exp. The relevant code path in the apiserver is pkg/serviceaccount/jwt.go, which calls claims.ExpiresAt and rejects anything in the past.

The audit log entry from the on-call's investigation, decoded, had user.username: system:anonymous, responseStatus.code: 401, and the annotation authentication.k8s.io/legacy-token-expired: "true". The system:anonymous was the giveaway. The token failed validation, so the request fell through to the anonymous authenticator, which the API server still runs by default for /healthz and a few other paths. Anonymous can't get secrets, so the response is 401. The user record on the audit entry is anonymous, not the resolver's service account. That's why a LogQL query for this failure has to filter by URL path and response code, not by the service account's username.

Why the file watch wasn't there

Client-go has had a helper for this since 2020. transport.NewCachedFileTokenSource reads the token file on each call (with a small cache to avoid hitting the filesystem every time) and produces a fresh Bearer for each request. The standard rest.InClusterConfig() uses it. Any client built on the standard helper would have been fine.

The resolver was older than the helper. It was written against client-go 0.18 and hand-rolled its auth:

token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
if err != nil { return nil, err }
return &http.Client{
    Transport: &authedTransport{token: string(token)},
}, nil

That's the whole bug, in five lines. os.ReadFile once, hold the bytes, never look again. A fix is the same five lines, with the read moved inside authedTransport.RoundTrip. Or, less invasively, swap the constructor for transport.NewCachedFileTokenSource("/var/run/secrets/kubernetes.io/serviceaccount/token") and let client-go do the right thing. External clients (CI, ArgoCD, anything outside the cluster) use the TokenRequest API instead, which mints a fresh short-lived JWT per call - no file to cache, so this whole bug class is impossible by construction there.
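
For the hand-rolled route, moving the read inside RoundTrip could look like the sketch below. The resolver's real authedTransport isn't shown in full above, so fileTokenTransport, recordingTransport, and demoRotation are hypothetical names for this example. The recording transport stands in for the network so the rotation can be demonstrated without a cluster.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"path/filepath"
	"strings"
)

// fileTokenTransport re-reads the token file on every request instead of
// caching the bytes at construction time.
type fileTokenTransport struct {
	path string
	base http.RoundTripper
}

func (t *fileTokenTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	token, err := os.ReadFile(t.path)
	if err != nil {
		return nil, fmt.Errorf("read token file: %w", err)
	}
	// A RoundTripper must not modify the caller's request, so set the header on a clone.
	clone := req.Clone(req.Context())
	clone.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))
	return t.base.RoundTrip(clone)
}

// recordingTransport stands in for the network: it records the Authorization
// header it was handed and returns an empty 200.
type recordingTransport struct{ seen []string }

func (r *recordingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	r.seen = append(r.seen, req.Header.Get("Authorization"))
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Header: http.Header{}, Request: req}, nil
}

// demoRotation simulates one kubelet rewrite between two requests and returns
// the Authorization headers that went out.
func demoRotation() ([]string, error) {
	dir, err := os.MkdirTemp("", "sa-token")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "token")

	rec := &recordingTransport{}
	client := &http.Client{Transport: &fileTokenTransport{path: path, base: rec}}
	for _, tok := range []string{"token-v1", "token-v2"} {
		if err := os.WriteFile(path, []byte(tok), 0o600); err != nil {
			return nil, err
		}
		resp, err := client.Get("http://kubernetes.invalid/api") // never dialed: rec answers
		if err != nil {
			return nil, err
		}
		resp.Body.Close()
	}
	return rec.seen, nil
}

func main() {
	seen, err := demoRotation()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, h := range seen {
		fmt.Printf("request %d sent %q\n", i+1, h)
	}
}
```

Reading the file on every request costs one small read per call. That is the trade the client-go helper avoids with its short cache, and the reason the helper is the better choice when it's available.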

Policy: detecting cached-token operators before they bite

The first move after the incident was an inventory pass. Three layers, cheap to run.

Layer one: find pods using projected SA tokens at all

Easy filter, gives you the universe of candidates. Every pod that has a ServiceAccount has one of these mounts unless automountServiceAccountToken: false is set explicitly.

kubectl get pods -A -o json | \
  jq -r '.items[]
    | select(any(.spec.volumes[]?.projected.sources[]?; has("serviceAccountToken")))
    | "\(.metadata.namespace)/\(.metadata.name)"'

In a normal cluster this is almost every pod. That's not interesting on its own. What's interesting is which of those pods are old or unmaintained.

Layer two: find pods running long enough to have refreshed

A pod that's been running longer than the token TTL has gone through at least one kubelet-driven refresh — kubelet's contract guarantees it. So the cluster gives you a free signal: long-running pod plus 401s against the API equals suspect. Pods that simply re-read the file each call don't show up here.

kubectl get pods -A -o json | jq -r '
  .items[] | select(.status.phase == "Running") |
  select((now - (.status.startTime | fromdate)) > 3600) |
  "\(.metadata.namespace)/\(.metadata.name) age=\((now - (.status.startTime | fromdate)) | floor)s"' | \
  sort -k2 -t= -n -r | head -30

When we ran this on our cluster, the head-of-list was a pod that had been Running for 437 days. Pair the long-running list with audit logs filtered for 401s attributed to system:anonymous (the anonymous fallback) - anything in both lists is suspect. The other tool worth running in the same pass is kube-no-trouble (kubent): it catches deprecated APIs rather than token handling, but workloads still calling deprecated APIs tend to overlap with the unmaintained ones this pass is looking for.

Layer three: detection at runtime

Falco has a rule pattern that catches authentication failures at the apiserver, but the better place we landed on was the apiserver's audit policy itself. We added a rule that logs requests attributed to system:anonymous at Metadata level, which is where the expired-token fallback lands, then alerted on a non-zero rate of 401s among them:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  verbs: ["*"]
  resources:
  - group: ""
    resources: ["*"]
  omitStages: ["RequestReceived"]
  # Match anonymous fallback caused by expired tokens
  users: ["system:anonymous"]
  userGroups: ["system:unauthenticated"]

Promtail or Vector reads the audit log, and the alert is the pattern system:anonymous plus a 401 against /api/v1/.+/secrets. The alert routes to the team that owns the namespace where the call originated. Attribution is hard (the user is anonymous on the audit side), but the source IP on the audit entry usually maps back to a pod CIDR you can correlate.

Remediation lives in the operator code. For in-cluster operators, swap to client-go's transport.NewCachedFileTokenSource - that's the five-line change. External clients use the TokenRequest API path mentioned earlier. Either way the change is small. Finding which operators need it is the slow part. The bar to fix it is low. The bar to find it is everything. We also wrote a Kyverno (Issue #16) admission rule that flags Pods mounting SA tokens with digest-pinned images for manual review - noisy but the right kind of noisy, surfaces five to ten genuinely old operators per cluster.

The bound-token transition is one of those Kubernetes changes that ages out of memory. The KEP was promoted to default in 2021. Every Kubernetes engineer hired since 2023 has only ever seen the new world. The bugs that remain are in code older than the change, owned by teams that turned over, running in clusters whose upgrade history nobody remembers. Auditing for them is a one-time pass that prevents a Saturday call that nobody on the current team has the context to debug.

What's next

Issue #20 stays on the theme of invisible failure modes that wake the on-call. The topic is image pull, specifically what happens when a pod's image registry quietly goes read-only mid-deploy and the kubelet's pull backoff never converges, while every other dashboard tells you the cluster is fine. Same general shape as this one: the system did exactly what it was designed to do, the assumptions baked in years ago no longer hold, and the alert that catches it is the one nobody thought to write.

- Ilia

Pod shutdown: the 30-second default that silently drops requests

Ilia Gusev — Fri, 22 May 2026 14:01:18 GMT

The first cluster I dug into this on was running clean rolling updates by every dashboard the team had, clean enough that nobody had ever investigated the thin band of 5xx that showed up on the edge during deploys. I wouldn't have either. Then a customer complained about a specific minute on a Tuesday, I traced it to a redeploy that had finished six seconds before that minute started, and I had a real example to chase. The dropped requests were happening inside the pod-termination sequence itself. The kubelet was sending SIGTERM to the container before the Service had finished removing that pod from its endpoints, which left a window where kube-proxy on at least one node was still routing fresh traffic into a process that had already started shutting down. The default terminationGracePeriodSeconds of 30 seconds doesn't help when the gap is on the front end of the drain. Issue #15 covered the cold-start side of the pod lifecycle. This one is about what I learned over the next two months reading kubelet source and rolling preStop hooks across most of our Deployments.

The rolling-update bug nobody attributes

Every team I've helped with this has had the same dashboard pattern. The deployment rolls. P99 spikes for a couple of minutes. A thin band of 5xx shows up on the edge graph. Then everything settles. Nobody pages. The release notes say "no impact" and somebody files a vague ticket about flaky deploys that sits in the backlog forever. I'd seen this graph on three different clusters before I traced what was actually happening, and on two of those I'd assumed it was a load balancer issue.

The 5xx come from inside the pod-termination sequence itself, because two events that look simultaneous on a deploy timeline are actually racing. On one side, the Service stops sending traffic to this pod after the EndpointSlice update has propagated everywhere. On the other, the kubelet sends SIGTERM to the container's PID 1 without waiting for that propagation. Whichever wins decides whether in-flight requests get drained or truncated mid-response. Kubernetes loses that race by default more often than the docs admit, and I've now seen the fix land cleanly on three production clusters with the same twelve lines of YAML.

What happens between kubectl delete and SIGKILL

When I ran kubectl delete pod and started watching what the apiserver actually did, what surprised me first was how little the kubelet talks to anything else. The apiserver sets deletionTimestamp on the pod, and kubectl starts showing it as Terminating (a display status derived from that timestamp, not a real pod phase). From there two things happen in parallel, and that parallelism is where my bug was living.

One side is the endpoint controller seeing the new deletionTimestamp, updating the EndpointSlice to remove this pod's IP, and pushing the change out. Every kube-proxy on every node picks up that update and rewrites iptables or IPVS rules. The numbers I measured on a quiet test cluster came in around 200 ms. On the busier cluster from the Tuesday incident, propagation was closer to 2 seconds with a long tail on a couple of nodes.

The other side runs entirely in parallel. The kubelet on the pod's node runs the preStop hook if there is one, waits for it to return, then sends SIGTERM to the container's PID 1. The kubelet doesn't check endpoint propagation. The kubelet doesn't even know it's happening. So with no preStop, SIGTERM lands before kube-proxy on some other node has gotten the memo, and that's where my 5xx had been coming from for years.

Once SIGTERM fires, the app is on the terminationGracePeriodSeconds clock. Default 30 seconds. The countdown starts when termination begins, so any time a preStop hook uses comes out of the same budget, and what's left is the window the app has to finish whatever's in flight. If the container hasn't exited when the clock runs out, the kubelet sends SIGKILL. No drain. The kernel reaps the process. Anything mid-write is gone. The first time I caught this hurting us, it was a half-committed WAL entry that survived SIGKILL by maybe 200 ms, leaked one more replication step before the kernel reaped the process, and put the new replica into a state that recovery couldn't reconcile against. We tracked it down four days after the rollout had finished, with stale numbers in a downstream system being the only clue.

After SIGKILL, the apiserver finishes the deletion path. The ReplicaSet controller has typically created the replacement pod already, since it stops counting a pod once deletionTimestamp is set, and the IP eventually gets recycled.

A clean timeline for a default pod:

t=0.00s   kubectl delete (or Deployment rollout)
t=0.01s   pod.deletionTimestamp set; kubectl shows Terminating
t=0.01s   kubelet: preStop runs (if any; none on a default pod)
t=0.01s   kubelet: SIGTERM to container PID 1
t=0.01s   endpoint controller: removes pod from EndpointSlice
t=0.20s   kube-proxy on most nodes: iptables/IPVS updated
t=0.50s   kube-proxy on slow nodes: still routing to this pod
t=30.00s  kubelet: SIGKILL if still running

That overlap between t=0.01 and t=0.50 is where my Tuesday-incident 5xx had been living. The SIGTERM timestamp and the kube-proxy-updated timestamp don't have a guaranteed ordering on any cluster I've touched. The kubelet doesn't talk to the endpoint controller, and nothing in the system synchronizes them.
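
Put as arithmetic, the exposed window is the slowest node's propagation delay minus whatever delay sits in front of SIGTERM. A minimal sketch, using the 500 ms slow-node figure from the timeline; the function and constant names are made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// slowNode is the slow-node propagation figure from the timeline above.
const slowNode = 500 * time.Millisecond

// exposure is how long new connections can still reach a container that has
// already received SIGTERM: the slowest node's propagation delay minus the
// delay a preStop hook puts in front of the signal.
func exposure(slowestPropagation, preStopDelay time.Duration) time.Duration {
	if preStopDelay >= slowestPropagation {
		return 0
	}
	return slowestPropagation - preStopDelay
}

func main() {
	fmt.Println("no preStop:       ", exposure(slowNode, 0))
	fmt.Println("10s preStop sleep:", exposure(slowNode, 10*time.Second))
}
```

With no preStop the whole propagation delay is exposed. Any preStop delay longer than the slowest node's propagation brings it to zero, which is why a plain sleep is enough to close the gap.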

Why endpoints removal lags SIGTERM

Once I started measuring each hop, the slowness started making sense. The endpoint controller in kube-controller-manager watches the Pod API. It picks up the deletionTimestamp and computes the new EndpointSlice membership, then writes that back to the apiserver. Two API round-trips, fast.

The apiserver fans that EndpointSlice update out to every watcher. On the cluster where I first measured this, that meant kube-proxy on every node plus the in-cluster service mesh control plane plus the external load-balancer controller. About 200 watchers all getting the update over their watch channels at once.

The piece I had underestimated for embarrassingly long was kube-proxy on each node processing that update. iptables mode rewrites rule chains, which is O(services × endpoints), and on the cluster that finally made me read the source we had around 3,000 Services. I watched iptables updates take 4+ seconds during rollouts on that one. That's the entire reason teams move to IPVS or eBPF, though IPVS isn't instant either.

Conntrack is the part I lost the most time on. Linux's connection tracker remembers the destination of every flow, so even after kube-proxy updates the iptables rules, existing TCP connections keep flowing to the old pod until they close on their own. UDP behaves the same way up to the conntrack timeout. The rule update completing in 50 ms had made me think the network had converged when it hadn't, and I spent a couple of evenings chasing the wrong layer before someone on the SRE team showed me conntrack -L. Watching one flow still routing to a Terminating pod made the mental model click.

The numbers I see on production clusters now: from deletionTimestamp to "no more new connections arriving at this pod from anywhere", typically 500ms to several seconds. SIGTERM lands instantly. Whatever drain logic the app has is running against a clock that started before the network had finished telling other nodes the pod was leaving.

The preStop lifecycle hook done right

preStop is the only lever I've found to make the kubelet wait. It runs ahead of SIGTERM, and the kubelet blocks on it before sending the signal. The grace-period timer is already running while it does, so the hook's time comes out of terminationGracePeriodSeconds. That's where I bridge the endpoint-propagation gap on every cluster I touch now.

The simplest preStop that actually fixed the Tuesday incident was a sleep:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 10"]

I deployed that across the top fifteen Deployments on the cluster the week after my investigation finished. The 5xx band on the next rollout was gone. Not narrower, gone. I refreshed the edge dashboard expecting to see something and instead saw a flat line where the band had always been.

For an HTTP server I want it slightly smarter. Flip readiness to false first, so external load balancers also see the change, then sleep, then let SIGTERM through:

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - "touch /tmp/shutting-down && sleep 15"

The readiness probe reads /tmp/shutting-down and reports unhealthy when the file appears. External LBs, the Service, and the app all converge on the same state before any connection draining starts. I've been running this pattern in production for a couple of years now and it survives most edge cases I've thrown at it.

The first time I traced this on an nginx pod was the easiest fix I've shipped. nginx -s quit was already doing the drain on its own, refusing new connections and letting in-flight ones close cleanly, so all I had to do was wire it into preStop and skip the sleep. Envoy has the same pattern with /healthcheck/fail followed by a delayed shutdown. Most HTTP frameworks ship an equivalent graceful-shutdown call you trigger from a SIGTERM handler, and I usually pair that with a small preStop sleep to cover the endpoint-removal lag before the listener actually closes.

Queue workers behave nothing like web servers in this regard. The team I helped with one queue cluster had jobs holding distributed locks, and when SIGKILL took the worker out before it released those locks, the next replica had to wait out the lease timeout before it could retry. They'd been losing half the messages on every redeploy for six months. After we measured the actual longest job, the grace period on that workload had to come up to 120 seconds before redeploys stopped losing messages. Whatever the longest job actually takes is the floor for me now, and I don't trust intuition on it anymore.

A different cluster I helped on had a Postgres pool of about 50 connections with active transactions, and we found that draining cleanly took 30+ seconds on its own. Each in-flight transaction had to commit or roll back before the connection released. We ended up at 60s grace there before the new replica's pool stopped seeing connection errors at startup.

Tuning terminationGracePeriodSeconds

I keep finding 30 seconds wrong, in both directions, on the workloads I look at.

It's been too short on basically every stateful workload I've touched since the Tuesday investigation. The queue-worker cluster from earlier ended up at 120 seconds. The Postgres-pool cluster ended up at 60. A team I helped with gRPC streams where clients legitimately took 4+ minutes to wrap up - we ended up disabling SIGKILL entirely for those pods via finalizers and an external watchdog, because no grace period was going to be the right answer for that workload.

In the other direction, on one cluster where I helped the platform team, we wrote a Kyverno mutate policy that defaults grace to 10s for anything labeled tier=web-stateless. Their rolling updates had been taking 5x as long as they needed to, and the platform team kept getting blamed for slow deploys. The 30-second default isn't a conservative choice so much as a number that predates anyone caring about deploy speed at this layer.

The number I keep ending up with, after measuring instead of guessing, is the preStop sleep plus however long the worst in-flight operation can plausibly need, with a small safety margin on top. I set it explicitly:

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]
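
The budget is simple addition, and the one hard failure is a preStop that uses the whole grace period. A sketch with hypothetical helper names; the 45-second worst case and 5-second margin are placeholder inputs, not measurements from any cluster above.

```go
package main

import (
	"errors"
	"fmt"
)

// recommendedGrace adds up the budget: preStop sleep, the worst in-flight
// operation that was actually measured, and a safety margin. All in seconds.
func recommendedGrace(preStopSleep, worstInFlight, margin int) int {
	return preStopSleep + worstInFlight + margin
}

// checkBudget rejects configs where preStop alone uses up the grace period,
// which leaves the app no time between SIGTERM and SIGKILL.
func checkBudget(grace, preStopSleep int) error {
	if preStopSleep >= grace {
		return errors.New("preStop sleep uses the whole grace period")
	}
	return nil
}

func main() {
	// Placeholder inputs: 10s sleep, 45s worst in-flight operation, 5s margin.
	grace := recommendedGrace(10, 45, 5)
	fmt.Println("terminationGracePeriodSeconds:", grace)
	fmt.Println("sleep 10 inside it:", checkBudget(grace, 10))
	fmt.Println("sleep 30 inside 30:", checkBudget(30, 30))
}
```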

One operational footnote that bit us on a Karpenter upgrade. The Eviction API used by kubectl drain and most node-upgrade tooling (Karpenter included) can cap the grace period regardless of what's in the pod spec. We had tuned grace to 120s for some queue workers, and node drains were still killing them at 30. It took me half a day to find that the eviction path was overriding us. I now check it once on any cluster where I've tuned this.

At SIGKILL there's no second chance. The kernel reaps the process. There's no postStop hook to flush anything, so whatever invariants the app holds in memory had better already be on disk. The first time we traced a Sev-2 back to this, the rolling update had finished days before the symptoms surfaced. Someone had bumped replica count, the rollout looked clean to everyone watching it, and a downstream system started returning stale data four days later because abandoned WAL entries had finally caught up with us in the analytics pipeline.

The mistakes I keep seeing on new clusters

The one I see the most on a new cluster is also the simplest: no preStop hook at all. Default config, default grace period, and 5xx errors during deploys that the team has stopped seeing because they've always been there. This is the baseline I now assume on any cluster I haven't worked on before.

A pair of related mistakes travels together. In the first, the app catches SIGTERM, logs "shutting down", and immediately calls os.Exit(0), so the signal handler bypasses the drain instead of triggering it. In the second, the readiness probe never flips: preStop runs and the app starts shutting down, but the probe still reports "ready" because nothing told it otherwise, and external LBs keep sending traffic right up to SIGKILL. I've caught both by reading source code. Neither one has ever shown up in a dashboard for me.
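
For contrast, here is roughly what the non-broken shape looks like. It's a minimal sketch in Python rather than the Go the os.Exit example implies, and the Drainer class and its method names are hypothetical. The signal handler only flips a flag. The readiness endpoint reads that flag, new work is refused, and the main thread waits for in-flight requests before exiting.

```python
import threading


class Drainer:
    """Tracks in-flight work and whether the process should still report ready."""

    def __init__(self) -> None:
        self.stopping = threading.Event()
        self._cond = threading.Condition()
        self._in_flight = 0

    @property
    def ready(self) -> bool:
        # What the readiness probe handler should return.
        return not self.stopping.is_set()

    def on_sigterm(self, signum=None, frame=None) -> None:
        # Keep the handler tiny: flip the flag, do not exit here.
        self.stopping.set()

    def begin(self) -> bool:
        # Refuse new work once shutdown has started.
        with self._cond:
            if self.stopping.is_set():
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        # True if all in-flight work finished before the timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)
```

Wiring it up would be signal.signal(signal.SIGTERM, drainer.on_sigterm) at startup, with the main thread calling drainer.drain(...) once stopping is set, using a timeout shorter than what's left of the grace period.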

The grace-period-shorter-than-drain failure mode is the queue-worker case from earlier, and I've now seen it on three different teams. None of them caught it from monitoring. It always surfaces because somebody manually traces a missing message and the timestamps line up with a redeploy. For each team, the right grace period had to come from measuring the actual longest job, because their intuition about it was off by close to 90 seconds every time.

The preStop-sleep-eats-the-grace-period mistake is sneaky. I spent two hours convinced a team's drain logic was broken before I noticed sleep 30 in their preStop sitting right next to terminationGracePeriodSeconds: 30. SIGKILL was firing the instant preStop returned.

And the one that means none of the above ever gets caught at admission: no policy guardrail. There's no default admission rule enforcing a minimum grace period for stateful workloads, and pods without a preStop hook quietly pass through unflagged. Every new team I've worked with rediscovers the problem the hard way during their first production incident. We eventually wrote Kyverno policies after the third team rediscovered it, and the new-team incidents went away the next month.

The deploys where I'd added preStop: sleep 10 and an honest grace period across the top ten Deployments on the original Tuesday-incident cluster didn't make any noise on rollout day. The 502 band on the next rollout just wasn't there. Nobody filed a ticket. The bug had only ever surfaced as the vague flaky deploys ticket sitting in the backlog forever, so the absence of new tickets was the only thing telling me the fix had landed.

kube-proxy modes: iptables vs IPVS, and why "proxy" is misleading

Ilia Gusev — Wed, 20 May 2026 14:02:22 GMT

The Tuesday morning I rolled out IPVS to a 4,000-Service cluster, I had to roll it back by lunch. The plan looked clean on the change-management ticket: flip kube-proxy from iptables to ipvs mode on every node, watch the dashboards, take the win. The dashboards showed the win immediately. p99 latency on our busiest namespace dropped about 18% within ten minutes and the team Slack picked up the first round of victory emoji.

Then around 10:40 the first ping came in: a small internal Service had stopped getting traffic. Then a second. The third one was a payment service and I stopped reading Slack, opened the rollback playbook, and started typing.

By the time the cluster was back on iptables, about thirty Services had quietly broken during the IPVS window. They had one thing in common, which I didn't know that morning but spent the next two days learning. This post is the one I wish someone had handed me on Monday.

kube-proxy doesn't actually proxy

kube-proxy doesn't actually proxy. I'd been running Kubernetes clusters for several years before that fact lodged in my head as something operational rather than trivia. The name is a fossil from 2014, when the original implementation really did sit in userspace and shovel bytes between sockets. That mode (--proxy-mode=userspace) lost its place as the default to iptables in 1.2, back in 2016, and was removed entirely in 1.26. What survived is the daemon's name, but the job changed completely underneath it.

Today kube-proxy is a rule generator. It watches the API server for Services and EndpointSlices, then programs the kernel - iptables chains, IPVS tables, or now nftables sets - and steps out of the data path. Your packets never touch the kube-proxy process. They hit nf_tables or ip_vs directly and get rewritten in microseconds. kube-proxy just decides what the rules say.

That had been an abstract piece of knowledge for years - something I'd happily explain on a whiteboard in onboarding sessions. The Tuesday morning the Service stopped routing was the first time it cost me a couple of hours. The kube-proxy logs said nothing useful. The packet was breaking somewhere in nf_tables and the daemon that had configured those rules was already three reconcile cycles behind by the time I noticed.

The packet path under iptables mode

That afternoon I had iptables-save running on one screen and the Linux kernel netfilter docs open on the other. When you create a ClusterIP Service, kube-proxy in iptables mode writes a stack of netfilter rules into three families of custom chains: KUBE-SERVICES, KUBE-SVC-XXX, and KUBE-SEP-YYY (SEP = Service EndPoint).

I literally drew the walk on paper that afternoon because I needed it slow. A packet leaves a pod with dst=10.96.0.42 (the ClusterIP), hits PREROUTING, jumps to KUBE-SERVICES. That chain is a flat list, walked top to bottom matching on (clusterIP, port, protocol). On a match, the kernel jumps to the per-Service chain KUBE-SVC-XXX, which holds one rule per backend pod.

The piece I kept getting confused about was the probability math. Each rule inside KUBE-SVC-XXX is gated by a statistic mode random probability clause where the first rule fires with probability 1/N, the second with 1/(N-1), and so on. Once one fires, the kernel jumps to that endpoint's chain KUBE-SEP-YYY, which performs the DNAT (rewrites dst from the ClusterIP to the pod IP) and returns. After that the packet has a real pod address and routes out the node's CNI interface.
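
The probability math is easier to trust once it's multiplied out. A minimal sketch, with endpoint_probabilities as a made-up helper name: rule i only gets evaluated if every earlier rule declined, so its overall chance is the chance of reaching it times its own 1/(N-i).

```python
from fractions import Fraction


def endpoint_probabilities(n: int) -> list:
    """Overall chance that each of n cascading rules is the one that fires."""
    reached = Fraction(1)  # probability the packet gets as far as this rule
    overall = []
    for i in range(n):
        rule_p = Fraction(1, n - i)  # 1/N, 1/(N-1), ..., 1/1
        overall.append(reached * rule_p)
        reached *= 1 - rule_p
    return overall


print(endpoint_probabilities(3))
```

Every endpoint comes out at exactly 1/N, and the last rule has probability 1, which is why it needs no statistic clause at all.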

You can see all of this with iptables-save:

$ sudo iptables-save -t nat | grep KUBE-SVC- | head -3
:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
:KUBE-SVC-JD5MR3NA4I4DYORP - [0:0]
:KUBE-SVC-TCOU7JCQXEZGVUNU - [0:0]

Each KUBE-SVC-XXX is one Service. Each KUBE-SEP-YYY is one endpoint. Our cluster - about 4,000 Services with maybe six pods per Service on average - was carrying around 4,000 KUBE-SVC chains plus 24,000 KUBE-SEP chains, plus the master KUBE-SERVICES list with 4,000 entries in it. Every packet that hit a ClusterIP walked at least one chain linearly. By the time I was looking at it, that linear walk was costing us about 6% of a core per node - not a crisis, but enough that someone had pinned a graph of it to the team's whiteboard.

The return path is where it got interesting for me. When the SYN goes out, netfilter creates a connection-tracking entry: (src_pod_ip, src_port, dst_pod_ip, dst_port) plus the original ClusterIP it was rewritten from. When the reply comes back from the real pod, conntrack matches it and rewrites src back to the ClusterIP so the originating pod sees a coherent conversation. Without conntrack the asymmetric NAT just breaks TCP. None of this works if conntrack falls over. Hold onto that - it comes back later.

Why the migration broke

The other thing I learned that afternoon was rule reload. Every time an endpoint changes - a pod added, deleted, gone NotReady - kube-proxy regenerates its rule set and pushes the new set into the kernel. Before the iptables-restore optimisations in 1.20 and the incremental-sync work that followed, this was a serial rewrite of the entire ruleset. On a quiet cluster you don't notice. On our cluster doing a 50-replica rolling update during the IPVS migration window, kube-proxy was spending about fifteen seconds per node shuffling rules.

That was fifteen seconds where new connections were seeing stale endpoints. Some packets routed to pods that no longer existed and bounced. Some routed correctly. The unlucky ones got TCP resets, the really unlucky ones got partial responses, and a few got the kind of half-state where the client thought it had a connection and the backend had never heard of it.

That's what had broken those thirty Services. They weren't all broken. They were intermittently broken during the moment kube-proxy was catching up, and the percentage of intermittent failures was small enough that the cluster-wide error rate barely moved on our dashboards - but for any individual user it was sometimes 100%, depending on which retries landed inside the reload window.

IPVS mode mechanics

IPVS (IP Virtual Server) is the load-balancer that lives inside the Linux kernel - the same one that powers LVS, the load balancer that ran a lot of internet infrastructure long before Kubernetes existed. Its data structure is a hash table, not a chain, so lookup is constant-time regardless of how many Services you have. The constant-time lookup was the reason I'd planned the migration in the first place.
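
The difference between the two data structures fits in a toy model. This is a sketch of the lookup shapes only, not of what the kernel does: walk_chain and build_rules are hypothetical helpers, and the rule list stands in for a flat KUBE-SERVICES-style chain.

```python
def build_rules(n: int) -> list:
    """n fake (clusterIP, port, protocol) -> chain-name rules."""
    return [((f"10.96.{i // 256}.{i % 256}", 80, "tcp"), f"KUBE-SVC-{i}") for i in range(n)]


def walk_chain(rules: list, key: tuple):
    """Linear walk: returns (target, number of rules compared)."""
    for steps, (match, target) in enumerate(rules, start=1):
        if match == key:
            return target, steps
    return None, len(rules)


rules = build_rules(4000)
table = dict(rules)   # hash-table view of the same mapping
key = rules[-1][0]    # worst case for the chain: the last Service

print(walk_chain(rules, key))  # thousands of comparisons
print(table[key])              # one hash lookup, however many Services exist
```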

When you flip kube-proxy to --proxy-mode=ipvs, two things change. For each Service, kube-proxy creates an IPVS virtual service keyed on (ClusterIP, port, protocol). For each endpoint of that Service, it adds a real-server entry. The packet path becomes: pod sends to ClusterIP, kernel does the IPVS lookup, picks a backend by the configured scheduler, DNATs out. You inspect this with ipvsadm:

$ sudo ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.96.0.42:80 rr
  -> 10.244.1.7:8080              Masq    1      4          1
  -> 10.244.2.11:8080             Masq    1      3          0
  -> 10.244.3.4:8080              Masq    1      2          2

The rr in that output is the scheduling algorithm. kube-proxy lets you pick from a handful - round-robin, weighted variants, least-connection, source/destination hashing - but on a normal day only two are worth thinking about in production. lc (least-connection) smooths out load when your endpoints have different capacity, which is the common case if you're running heterogeneous node types. sh (source hashing) gives you crude session affinity without needing cookies or Ingress configuration. Default rr is fine for everything else, and the weighted variants exist for specific cases you'll know when you hit them.
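
To make the three schedulers concrete, here is a minimal sketch of their selection logic using the backends from the ipvsadm output above. It is simplified on purpose: the real lc scheduler also weighs inactive connections, and the real sh scheduler uses its own hash, not CRC32. The function names are mine.

```python
import itertools
import zlib


def round_robin(backends: list):
    """rr: hand out backends in order, forever."""
    return itertools.cycle(backends)


def least_connection(active: dict) -> str:
    """lc (simplified): pick the backend with the fewest active connections."""
    return min(active, key=active.get)


def source_hash(backends: list, client_ip: str) -> str:
    """sh (simplified): the same client IP always lands on the same backend."""
    return backends[zlib.crc32(client_ip.encode()) % len(backends)]


backends = ["10.244.1.7:8080", "10.244.2.11:8080", "10.244.3.4:8080"]
active = {"10.244.1.7:8080": 4, "10.244.2.11:8080": 3, "10.244.3.4:8080": 2}

print(least_connection(active))
print(source_hash(backends, "10.244.5.9"))
```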

What kept biting me on the migration is something I'd skimmed past in the docs and only really learned at 11:30 that morning: IPVS doesn't replace iptables completely. kube-proxy still leans on netfilter for several auxiliary jobs that IPVS doesn't cover - egress source NAT, NodePort packet marking, the KUBE-FIREWALL chain dropping invalid packets - and keeps these rules compact with ipset (one set of ClusterIPs, one of NodePorts). iptables-save | grep KUBE on an IPVS-mode node returns fewer chains than iptables mode would have, but never zero.

In practice that meant my iptables -L debugging muscle memory became wrong overnight. The chains I was used to inspecting weren't there in any useful form. The data plane was in ipvsadm -L -n and ip route show table local, and I spent forty minutes that morning grepping iptables chains that no longer existed before someone on Discord nudged me toward ipvsadm.

Choosing between modes

For most clusters the question doesn't really come up. A 100-Service cluster runs the same on either mode and the difference is hard to measure. Where it started mattering for us - and probably for any team that runs Kubernetes at production scale long enough - was at the intersection of scale and churn.

Most production clusters in the world run iptables mode, and run it fine. A few hundred Services, modest endpoint churn, no exotic scheduler needs - that shape of cluster doesn't care which mode it's on. The mode has been the default since 2016, the tooling has been polished for nearly a decade, and any engineer who's run Kubernetes for more than six months can debug it cold from iptables-save output. The operational story is just simpler at that scale.

IPVS started making sense for us when scale and churn broke iptables' simplicity, not before. Tigera's benchmarks from a few years back showed rule-reload time growing roughly linearly with Service count in iptables and staying flat in IPVS. The crossover lands somewhere between 1,000 and 5,000 Services depending on kernel version. Past that point IPVS pulls ahead on rule-reload latency and on the operational cost of debugging a churn-heavy cluster where endpoints recreate every few seconds. The upstream Kubernetes docs at kubernetes.io/docs/reference/networking/virtual-ips/ cover the canonical mechanics if you want a second source.

Our cluster sat in the awkward middle: about 4,000 Services with namespaces deploying every few minutes. By the chart we should have benefited from IPVS, and eventually we did. The Tuesday told me more about staging than about modes. When we eventually redid the migration per-namespace with monitoring between batches, the breakage showed up at the first fifty Services and the rollout paused there for diagnosis instead of running blind to four thousand. Conntrack capacity got raised before the second flip too - IPVS generates connection patterns that stress conntrack differently than iptables had, and that part of the story came out of the original rollback rather than from the docs.

CPU on hot nodes was the other axis I'd been tracking. iptables burns CPU on packet processing during chain traversal and on kube-proxy itself during reloads; IPVS burns less on both, at the cost of a bit more memory for the hash tables. On a 1,000-Service shadow cluster I'd benchmarked beforehand, the gap measured at about 3% of a core per node - real but not dramatic. On the 4,000-Service main cluster the gap was bigger and would have repaid the migration cost. It just had to actually finish migrating.

There's a quieter dimension I didn't appreciate at the time, which is nftables availability at the OS layer. Older distributions still using iptables-legacy make iptables mode run slower than it needs to. Modern distros with the iptables-nft shim (iptables binary, nftables backend) close the gap to IPVS significantly without any cluster-level change. Our nodes were already on iptables-nft, which I should have read as a signal to question whether the IPVS migration was worth the operational complexity at all.

Mistakes I keep collecting

Since that Tuesday I've put together a short list of mistakes I keep hitting, mine and other teams', when kube-proxy modes change underneath someone.

Debugging IPVS clusters with iptables. This was my own mistake first. You switch the cluster to IPVS, hit a Service routing problem, and reach for iptables -L -t nat | grep KUBE-SVC- because that's the muscle memory. The chains aren't there. The data path lives in ipvsadm -L -n and ip route show table local. Build IPVS muscle memory or you'll spend an hour chasing chains that don't exist. I spent forty minutes that first morning. A team I helped with a similar incident a year later spent ninety minutes before someone on the call asked the right question.

Conntrack table bursting was the second thing that came up, on a different cluster about six months later. Both modes lean on conntrack. The table has a fixed size (nf_conntrack_max, default scales with RAM), and on a busy node with lots of short-lived connections - Ingress controllers talking to Services were our canonical case - the table can fill and packets start landing in dmesg as nf_conntrack: table full, dropping packet. We watched cat /proc/sys/net/netfilter/nf_conntrack_count over time and put a Prometheus alert at 60-70% of nf_conntrack_max:

$ sudo conntrack -L | wc -l
178432
$ cat /proc/sys/net/netfilter/nf_conntrack_max
262144
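
The alert itself is just a ratio. A minimal sketch of the check, assuming a 65% threshold picked from the middle of the band above. The helper names are hypothetical, and read_proc_int only works on a Linux node where those /proc files exist.

```python
def conntrack_utilization(count: int, maximum: int) -> float:
    """Fraction of the conntrack table currently in use."""
    return count / maximum


def should_alert(count: int, maximum: int, threshold: float = 0.65) -> bool:
    return conntrack_utilization(count, maximum) >= threshold


def read_proc_int(path: str) -> int:
    """Read a single integer such as /proc/sys/net/netfilter/nf_conntrack_count."""
    with open(path) as f:
        return int(f.read().strip())


# The numbers from the output above.
print(f"{conntrack_utilization(178432, 262144):.1%}")
```

With those two numbers the table is already about 68% full, so that node would have been paging.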

Once we sized the table up and shortened a few of the longer timeouts, the dropped-packet logs went quiet. The cluster that hosted our Ingress fleet was the one where this hit - post-IPVS the connection patterns had shifted underneath us in ways we hadn't predicted.

Mixed-mode confusion has bitten me twice now, which is exactly two times more than I'd like. kube-proxy mode is per-node, not per-cluster - if a mass IPVS rollout misses a handful of nodes that didn't restart kube-proxy (drained late, networking glitch during the rollout, whatever), the cluster ends up running two modes simultaneously. Both modes work in isolation. Debugging behaviour that depends on which node a connection landed on is painful, and traffic that hops between modes mid-migration produces error patterns that look nothing like normal failure modes. After the second time, our rollout checklist grew a line: confirm with kubectl get pods -n kube-system -o wide plus kubectl logs -n kube-system on each node's kube-proxy pod before declaring the rollout done.
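
The checklist line reduces to "find the nodes that disagree with the majority". A minimal sketch of that comparison, assuming the per-node mode has already been collected somehow. As far as I know kube-proxy reports it at /proxyMode on its metrics address (port 10249 by default), but treat that as an assumption to verify on your version. The node names are made up.

```python
from collections import Counter


def mode_outliers(modes: dict) -> dict:
    """Return the nodes whose proxy mode differs from the most common one."""
    if not modes:
        return {}
    majority, _ = Counter(modes.values()).most_common(1)[0]
    return {node: mode for node, mode in modes.items() if mode != majority}


# Hypothetical result of polling four nodes after a rollout.
observed = {"node-a": "ipvs", "node-b": "ipvs", "node-c": "iptables", "node-d": "ipvs"}
print(mode_outliers(observed))
```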

The last one I see frequently now is teams treating nftables mode like iptables mode because the syntax looks broadly familiar. kube-proxy gained a native nftables mode as alpha in 1.29 and beta in 1.31, and while it's similar in spirit to iptables, the underlying ruleset format is the modern nftables one. Sets and verdict maps replace linear chain traversal with O(1) lookups. Rule and chain names sit far enough off the iptables originals that runbooks copy-pasted from older incidents stop working. The two teams I watched do 1.31+ upgrades both hit this within the first week - their iptables-savvy debugging tooling didn't port over.

Two years later

Looking back two years on, the migration story was less about iptables-vs-IPVS than it had felt at the time. I'd been thinking about kube-proxy as the data plane and it isn't. Once the question stopped being "chains or hash tables" and started being "what's the actual control loop telling the kernel to do, and how fast does it converge under failure modes I haven't tested", the rest of the work got noticeably less mysterious. The Tuesday morning rollback came from my mental model being wrong, not from anything IPVS did.

Kubernetes 1.31 shipped nftables as a beta proxy mode; 1.33 promoted it to stable. The shape is straightforward - kube-proxy stays as a rule generator, the iptables interface drops away, nftables rules go in directly. O(1) set lookups sit in the data structure now, which is what we feel during reloads on busy clusters. Incremental rule updates ride along too: kube-proxy no longer regenerates the whole ruleset on every endpoint change. The new clusters we've spun up since 1.33 default to nftables mode, and the rule-reload latency that drove our IPVS migration has stopped being something anyone graphs. iptables is officially in maintenance mode and nftables is the long-term direction for Linux packet filtering anyway, so the alignment is convenient.

And then there's the eBPF route. Cilium has been pushing kubeProxyReplacement: true for years, and the shape is simple: the Service-to-endpoint mapping lives in a BPF map, lookups are constant-time hash hits, no rule generator anywhere on the node. We ended up there two clusters later. The iptables-versus-IPVS conversation we'd had on that Tuesday simply stopped applying - no chains to traverse, no rules to reload, no ipvsadm to learn. By the time anyone asked which mode was faster, the answer for our team was "we run neither".

Issue #018 - Flux OCIRepository: the GitOps that stopped using Git

Ilia Gusev — Tue, 19 May 2026 14:00:37 GMT

Ask ten platform engineers what GitOps means and at least eight will say "the cluster pulls manifests from Git." That's the part everyone remembers, and it's the part Flux is quietly walking away from. The new default isn't Git. It's an OCI registry, the same one your container images already live in.

This issue is about what happens when you stop polling Git and start polling a registry, why Flux added that capability, and what it tells you about what GitOps was always actually about.

🏗️ Architectural Pattern: when the registry becomes the source of truth

The original GitOps loop looked like one arrow. Git on one end, cluster on the other, a controller in the middle that did git clone, ran kustomize build or helm template, then kubectl apply. Whatever was in main was what ran. Reconcile every minute. Done.

A monorepo with two hundred apps is not a thin arrow. The controller does a shallow clone, sure, but it's still pulling branch metadata and the entire tree at that commit just to read one path. SOPS-encrypted secrets need decryption inside the cluster. Helm rendering needs the chart cache. The "cluster pulls from Git" sentence hides a lot of work happening inside the cluster, on every reconcile, for every tenant.

The other thing that arrow hides is authentication. Git over SSH means an SSH key sitting in a Secret inside the cluster, with whatever blast radius that key has. Git over HTTPS means a token doing the same. Both are long-lived credentials. Neither integrates with the cloud IAM that already governs every other thing the cluster touches. You end up with a parallel auth domain just for Source Controller, which then needs its own rotation policy, its own incident response when it leaks, its own grumpy security review.

The Gitless GitOps move is to take that work out of the cluster. CI does the rendering. CI does the decryption. CI runs conftest, kyverno test, kubeconform, whatever your policy stack is. The output - a frozen tarball of plain manifests - gets pushed to an OCI registry as an artifact, tagged with the commit SHA. The cluster never sees Git. It pulls one tarball, content-addressed by digest, and applies it.

What "immutable" actually buys you

Git is mutable. You can force-push. You can rewrite history. You can delete a branch. Most teams have policies against it, but the storage layer itself doesn't care. A tag in Git is a movable label.

An OCI artifact pinned by digest is the opposite. sha256:abc123... resolves to exactly one byte sequence forever, or it resolves to nothing. The registry can refuse new pushes to an immutable tag, the way ECR and Harbor both can. There's no "force push" verb in the OCI spec.

When your Kustomization references an OCIRepository by digest, you get something Git can't give you cheaply: the exact bytes that ran in staging are the exact bytes that run in prod. Hydration drift, the thing that bites every team with a complex Helm setup, stops being possible because hydration happened once, in CI, and got frozen.
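
The "exact bytes" property is nothing more exotic than a hash. A minimal sketch of how a digest in the sha256:... form is derived from a blob. The manifest bytes here are invented, and a real registry computes this over the stored layer or manifest, not over your source tree.

```python
import hashlib


def oci_digest(blob: bytes) -> str:
    """Content address in the form registries use: algorithm, colon, hex hash."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


staging = b"kind: Deployment\nreplicas: 3\n"
prod = b"kind: Deployment\nreplicas: 3\n"
drifted = b"kind: Deployment\nreplicas: 4\n"

print(oci_digest(staging) == oci_digest(prod))     # identical bytes, identical digest
print(oci_digest(staging) == oci_digest(drifted))  # one byte off, different digest
```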

Content-addressed delivery, the same way images work

You don't git clone your application into the kubelet. You build a binary, layer it into an image, push the image to a registry, and the kubelet pulls a content-addressed blob. The image digest is the contract. The registry handles authentication, replication, caching, signing, scanning. None of that lives in Git.

OCI artifacts extend the same pipe to non-image payloads. The spec carved out mediaType for arbitrary content. Helm charts have been shipping as OCI artifacts for years. Flux's OCIRepository source is the same idea for plain manifests. The bytes are different, the wrapper and the delivery mechanism are identical to what you already trust for application images.

Git stops being a delivery channel. It goes back to being a code review system, which is what it was good at.
The registry becomes the configuration plane, sitting next to the image plane, sharing auth and replication and signing.
The cluster does one thing: pull a tarball by digest, unpack, apply. No template engines, no decryption, no policy evaluation. The complicated work is upstream.

Where Bucket fits

OCIRepository covers the case where you've got a real registry. Some teams don't, or don't want to. Bucket is the same idea with S3 or GCS or any S3-compatible store underneath. Source Controller polls the bucket, packs whatever's there into a tarball, exposes it over HTTP to the rest of Flux. The semantics match: pull a frozen blob, apply it.

Bucket shines for things that have no business being in Git. ML model files of two gigabytes each, the kind nobody wants in their commit history. Database dumps for ephemeral environment seeding live well here too. Big static assets, init-container payloads, similar story. Object storage is the right primitive for any of these, and Source Controller treats it as a first-class source.

📑 RFC/KEP Read: OCIRepository, Bucket, and the hydration pipeline

The Flux pieces that make this work split across two controllers. Source Controller is the one that turns external storage into in-cluster Artifacts. Kustomize Controller (or Helm Controller, depending on what you're rendering) reads those Artifacts and applies them.

The OCIRepository CRD

A minimal OCIRepository looks like this:

apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: app-manifests
  namespace: flux-system
spec:
  interval: 5m
  url: oci://ghcr.io/podostack/app-manifests
  ref:
    tag: v1.4.0
  provider: generic

The interesting fields:

url: registry path, no tag. Tag and digest go under ref.
ref.tag / ref.semver / ref.digest: three ways to pin the version. Tag is the loosest, semver lets you say ">=1.4.0 <2.0.0", digest is the strongest contract.
provider: generic, aws, azure, or gcp. The non-generic providers wire in cloud IAM, so the controller's ServiceAccount carries IRSA or Workload Identity instead of a static credential.

Pinning to a digest is what gets you the immutable guarantee. Pinning to a tag is convenient but the tag can move - unless the registry enforces immutability, which most do for production paths.
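
What a semver pin means in practice is "the highest tag inside the range". A minimal sketch of that selection for the ">=1.4.0 <2.0.0" example. It only understands plain MAJOR.MINOR.PATCH tags with an optional v prefix, which is far less than a real semver library handles, and the tag list is made up.

```python
import re
from typing import Optional, Tuple


def parse(tag: str) -> Optional[Tuple[int, ...]]:
    """Parse 'v1.4.0' or '1.4.0' into (1, 4, 0); anything else is ignored."""
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag)
    return tuple(int(x) for x in m.groups()) if m else None


def latest_in_range(tags: list, lower: tuple, upper: tuple) -> Optional[str]:
    """Highest tag with lower <= version < upper, or None if nothing matches."""
    candidates = []
    for tag in tags:
        version = parse(tag)
        if version is not None and lower <= version < upper:
            candidates.append((version, tag))
    return max(candidates)[1] if candidates else None


tags = ["v1.3.9", "v1.4.0", "v1.9.2", "v2.0.0", "latest"]
print(latest_in_range(tags, (1, 4, 0), (2, 0, 0)))
```

The catch is visible in the sketch: push v1.9.3 and the same pin resolves to different bytes, which is why digest is the only ref that gives the immutable guarantee.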

The reconcile loop is straightforward. Source Controller hits the registry every interval, compares the resolved digest against what it has, pulls and extracts on change, and exposes the result at an HTTP endpoint for other Flux controllers.
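
That loop is small enough to sketch. This is not Source Controller's code, just the shape of it: resolve_digest and pull are stand-in callables, and SourceState is a hypothetical holder for what the controller last fetched.

```python
from typing import Callable, Optional


class SourceState:
    def __init__(self) -> None:
        self.digest: Optional[str] = None  # last digest successfully pulled
        self.pulls = 0


def reconcile_once(state: SourceState,
                   resolve_digest: Callable[[], str],
                   pull: Callable[[str], None]) -> bool:
    """One interval tick: pull only if the resolved digest changed."""
    digest = resolve_digest()
    if digest == state.digest:
        return False  # nothing new, no bytes transferred
    pull(digest)
    state.digest = digest
    state.pulls += 1
    return True
```

Most ticks take the early return, so the steady-state cost is one digest lookup per interval.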

Verification with cosign

The spec.verify block is where this pattern stops being just "git over a different transport" and starts being something Git can't match cheaply:

spec:
  verify:
    provider: cosign
    matchOIDCIdentity:
      - issuer: "^https://token.actions.githubusercontent.com$"
        subject: "^https://github.com/podostack/app/.+$"

This says: refuse to use this artifact unless cosign verifies it was signed in a GitHub Actions run on a workflow inside the podostack/app repo. Keyless signing through Fulcio means there's no key to rotate, no key to leak, no key sitting in a secret somewhere. The signature ties the artifact back to the CI workflow that produced it through OIDC.
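
The two patterns are ordinary regular expressions matched against the identity in the signing certificate. A minimal sketch of that matching step only, with no actual signature verification. The release.yml workflow path in the example subject is hypothetical, and Python's re stands in here for whatever regex engine the verifier uses.

```python
import re

# The patterns from the verify block above.
ISSUER = r"^https://token.actions.githubusercontent.com$"
SUBJECT = r"^https://github.com/podostack/app/.+$"


def identity_allowed(issuer: str, subject: str) -> bool:
    """True only if both the OIDC issuer and the subject match their patterns."""
    return bool(re.search(ISSUER, issuer)) and bool(re.search(SUBJECT, subject))


gha = "https://token.actions.githubusercontent.com"
ok = "https://github.com/podostack/app/.github/workflows/release.yml@refs/heads/main"
other_repo = "https://github.com/podostack/other/.github/workflows/release.yml@refs/heads/main"

print(identity_allowed(gha, ok))
print(identity_allowed(gha, other_repo))
```

A workflow in a sibling repo under the same org fails the subject match, which is the property the anchored pattern is there to provide.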

If you've read Issue #016 on Kyverno, this is the supply-chain end of the same picture. Kyverno verifies image signatures at admission. Flux verifies manifest signatures at source. Both lean on Sigstore, both reject unsigned blobs by default once the verify block is in place. The cluster ends up with a clean property: nothing applied to it exists without a chain of custody back to a CI run.

The Bucket CRD

apiVersion: source.toolkit.fluxcd.io/v1
kind: Bucket
metadata:
  name: ml-models
  namespace: flux-system
spec:
  interval: 10m
  provider: generic
  bucketName: prod-ml-models
  endpoint: s3.eu-central-1.amazonaws.com
  region: eu-central-1
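  ignore: |
    # exclude everything first; a leading "!" on its own re-includes nothing
    *
    # then re-include only the model files
    !*.onnx
    !*.pt
    logs/
    tmp/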

The trap people fall into is leaving ignore empty. Source Controller will then try to pack the entire bucket into a tarball in memory, and if the bucket is half a terabyte of training data, it does not end well. The ignore field uses .gitignore syntax, and treating it as required is the right move.

The CI side: `flux push artifact`

flux push artifact oci://ghcr.io/podostack/app-manifests:v1.4.0 \
  --path=./dist/manifests \
  --source="$(git config --get remote.origin.url)" \
  --revision="main@sha1:$(git rev-parse HEAD)"

The --source and --revision flags land in artifact annotations and the Flux UI uses them to show the producing commit. Git stops being the delivery channel but stays the audit trail.

What you do before flux push artifact is where the pattern earns its keep. The CI pipeline renders manifests with kustomize build overlays/prod (or helm template . -f values-prod.yaml, depending on which you live in). Encrypted secrets get sops --decrypt'd, sealed back with kubeseal only if the cluster can't decrypt at runtime. Then the policy gate - conftest test or kyverno test against the rendered output - catches RBAC and image-policy violations before anything ships. That gate is the step everyone underestimates the first time they wire this up. After policy passes, kubeconform does a schema check. Only then does flux push artifact upload the tarball, and cosign sign signs the pushed artifact with the CI workflow's OIDC token, because cosign signs what's already in the registry. Schema first, signature second: a broken manifest shouldn't get a signature it doesn't deserve.

Every one of those steps used to happen, in some half-form, inside the cluster. Now it's done once in CI, frozen into one tarball, and reconciliation becomes pure apply.

Wiring it to Kustomization

The Kustomization resource that consumes an OCIRepository only changes one field compared to the GitRepository version:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: app
  namespace: flux-system
spec:
  interval: 10m
  path: "./deploy/prod"
  prune: true
  sourceRef:
    kind: OCIRepository
    name: app-manifests

sourceRef.kind: OCIRepository instead of GitRepository. That's the whole migration. When I first did this, I expected drama in the Kustomize Controller - health checks, decryption providers, all of it. There was none. The rest of Flux doesn't care which Source kind it consumes, and the downstream pipeline kept working unchanged.

What the cluster actually sees

End to end, three things end up on the cluster, and that's the entire surface area:

One OCIRepository resource pointing at a registry.
One Kustomization resource pointing at that OCIRepository.
A 200KB tarball, checked against the registry every five minutes and pulled only when the digest changes.

What's not there is the more interesting list. There's no .git directory hidden under /var/lib. No long-lived SSH key parked in a Secret. No template engine running on the reconcile path. The cluster does pull-and-apply against a content-addressed blob, the same shape of operation it already does for every container image.

🔥 Hot Take: GitOps was never about Git

The name is a marketing accident.

When Weaveworks coined "GitOps" in 2017, the point was the loop, not the storage. Git happened to be the durable thing everyone had. The loop is what mattered, and the loop works just as well with an OCI registry on the other end.

The standard objection runs something like: "Git gives me an audit log. Git gives me PR-based review. Git gives me branch protection. You're throwing all of that away."

Git is still the place where humans write YAML and review each other's changes. Pull requests still gate merges. Branch protection still keeps main clean. CODEOWNERS still routes reviews. Signed commits still tie changes to identities. None of that goes anywhere. What changes is what happens after the merge.

Pre-OCI, the merge to main was the deploy. The controller polled the branch and applied whatever was there. The "audit log of what shipped" was the same as the "audit log of what got reviewed." Convenient, and also kind of fragile: a force-push or a poorly-reviewed merge ships immediately.

Post-OCI, the merge to main triggers CI. CI renders, validates, pushes an artifact, and signs it. The digest is what the cluster runs. "What was running on cluster X at 14:00 UTC" becomes a registry query. The signature on the artifact links back to the CI workflow run, which links back to the commit, which links back to the PR. The chain is longer but stronger, because every link is verifiable cryptographically.

The team most likely to benefit already runs a serious image-supply-chain stack: cosign on every image, replicated registry, air-gapped pulls, scanning on push. Adding manifests to that same pipe costs almost nothing. The team least likely to benefit runs a dozen apps with no monorepo, no air gap, no signing requirements - for them GitRepository is the right answer. The pattern earns its complexity when scale or supply-chain demands push back against Git's limits, not before.

There's a cultural shift hiding in this too. The team that owns the registry is usually the security or platform team; the team that owns Git is usually the application team. Moving the deploy boundary from Git to the registry shifts where the supply-chain controls live. Issue #004 covered the same dynamic from the Crossplane and Backstage angle - the platform layer wants its own contract with the cluster, separate from whatever application teams push.

GitOps was a name for a loop. The loop still runs. The plumbing got better.

What's next

Issue #019 picks up the supply-chain thread with image-pull policy and registry mirroring in air-gapped clusters - the other half of "what does the cluster actually pull." Issue #020 follows with the Image Preload Operator, which finishes the picture by warming the kubelet's image cache so the artifacts you just signed and shipped don't pay cold-start latency on first deploy.

The arrow into the cluster is getting interesting again.

Inode exhaustion: the disk-full error that your free-space graph doesn't show

Ilia Gusev — Fri, 15 May 2026 14:01:31 GMT

The on-call engineer stares at a graph. The host has 40 GB of free disk. The application logs say write error: no space left on device. Every attempt to create a new file fails. Restarting the service doesn't help. Rebooting doesn't help. This makes no sense because the space is right there, unused.

The answer is the filesystem ran out of inodes, and df -h (which shows space) doesn't tell you that. df -i (which shows inodes) does, but most monitoring dashboards only track the first one. This post is about why inode exhaustion is its own failure class, why small-file workloads hit it first, and how to avoid getting paged at 3 AM for a disk that has plenty of disk.

What an inode actually is

On Unix-style filesystems, every file and directory has two parts:

The inode, which holds all metadata about the file: permissions, owner, size, timestamps, and pointers to the data blocks where the contents live.
The data blocks, the actual bytes of the file's contents.

A file's name is not in the inode. Names live in directory entries (dentries), which are just tables that map names to inode numbers. When you run ls, you're reading a directory's dentries. When you actually open a file, you go through the dentry to get the inode, then through the inode to get the data blocks.

The number of files a filesystem can hold is limited by two things:

The data space available (how many bytes can you write).
The inode table size (how many distinct files can you have).

Most filesystems (ext4, for example) allocate the inode table at format time. Once the inode count is set, changing it requires reformatting the filesystem. XFS grows inodes dynamically and doesn't hit this limit in the same way, which is one reason big-file-count workloads on Linux tend toward XFS.

Why small-file workloads are the ones that hit the wall

The default ext4 inode-to-data ratio is one inode per 16 KiB of storage. On a 100 GB partition, that's about 6.5 million inodes. Sounds like a lot, until you hit:

A Maildir-style mail server with 100 million small messages.
A caching proxy (Squid, Varnish on disk) with millions of tiny cached responses.
A build system that keeps per-version artifacts forever.
A Prometheus TSDB with aggressive retention on high-cardinality metrics, producing millions of tiny block files.
A Docker registry running a cleanup policy that hasn't kept up.

Each of these creates many small files. Inodes get consumed at a rate way faster than bytes do. The filesystem has plenty of space but runs out of inodes. no space left on device is the generic error Linux returns in both cases, so the error message lies to you.
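The arithmetic behind that ratio is easy to check. The sketch below is a rough model only: it ignores reserved inodes, directories, and block rounding, and the helper names are made up for illustration.

```python
KIB = 1024
GIB = 1024 ** 3


def ext4_inode_count(fs_bytes: int, bytes_per_inode: int = 16 * KIB) -> int:
    """Approximate number of inodes allocated at format time for a given ratio."""
    return fs_bytes // bytes_per_inode


def inodes_run_out_first(avg_file_bytes: int, bytes_per_inode: int = 16 * KIB) -> bool:
    """True when the average file is smaller than the bytes-per-inode ratio,
    so the inode table fills before the data space does."""
    return avg_file_bytes < bytes_per_inode


print(ext4_inode_count(100 * GIB))       # 6553600
print(inodes_run_out_first(2 * KIB))     # True: a store of tiny cached responses
print(inodes_run_out_first(512 * KIB))   # False: larger files run out of bytes first
```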

The diagnosis command everyone should know

Two flags of df:

$ df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       200G  160G   40G  80% /var

$ df -i /var
Filesystem        Inodes   IUsed    IFree IUse% Mounted on
/dev/sda1       13107200 13100000    7200 100% /var

First output: 40 GB free. Everything looks fine. Second output: 100% of inodes used, seven thousand remaining. That's the real state.

Every monitoring system should track both. If your Prometheus or Datadog is only alerting on disk percentage used (node_filesystem_avail_bytes), you're blind to inode exhaustion. Add node_filesystem_files versus node_filesystem_files_free.

The Prometheus query for an alert:

(
  node_filesystem_files - node_filesystem_files_free
) / node_filesystem_files > 0.85

Page when inode usage crosses 85%. You'll have time to clean up before the filesystem refuses new writes.
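The same ratio can be checked locally without Prometheus. This is a minimal sketch using Python's os.statvfs, whose f_files and f_ffree fields are the total and free inode counts for the filesystem. The function name and the None convention for filesystems that report no fixed inode total are assumptions made for the example.

```python
import os
from typing import Optional


def inode_usage(path: str) -> Optional[float]:
    """Fraction of inodes in use on the filesystem that holds `path`.

    Returns None when the filesystem reports a total of 0 inodes,
    which some dynamically allocating filesystems do.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    return (st.f_files - st.f_ffree) / st.f_files


usage = inode_usage("/")
if usage is not None and usage > 0.85:
    print(f"inode usage at {usage:.0%}, over the 85% threshold")
```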

Finding the offender: where are all these files

When inode usage is high and you need to find the responsible directory, two commands help:

# Directories with the most files directly
$ find /var -xdev -type d -exec sh -c 'echo "$(ls -a "$0" | wc -l) $0"' {} \; | sort -rn | head -20

# Total file count per top-level directory (more useful for triage)
$ for dir in /var/*; do echo "$(find "$dir" -xdev 2>/dev/null | wc -l) $dir"; done | sort -rn | head

The second form is faster for a rough first pass. The first form shows which directories have thousands of entries directly (an indicator of a bad storage pattern that doesn't shard into subdirectories).
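If shell one-liners are awkward to keep in a runbook, the first command can be approximated in Python. This is a sketch under stated assumptions: the function name is made up, it counts direct entries only, and it stays on one filesystem by comparing device IDs in the spirit of find -xdev.

```python
import os
from collections import Counter


def entries_per_directory(root: str) -> Counter:
    """Count direct entries (files plus subdirectories) in each directory under root."""
    counts = Counter()
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        counts[dirpath] = len(dirnames) + len(filenames)
        # Prune subdirectories that live on another filesystem so the walk
        # does not cross mount points.
        same_fs = []
        for name in dirnames:
            try:
                if os.lstat(os.path.join(dirpath, name)).st_dev == root_dev:
                    same_fs.append(name)
            except OSError:
                pass  # entry vanished or is unreadable; skip it
        dirnames[:] = same_fs
    return counts


# Usage: the twenty directories with the most direct entries.
# for path, n in entries_per_directory("/var").most_common(20):
#     print(n, path)
```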

On Kubernetes nodes, common offenders:

/var/lib/containerd/* for accumulated image layers (run crictl rmi --prune).
/var/log/pods/* for orphaned pod logs when the kubelet log cleanup didn't run.
/var/lib/kubelet/pods/*/volumes/kubernetes.io~empty-dir/* for disk-backed emptyDir workloads that forgot to clean up (tmpfs-backed emptyDirs live in memory and don't draw on the node filesystem's inodes).
/var/lib/docker/overlay2/* for older container runtimes.

Hard links, soft links, and the inode counter

A detail that matters for cleanup: hard links and soft links count differently.

A hard link is another dentry pointing to the same inode. Two names, one inode, one set of data blocks. Hard links don't consume extra inodes because they reuse the existing one. Deleting a file with hard links only frees the inode when the link count reaches zero.

A symlink (soft link) is its own file. It has its own inode and contains a text path to the target. Symlinks do consume inodes. A directory full of symlinks hits inode exhaustion just like a directory full of real files.

Practical consequence: if you're trying to reduce inode usage and you have a directory with 10 million hard links, rm-ing them doesn't free inodes in proportion, because an inode is released only when its last remaining link goes away. If that same directory had symlinks, each rm frees an inode.
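The difference is easy to see on a scratch directory. The sketch below uses only standard-library calls (os.link, os.symlink, os.lstat). The file names and the function name are arbitrary.

```python
import os
import tempfile


def link_demo() -> dict:
    """Create a file, a hard link, and a symlink, then report inode facts."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "data.txt")
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")
        with open(original, "w") as f:
            f.write("payload")
        os.link(original, hard)      # second name, same inode
        os.symlink(original, soft)   # new file with its own inode

        result = {
            "hard_shares_inode": os.lstat(hard).st_ino == os.lstat(original).st_ino,
            "symlink_has_own_inode": os.lstat(soft).st_ino != os.lstat(original).st_ino,
            "links_before_rm": os.lstat(original).st_nlink,
        }
        os.remove(hard)              # drops one name; the inode stays allocated
        result["links_after_rm"] = os.lstat(original).st_nlink
        return result


print(link_demo())
```

Removing the hard link only decrements the link count. The inode is still held by the original name.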

Filesystem choices that change the calculus

Three common filesystems, three different stories:

ext4. Fixed inode count at format time. Reformat to change it. Safe default for servers with predictable file counts.
XFS. Dynamic inode allocation. Grows as needed, bounded by available space. No pre-allocation. Preferred for workloads with unpredictable or very large file counts.
btrfs, ZFS. Different conceptual models. ZFS has effectively unlimited objects but has its own resource limits (ARC memory, for one). Btrfs also allocates inodes on demand, and its wall is metadata space: the metadata block groups can fill up while data space remains. Both can hit different walls but not the classic inode-exhaustion wall.

If you're building a storage tier for a Maildir server, a caching proxy, or any workload with millions of small files, XFS is usually the less painful default.

If you're stuck on ext4 and inode exhaustion is chronic, two options:

Reformat with mke2fs -N at creation time, setting a higher inode count (at the cost of disk space for metadata).
Change the inode density with the -i flag: mke2fs -i 4096 creates one inode per 4 KiB instead of 16 KiB, quadrupling the count.

Both require reformat. There's no online resize path for ext4 inode count.
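The "cost of disk space for metadata" can be estimated before reformatting. This sketch assumes a 256-byte on-disk inode, which is the usual mke2fs default for ext4. Check your own mke2fs.conf before relying on the numbers. The helper name is illustrative.

```python
GIB = 1024 ** 3
INODE_SIZE = 256  # bytes per on-disk inode; assumed default, verify for your distro


def inode_table_bytes(fs_bytes: int, bytes_per_inode: int) -> int:
    """Approximate space reserved for inode tables at a given -i ratio."""
    inode_count = fs_bytes // bytes_per_inode
    return inode_count * INODE_SIZE


for ratio in (16384, 4096):
    table = inode_table_bytes(100 * GIB, ratio)
    print(f"-i {ratio}: {100 * GIB // ratio} inodes, {table / GIB:.4f} GiB of inode tables")
```

On a 100 GiB volume, going from one inode per 16 KiB to one per 4 KiB quadruples both the inode count and the space set aside for inode tables.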

The production-grade checklist

For any Linux-based production infrastructure:

Monitor both disk space and inode usage. Same dashboard, same alerts.
Know which of your workloads produce small-file storms. Mail, caching, logs, container images. These are the inode-exhaustion candidates.
Prefer XFS for small-file-heavy workloads. Default ext4 everywhere else.
Alert at 85% inode usage. By 95% you're already in cleanup mode.
Keep a runbook for inode emergencies. The find commands above, the common K8s offenders, the escape hatch of reformatting the volume.

Summary

Inode exhaustion is one of those failure classes that doesn't matter until it matters, and then it takes down your node at the worst time. It's not a bug, it's a filesystem choice that shows up in specific workload patterns. The fix is monitoring on the right metric and choosing the right filesystem for the workload profile.

When "no space left on device" doesn't match your free-space graph, check df -i first.

For the surface where this matters on Kubernetes specifically (image pull storms, log rotation, ephemeral volume cleanup), see Cold Start: A Pod's First 60 Seconds. For Prometheus-side small-file awareness, Prometheus WAL Internals covers the TSDB block structure.

Cilium Egress Gateway: stable outbound IPs for pods that need them

Ilia Gusev — Wed, 13 May 2026 14:01:10 GMT

Every platform team eventually gets the same ticket. Team X's service needs to call Team Y's legacy API. Team Y says, "sure, give us your IP addresses and we'll whitelist them." Team X's pods have 47 different IP addresses across 15 nodes across 3 availability zones, any one of which might disappear in the next minute because Karpenter is consolidating. The IPs aren't stable, and cloud NAT gateways route the whole VPC's egress through one IP, which is too coarse.

Cilium Egress Gateway exists for this exact problem. One feature of the Cilium CNI, usually skipped in intro tutorials, that solves a real integration problem most clusters eventually face.

The problem in concrete terms

Your application needs to reach an external service. The external service is one of:

A partner API that requires IP whitelisting.
A legacy database that sits behind a firewall with an IP-based rule.
A government service that only accepts traffic from registered source IPs.
A SaaS product with per-customer IP allowlists for compliance.

On the pod side, source IPs are ephemeral. They depend on the node, the CNI's IPAM scheme, whether the pod was restarted 30 seconds ago, and a dozen other factors you don't control. Static IPs for pods are either fragile (host network, not safe) or unscalable (assign each pod a reserved IP).

Cloud NAT gateways have the opposite problem. They work at the VPC or subnet level. Every egress from every pod on every node in that subnet gets the same NAT IP. Great for "all my cluster traffic exits through one IP." Terrible if you want to say "only the payment-processor pods should exit through this IP."

Cilium Egress Gateway is the piece that fits between these extremes. Per-pod granularity, stable egress IPs, no dependency on cloud-specific NAT.

How it actually works

Three parts:

Gateway nodes. Designate one or more nodes in the cluster as egress gateways. They have the stable IPs. You mark them with a label like node-role.kubernetes.io/egress-gateway=true.

CiliumEgressGatewayPolicy resource. A CRD that declares the routing rule. Two key fields:

selectors: which pods the rule applies to, matched by labels through a podSelector.
destinationCIDRs: which external IP ranges trigger the rule.

eBPF datapath. When a pod matching the selector sends a packet to an IP in the destinationCIDRs, the Cilium eBPF program on the originating node intercepts it. Instead of sending it out the node's default route, it tunnels the packet to one of the gateway nodes. The gateway node does source NAT (SNAT) on the packet, rewriting the source IP to the gateway's stable IP, then forwards to the real destination.

From the external service's perspective, traffic arrives from the gateway IP. The external service has no idea the real client is a pod that started 30 seconds ago.

Example policy:

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: payment-processor-egress
spec:
  selectors:
    - podSelector:
        matchLabels:
          app: payment-processor
  destinationCIDRs:
    - "203.0.113.0/24"
  egressGateway:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/egress-gateway: "true"
    egressIP: "192.0.2.10"

Pods with app: payment-processor sending to 203.0.113.0/24 exit through 192.0.2.10. Everything else flows normally. Other pods sending to 203.0.113.0/24 flow normally. The rule is both pod-specific and destination-specific.
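The "both pod-specific and destination-specific" rule can be modeled in a few lines. This is a toy Python model of the matching decision only, not Cilium's datapath. The class and function names are invented for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EgressPolicy:
    pod_labels: Dict[str, str]      # labels a pod must carry to match
    destination_cidrs: List[str]    # external ranges that trigger the rule
    egress_ip: str                  # stable source IP applied at the gateway


def egress_ip_for(policy: EgressPolicy, pod_labels: Dict[str, str], dest_ip: str) -> Optional[str]:
    """Return the gateway IP when both pod and destination match, else None."""
    labels_match = all(pod_labels.get(k) == v for k, v in policy.pod_labels.items())
    dest = ipaddress.ip_address(dest_ip)
    dest_match = any(dest in ipaddress.ip_network(c) for c in policy.destination_cidrs)
    return policy.egress_ip if labels_match and dest_match else None


policy = EgressPolicy({"app": "payment-processor"}, ["203.0.113.0/24"], "192.0.2.10")

print(egress_ip_for(policy, {"app": "payment-processor"}, "203.0.113.7"))   # 192.0.2.10
print(egress_ip_for(policy, {"app": "frontend"}, "203.0.113.7"))            # None
print(egress_ip_for(policy, {"app": "payment-processor"}, "198.51.100.1"))  # None
```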

Why this matters beyond one ticket

Four benefits that compound:

Granularity. Different services can have different egress policies. Frontend traffic goes out the default path. Payment processor traffic goes through gateway A. Analytics exports go through gateway B. Per-service blast radius, per-service audit.

Least-privilege firewalling. Instead of opening the external firewall for the entire IP range of your cluster nodes (which changes every time autoscaling runs), you open it for one or two gateway IPs that don't change. The attack surface shrinks to the source addresses you actually need.

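High availability. Designate multiple gateway nodes so a replacement exists if one fails. How failover actually happens depends on your Cilium version and edition, so check whether gateway selection and health-checking are automatic in your setup or whether a policy resolves to a single node at a time. No external load balancer needed.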

Cloud-portability. The mechanism works identically on AWS, GCP, Azure, bare metal, or any mix. Not tied to NAT Gateway, not tied to Route53, not tied to any cloud-specific primitive. Moving clouds, this is one fewer thing to rebuild.

Where this breaks or surprises

Four edge cases to understand before relying on Egress Gateway in production:

Gateway node is a single point of concentration. All matching traffic flows through one or a few nodes. CPU and NIC bandwidth on those nodes matter. For high-throughput egress (bulk data exports, video streaming), gateway nodes need to be sized for the load, not just "whatever was cheap."

Reply traffic comes back through the gateway, not straight to the pod. The external service sends its response to the gateway IP. The gateway reverses the NAT and routes the packet back through the cluster's internal network to the original pod. If your return traffic pattern is heavy (large response payloads), the gateway has ingress load too.

Tunnel overhead. Traffic between originating node and gateway node is tunneled (VXLAN or similar). Adds some latency and CPU cost. Measurable for latency-sensitive workloads, invisible for batch jobs.

Connection tracking lives on the gateway. If the gateway node restarts, in-flight connections break. Client retries usually handle this, but long-lived TCP connections (database replication, gRPC streaming) can be more sensitive.

Comparison with alternatives

If Egress Gateway doesn't fit, what else is there?

Cloud NAT Gateway (AWS NAT Gateway, GCP Cloud NAT). Whole-VPC scope, not per-pod. Fine if you have one tenant and one outbound identity.
Per-node SNAT via iptables. Works, but every node needs its own stable IP. Scaling a cluster means updating whitelists.
Calico Egress Gateway. Similar feature in Calico's CNI. Same concept, different implementation. If you're already on Calico, use it.
Proxy-based egress (Envoy, Squid). A sidecar or L7 proxy that applies egress policy. Useful for HTTP-layer rules but heavier and more complex than Egress Gateway for simple IP-stability needs.

For the "give this specific service a stable outbound IP" problem, Cilium Egress Gateway is usually the lightest, most granular answer if Cilium is already your CNI.

The operational playbook

If you're rolling Egress Gateway out, a few practical steps:

Designate gateway nodes with taints. Regular workloads shouldn't schedule on egress gateways. The taint keeps them clean.
Size for peak egress plus headroom. 2x expected peak is a reasonable starting point. Monitor eBPF program CPU on gateway nodes.
Start with a single non-critical service. Confirm the SNAT works end-to-end, check the logs at the external service, verify reply traffic routes correctly.
Document the IP allocations. Which gateway IP, which policy, which service. This outlives the engineer who set it up.
Alert on gateway node CPU and connection count. Saturation leads to silent packet drops before failover kicks in.

Summary

Egress Gateway is the Cilium feature that solves a real integration problem most clusters eventually hit. Not flashy, not in the keynote, rarely on best-practice lists. Just a small CRD, a few eBPF programs, and one specific problem (stable outbound IP for selected pods) cleanly solved.

If you've been using a cloud NAT Gateway for too-coarse egress policy, or hacking per-node SNAT rules that break on every autoscale event, this is the cleaner answer.

For the broader Cilium context and why eBPF datapath matters, see Cilium Deep Dive. For the governance side where egress policies get enforced at admission, Kyverno Beyond Admission covers policy-driven outbound control.

Podo #017: Postgres on Kubernetes: Five Places the Control Plane and the Database Fight Over Recovery

Ilia Gusev — Tue, 12 May 2026 14:02:02 GMT

Welcome back to Podo Stack. Running Postgres on Kubernetes stopped being controversial a while ago. The right operators exist, the storage stack works, people actually do this in production. What hasn't changed is that every team that tries it rediscovers the same five decisions, usually under pressure, usually in the wrong order.

This issue walks through those five. Not "should you run Postgres on K8s." Assume yes. The interesting question is what the control plane and the database should do differently than a VM setup, and where their recovery instincts collide.

Here's what's good this week.

CloudNativePG quorum failover and the K8s recovery race

When the control plane and the database both try to fix the same outage.

CloudNativePG (CNCF incubating) is the Postgres operator most teams converge on. The reason it earned that position is the quorum-based failover introduced in v1.28, which solves a problem every HA database on Kubernetes eventually hits: the control plane and the data plane both want to recover the same outage, and they don't coordinate.

A standard three-replica CNPG cluster is one primary plus two standbys, synchronous replication, quorum write-ack. When the primary Pod stops responding, two recovery paths kick in:

Kubernetes side: the kubelet on that node marks the Pod NotReady. After the configured terminationGracePeriodSeconds plus the controller reconcile delay, K8s attempts to restart the Pod. Expected downtime: 30-60 seconds.
CNPG side: the instance manager inside the surviving standbys detects primary loss through replication stream health checks. If the quorum majority agrees, it promotes a standby to primary. Expected downtime: under 10 seconds.

Both are trying to help. They also can't see each other. K8s thinks "my Pod is unhealthy, I will restart it." CNPG thinks "the primary is down, I will promote a standby." Two recovery paths, one cluster, competing endings.

The race shows up concretely as split-brain attempts. K8s successfully restarts the original Pod. It comes back up expecting to be primary, reconnects, and discovers a different Pod has been promoted. If fencing isn't configured or the instance manager doesn't correctly demote, you can briefly have two Pods accepting writes. That's unrecoverable corruption.

CNPG handles this with instance-manager health checks that explicitly fence the former primary before promotion, and with a data directory consistency check at startup. The operator manages Pods and PVCs directly rather than through a StatefulSet, which is what lets it decide when, and in what role, a failed instance comes back instead of racing a generic controller.

The architectural rule: if your database has opinions about recovery, the control plane should defer. CNPG's operator pattern registers this preference explicitly. Running Postgres under a bare StatefulSet without an operator (or under a naive operator) is what produces the race.

Local vs replicated storage: let whoever replicates better own it

Database-level replication and storage-level replication compete.

Kubernetes gives you two broad storage models:

Local storage (hostpath, LocalPV, TopoLVM, ZFS LocalPV). Near-disk performance, no network layer on the hot path. Pod is pinned to the node that holds the data.
Replicated storage (Mayastor/OpenEBS, Longhorn, Rook/Ceph). Storage replicates across nodes at the block or volume layer through NVMe-oF or similar. Pod can reschedule to any node.

The instinct is to always pick replicated storage because "we need durability." That instinct is often wrong.

The question is: which layer knows more about your data? For workloads that already replicate at the application level, storage-level replication is a second, unaware copy of the same work, paying latency and cost twice.

Cassandra, MongoDB, Elasticsearch: replicate at the application layer. Local storage is the right pick. Storage-level replication adds latency without improving durability.
Kafka: in-cluster replication is a first-class feature. Local storage.
Redis Cluster: data already sharded and replicated by the cluster mode. Local storage.
Postgres with CNPG streaming replication + quorum: the database already handles replication. Local storage.
Postgres with a single Pod, no streaming replication: replicated storage is the answer because nothing else is replicating.

The case against replicated storage for a CNPG cluster is concrete. Mayastor replicates through NVMe-oF: every write on the primary Pod becomes a network round-trip to two other hosts before fsync returns. Your synchronous Postgres replication already makes two round-trips to standbys. You're paying for replication twice, through two different mechanisms that don't cooperate.

For a CNPG production cluster, the standard choice is local storage (LocalPV on NVMe) with sufficient replica count at the database layer. Storage is the fastest and simplest component. Durability lives in the replica set.

One asterisk: backup and disaster recovery still need something that reaches across nodes. Covered in the backup section at the end of this issue.

Quorum-aware PodDisruptionBudgets and zone anti-affinity

A PDB that doesn't know about quorum will cheerfully take down your cluster.

Kubernetes ships PodDisruptionBudgets as a safety against voluntary evictions (node drains, cluster upgrades, Karpenter consolidation). The habits that work for most Deployments (a budget sized for capacity, or no PDB at all) are exactly wrong for quorum-based systems, where the budget has to be derived from the quorum size.

A three-replica CNPG cluster can tolerate one Pod down. Two Pods down is loss of quorum, which is loss of write availability, which is an outage. A budget that lets a drain take a second Pod while one is already down allows exactly the disruption pattern that takes you down during a normal cluster upgrade.

The correct PDB for a three-replica quorum system is minAvailable: 2 (at exactly three replicas, maxUnavailable: 1 computes to the same budget). Kubernetes will refuse any eviction that would leave fewer than two healthy Pods. Node upgrades now serialize across Postgres Pods, even if the upgrade tool wanted to parallelize.

The second half of the story is placement. If all three Postgres Pods end up on the same physical rack, same zone, or (worst case) the same host, the PDB doesn't save you from a single rack power outage. PodAntiAffinity with zone topology keys is the standard fix:

Required (hard) anti-affinity: one Postgres Pod per zone. If your cluster has fewer healthy zones than replicas, Pods stay Pending until a new zone comes up. Correct for strict HA.
Preferred (soft) anti-affinity: try to spread, fall back to colocation. Correct for dev clusters with fewer zones than replicas.

There's a parallel to Podo #008 (RabbitMQ quorum queues). The same algebra applies: quorum system, N replicas, zone spread, minAvailable ≥ floor(N/2)+1. The shape repeats across every quorum technology.
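The algebra is short enough to write down. This Python sketch models only the budget check, with made-up function names. The real decision is made by the Kubernetes eviction API against the PDB's computed status.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of N replicas: floor(N/2) + 1."""
    return replicas // 2 + 1


def eviction_allowed(healthy: int, min_available: int) -> bool:
    """A voluntary eviction is allowed only if it still leaves
    at least min_available healthy Pods afterwards."""
    return healthy - 1 >= min_available


for n in (3, 5, 7):
    print(f"replicas={n} minAvailable={quorum(n)}")

# Three replicas, minAvailable=2:
print(eviction_allowed(healthy=3, min_available=2))  # True: first drain proceeds
print(eviction_allowed(healthy=2, min_available=2))  # False: second drain is blocked
```

Note that the value has to be recomputed when the replica count changes: minAvailable: 2 is right for three replicas and too low for five.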

Kyverno (Podo #016) can enforce this at admission: reject CNPG Cluster resources that don't have a properly-sized PDB and zone anti-affinity. One policy, half the quorum-PDB mistakes prevented at deploy time.

Connection pooler placement: the pod-vs-sidecar question

PgBouncer is a pod, not a library. Where it runs matters.

Postgres has a hard limit on concurrent backend connections (default max_connections: 100, typical production tuning: 200-500). Applications that open a connection per request overwhelm this quickly. The standard answer is a connection pooler: PgBouncer (most common) or Pgpool-II (more features, more footprint).

On Kubernetes, the pooler is a Pod. The architectural question is where that Pod lives.

Per-application sidecar. PgBouncer runs as a sidecar in every application Pod. Application connects to localhost:6432. Pros: connection locality, no network hop to reach the pooler. Cons: N application Pods × M backend connections per pooler = scale multiplication. Pool math gets ugly fast.

Centralized pooler Deployment. One PgBouncer Deployment (two or three replicas) in front of the cluster. All applications connect through it. Pros: single place to tune pool sizes, observe activity, rotate credentials. Cons: one more network hop, the pooler is a new failure domain.

CNPG-managed pooler (Pooler CRD). CNPG integrates PgBouncer as a first-class resource. The operator manages configuration, credentials, and rotation. Pros: declarative, operator-managed, consistent with the rest of the cluster. Cons: tied to CNPG, not directly portable.

For most production setups, the answer is centralized pooler, two or three replicas, behind a ClusterIP Service. The connection math is the deciding factor: total backend connections at Postgres equals (pooler replicas × pool size). A pool size of 50 across two pooler Pods gives you 100 backend connections, which fits inside a standard max_connections: 200 with room for replication and admin.
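The pool math is worth running for both placements before choosing one. The sketch below uses hypothetical numbers for the sidecar case (40 application Pods, pool size 10) and a hypothetical reserve of 20 connections for replication and admin. It also treats pool size as one number per pooler, whereas PgBouncer sizes pools per database and user pair.

```python
def backend_connections(pooler_instances: int, pool_size: int) -> int:
    """Worst-case backend connections opened against Postgres."""
    return pooler_instances * pool_size


def fits(total: int, max_connections: int, reserved: int) -> bool:
    """True when pooled connections plus reserved slots stay under the limit."""
    return total + reserved <= max_connections


centralized = backend_connections(pooler_instances=2, pool_size=50)
sidecar = backend_connections(pooler_instances=40, pool_size=10)  # one pooler per app Pod

print(centralized, fits(centralized, max_connections=200, reserved=20))  # 100 True
print(sidecar, fits(sidecar, max_connections=200, reserved=20))          # 400 False
```

With sidecars, every application scale-up raises the backend total. With a centralized pooler the total moves only when you change the pooler itself.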

Transaction mode vs session mode is the other choice the docs rarely lead with. Transaction mode is the one that actually pools at scale. Session mode holds a backend connection for the entire client session and gives you the same connection count problem you started with. Transaction mode is the default production choice unless the application uses session-scoped features (prepared statements without protocol: extended, LISTEN/NOTIFY, advisory locks).

Backup primitives: application-layer and storage-layer, both

One backup mechanism is not enough.

Backup on Kubernetes for a database is usually framed as a choice: Volume Snapshots (CSI-based) vs logical dumps vs application-layer continuous archiving. The framing is wrong. Production databases need two layers, not one.

Continuous WAL archiving with pgBackRest (or CNPG's integrated Barman Cloud). Every WAL segment ships to object storage as it closes. Combined with a periodic base backup, you can recover to any point in time within retention. RPO of seconds, RTO proportional to the amount of WAL to replay.

Volume snapshots (CSI) for fast restore. A snapshot of the Postgres data volume taken while Postgres is consistent (via pg_backup_start/pg_backup_stop or CNPG's snapshot-aware API) lets you clone the cluster to a new Pod in minutes instead of waiting for a full base backup restore. RPO is the snapshot cadence. RTO is seconds to minutes.

Logical dumps (pg_dump) are the third layer. Not for disaster recovery, for schema migration portability, cross-version upgrades, and per-schema cloning. Running pg_dump on a primary under load is costly; run it on a standby or against a snapshot clone.

The layered architecture:

Volume snapshots every 4-6 hours. Fast restore for "the index got corrupted" cases.
pgBackRest continuous archive, base backup once a week. Point-in-time recovery for "we dropped the table in production yesterday."
Weekly logical dump to cold storage. Insurance against the other two failing together, plus migration utility.

CNPG integrates the first two through its Backup and ScheduledBackup CRDs. Volume snapshots run via the CSI plugin of your storage driver (most modern CSI plugins support VolumeSnapshot). Barman Cloud runs through CNPG's backup.barmanObjectStore configuration; pgBackRest is a separate tool and is not configured through that field.

One warning: never rely on a single backup layer. An S3 misconfiguration that blocks pgBackRest also blocks your only recovery path. Snapshots live on the same storage class that's probably implicated in whatever caused the outage. Logical dumps have the smallest blast radius and take the longest to restore. Three layers, three failure modes, one coherent recovery story.

Closing

Five decisions, one theme: the control plane and the database can't ignore each other, but they also can't both own the same concern. Every one of these is a boundary question:

Failover: who detects, who promotes, who fences?
Storage: which layer replicates?
Disruption: who guards quorum, who schedules drains?
Connections: where does the pool live?
Backup: which layer is the authority for a given recovery time?

Run Postgres on Kubernetes the way CNPG expects it: K8s handles placement and networking, the operator handles the database state machine, and the two talk through a narrow interface. Run it without that contract and you discover, usually at 3 AM, that both sides are racing to fix the same outage.

Which of these hit your team first? The failover race is the most common. The storage decision is the one that bites cost hardest. The backup one is the one that actually wakes people up.

Next week's evergreens pair well with this story. Cilium Egress Gateway gives database clients a stable outbound IP when your Postgres cluster needs to reach an external service. Linux inode exhaustion is the filesystem-level failure that hits WAL-heavy workloads first.

- Ilia

Debezium in production: the failure modes the docs don't lead with

Ilia Gusev — Fri, 08 May 2026 14:01:30 GMT

Debezium is the CDC tool of record for most teams running event-driven architectures on top of traditional databases. The docs make it look straightforward: deploy Debezium Server, point it at your database, connect a sink, watch change events flow. It does work that way. Until one of three things happens and nobody warned you they were possible.

This isn't a getting-started post. Assumption: you've deployed Debezium, you know what CDC means, you understand the basic WAL/binlog reading model. This is the operational underside: the failure modes that define whether your CDC pipeline survives contact with production.

Fail-fast is a feature, not a bug

Debezium's first operational surprise for new operators: the moment the downstream broker is unreachable, Debezium halts. Not "log an error and retry." Not "buffer and continue." Literally stops reading the transaction log. Consumer lag (we'll use the word "lag" but the Debezium metric is MilliSecondsBehindSource) climbs linearly until the broker comes back.

This looks broken. It isn't. It's correct behavior, and once you understand why, you want it that way.

Debezium's design guarantee is that every committed database change reaches downstream in order and without loss. Delivery is at-least-once, so consumers should still tolerate an occasional replayed event after a restart. The moment you weaken the ordering and no-loss guarantee, CDC stops being a reliable event stream and becomes best-effort logging. If the broker is down and Debezium were to buffer in memory and later dump, you've just introduced reorder, uncontrolled duplication, and possible loss (when Debezium itself OOMs during the buffering).

Fail-fast preserves the guarantee. When the broker comes back, Debezium resumes from the last position it recorded. Zero loss, zero reorder, and at worst a few replayed events around the restart point. The cost is visible lag, which is exactly what you want your monitoring to catch.

The operational corollary: alert on MilliSecondsBehindSource. If it grows beyond your recovery-time objective, the broker or the pipeline is broken, not Debezium.
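The halt-and-resume behavior can be shown with a toy loop. This is not Debezium's code. It is a minimal Python sketch of the idea, with an invented BrokerDown exception and an in-memory list standing in for the transaction log: advance the offset only after a successful publish, and stop at the first failure rather than skipping or buffering.

```python
class BrokerDown(Exception):
    """Raised by the (simulated) broker client when publishing fails."""


def run_pipeline(log, publish, start_offset=0):
    """Publish log[start_offset:] in order and stop at the first failure.

    Returns the offset of the next unpublished event.
    """
    offset = start_offset
    while offset < len(log):
        try:
            publish(log[offset])
        except BrokerDown:
            break          # halt: no skipping, no in-memory buffering
        offset += 1        # advance only after a confirmed publish
    return offset


def make_broker(delivered, fail_on_calls=()):
    """Build a fake publish function that fails on the given call numbers."""
    calls = {"n": 0}

    def publish(event):
        calls["n"] += 1
        if calls["n"] in fail_on_calls:
            raise BrokerDown()
        delivered.append(event)

    return publish


log = ["e0", "e1", "e2", "e3"]
delivered = []
stopped_at = run_pipeline(log, make_broker(delivered, fail_on_calls={3}))
resumed_to = run_pipeline(log, make_broker(delivered), start_offset=stopped_at)
print(stopped_at, resumed_to, delivered)
```

While the broker is down the offset stays put, which is the "lag climbs linearly" picture. Once it recovers, the pipeline picks up at the same event it failed on.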

Poison messages: the DLQ is mandatory

A CDC event that a consumer can't parse (wrong type, unexpected null, schema mismatch) is a poison message. A naive consumer NACKs with requeue=true. The message goes back to the queue head. The consumer picks it up, fails again, NACKs again. Thousands of times a minute. The queue blocks behind one bad event.

The fix at the broker layer is a Dead Letter Exchange (DLX). Configure the main queue with x-dead-letter-exchange pointing to a DLX, and configure your consumer to NACK with requeue=false after N retries. RabbitMQ routes the failed message to the DLQ where an engineer can inspect it, while new events keep flowing.

The pattern applies uniformly. Kafka has dead-letter topics. Pulsar has dead-letter topics. NATS JetStream has similar constructs. Whatever broker you use, declare the dead-letter path at queue-create time, never as an afterthought.
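The retry-then-dead-letter flow is easy to see in a toy simulation. The sketch below uses an in-memory deque instead of a real broker, so nothing here is RabbitMQ client code. The exchange name cdc.dlx, the retry budget of 3, and the handle() function are all made up for illustration; x-dead-letter-exchange is the real queue argument.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry budget (the "N" in the text)

# Arguments a main queue would be declared with; "cdc.dlx" is a made-up name.
MAIN_QUEUE_ARGS = {"x-dead-letter-exchange": "cdc.dlx"}


def handle(event):
    """Stand-in consumer: fails on events whose 'id' is not an int."""
    if not isinstance(event.get("id"), int):
        raise ValueError("unparseable event")
    return event["id"]


def drain(events, max_attempts=MAX_ATTEMPTS):
    """Process events in order, dead-lettering any that keep failing."""
    queue = deque((event, 0) for event in events)
    processed, dead_letters = [], []
    while queue:
        event, attempts = queue.popleft()
        try:
            processed.append(handle(event))
        except ValueError:
            attempts += 1
            if attempts < max_attempts:
                # Equivalent of NACK with requeue=true: back to the queue head.
                queue.appendleft((event, attempts))
            else:
                # Equivalent of NACK with requeue=false: the broker would
                # route the message to the dead-letter exchange.
                dead_letters.append(event)
    return processed, dead_letters
```

Feed it [{"id": 1}, {"id": "two"}, {"id": 3}] and the bad event is retried three times, parked in dead_letters, and event 3 still gets processed. Remove the attempt cap and the loop never gets past the second event, which is the blocked queue described above.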

Common causes of poison messages in Debezium pipelines:

Consumer code written against an older schema version.
An SMT (Single Message Transform) bug on the Debezium side produces malformed output.
A database column changed from one type to another and consumers assumed the old type.

The third one brings us to the next failure mode.

Schema evolution: the quiet breaker

Databases aren't static. Eventually a DBA runs ALTER TABLE. Three cases, three different failure shapes:

ADD COLUMN. Debezium sees the DDL, picks up the new column, starts including it in the after block of JSON events. Consumers that ignore unknown fields (forward compatibility) keep working. Consumers that strictly validate schema fail on the first message.

DROP COLUMN. The field disappears from new messages. Consumers that require the field fail. Consumers with backward compatibility (default missing field to null or skip) keep working.

ALTER COLUMN TYPE. The most dangerous. If the database changes INT to TEXT, Debezium starts sending strings where consumers expected numbers. A validator driven by the schema that ships with the event passes (the schema now says string), and the application code behind it breaks on the type mismatch. Consumers with no validation at all crash wherever the value is first used as a number.

Without a Schema Registry (common in Kafka deployments, absent from RabbitMQ), the responsibility for compatibility lands entirely on consumer developers. Operational rule: every DDL in the source database must be reviewed against the CDC consumer list. Breaking changes (type changes, renames, drops of required fields) need a migration plan, not a deploy-and-hope.
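On the consumer side, the three cases map to three small rules: ignore unknown fields, default missing optional fields, and check types before use. A minimal sketch of such a reader follows. The CONTRACT table, the field names, and the ContractViolation exception are hypothetical; the only thing taken from Debezium is that row state arrives in the after block of the event.

```python
from typing import Any, Dict

# Hypothetical consumer contract: field name -> (expected type, required?)
CONTRACT = {
    "id": (int, True),
    "amount": (int, True),
    "note": (str, False),
}


class ContractViolation(Exception):
    """Raised when an event breaks the consumer's contract."""


def read_after(event: Dict[str, Any]) -> Dict[str, Any]:
    """Extract only the fields this consumer knows about from event['after']."""
    after = event["after"]
    row: Dict[str, Any] = {}
    for name, (expected, required) in CONTRACT.items():
        value = after.get(name)
        if value is None:
            if required:
                # DROP COLUMN on a field we depend on: fail loudly.
                raise ContractViolation(f"missing required field {name!r}")
            row[name] = None  # dropped optional field defaults to None
            continue
        if not isinstance(value, expected):
            # ALTER COLUMN TYPE: catch it here, not deep in business logic.
            raise ContractViolation(
                f"{name!r} is {type(value).__name__}, expected {expected.__name__}"
            )
        row[name] = value
    # Fields not in CONTRACT (ADD COLUMN) are ignored on purpose.
    return row
```

A failure raised here is a good candidate for the dead-letter path: the event is preserved for inspection and the rest of the stream keeps moving.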

The replication slot that will eat your database

This one is Postgres-specific and it's the one that takes databases down.

Debezium for Postgres uses logical replication slots. A slot is a Postgres object that tracks where a particular replica (Debezium, in this case) is in the WAL stream. As long as the slot exists, Postgres will not delete WAL segments beyond the slot's position. That's the whole point of slots: they guarantee no WAL is discarded before the replica consumes it.

The failure mode: Debezium Server dies (OOM, network partition, the RabbitMQ downtime from the first section, a misconfiguration). The slot stays. Postgres keeps WAL because the slot says "the replica will come back for these." WAL volume grows. Disk fills. Postgres, eventually, can't write anymore. The main database goes offline. Every application that depends on it fails.

I've seen this take down payment systems. It's not theoretical.

The saving parameter, introduced in Postgres 13, is max_slot_wal_keep_size. Set it (50 GB is a reasonable starting point) and Postgres enforces a hard limit. When unclaimed WAL exceeds the threshold, Postgres invalidates the slot, deletes the WAL, and keeps itself alive. Debezium loses its position and requires a fresh snapshot to resume. That's the correct trade: one data pipeline outage, resolved by reinitialization, versus a database outage that takes the whole business down.

Every Debezium-on-Postgres deployment needs max_slot_wal_keep_size set. The default is unlimited, which is the wrong default. Set the limit explicitly.
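Picking the number is mostly arithmetic: the limit divided by your WAL generation rate is how long Debezium can stay down before the slot is invalidated. A rough sketch, assuming a steady write rate (real workloads are burstier, so measure your own). The function name and the 2 MB/s rate in the usage note are illustrative, not recommendations.

```python
def hours_until_slot_invalidation(limit_gb: float, wal_mb_per_sec: float) -> float:
    """Estimate how long a stalled slot survives under max_slot_wal_keep_size.

    limit_gb: the configured limit, in GB (1 GB = 1024 MB here).
    wal_mb_per_sec: assumed steady WAL generation rate of the database.
    """
    if wal_mb_per_sec <= 0:
        raise ValueError("WAL rate must be positive")
    limit_mb = limit_gb * 1024
    return limit_mb / wal_mb_per_sec / 3600
```

At 50 GB and an assumed 2 MB/s of WAL, that's roughly 7.1 hours of outage budget. If your on-call response time is longer than that, either raise the limit (and make sure the disk can hold it) or accept the re-snapshot.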

The metrics dashboard that matters

JMX metrics from Debezium plus the broker's own metrics form the operational dashboard. The ones that earn their space:

From Debezium:

MilliSecondsBehindSource: the lag metric. Normal is milliseconds. Alert on seconds.
TotalNumberOfEventsSeen: throughput. Used for capacity planning and anomaly detection (sudden drop = upstream stopped).
RemainingTableCount (from the snapshot metrics): while the initial snapshot is running, how many tables are left.

From the broker (RabbitMQ in the example, same concepts on Kafka):

messages_unacknowledged: consumer activity. Growing = consumer stuck.
messages_ready: queue depth. Growing = consumers slower than producers.

From the database:

pg_replication_slots.confirmed_flush_lsn vs pg_current_wal_lsn(): the gap tells you exactly how much WAL is waiting for Debezium to consume. Growing = Debezium can't keep up or is down.
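The gap is a byte count once you decode the LSNs. Postgres prints an LSN as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit position. Inside the database, pg_wal_lsn_diff() does the subtraction for you; the sketch below does the same thing client-side, for exporters that only see the text values. The function names are illustrative.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL written but not yet confirmed by the slot's consumer."""
    return lsn_to_int(current_wal_lsn) - lsn_to_int(confirmed_flush_lsn)
```

Graph the result next to max_slot_wal_keep_size and you can see how close a stalled slot is to invalidation.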

Three layers of metrics, one pipeline. If any layer degrades, you need the signal before consumers notice.

Summary

Debezium works, and Debezium is deterministic about how it fails. The difficulty is that the failures aren't the kind docs lead with. They're operational: broker outages freeze the pipeline, bad events can block queues without DLQs, schema changes break consumers silently, and most dangerously, a dead Debezium can take your database down if Postgres slot limits aren't set.

None of these are bugs. They're the cost of the "exactly once, in order, no loss" guarantee Debezium provides. Running it in production means understanding the trade and configuring around it.

Checklist for a Debezium-on-Postgres deployment:

Set max_slot_wal_keep_size to a sane limit (default is unlimited).
Declare Dead Letter Exchanges or dead-letter topics on every downstream queue.
Monitor MilliSecondsBehindSource with page-level alerts.
Treat every source-database DDL as a consumer contract event.
Accept that a cold restart after slot invalidation requires a fresh snapshot.

For CDC concepts and when to use it at all, see the published Change Data Capture deep dive. For the broker side specifically, RabbitMQ in Production covers the message-delivery guarantees that pair with Debezium's.

Firecracker: the minimalism that runs your Lambda function

Ilia Gusev — Wed, 06 May 2026 14:00:29 GMT

Every AWS Lambda function you've ever invoked ran inside a microVM that started in less than 125 milliseconds, used under 5 MB of memory for the VMM itself, and was destroyed when your function returned. That microVM runs on a piece of software called Firecracker, written in Rust, open-sourced in 2018, and now quietly sitting under Lambda, Fargate, Fly.io, Kata Containers, and half the serverless infrastructure that bills you for single-digit milliseconds at a time.

Most engineers have heard the name. Very few have looked at what Firecracker actually is, why it exists, and where it fits in the boundary-choice conversation that now dominates multi-tenant isolation, AI-agent sandbox platforms, and untrusted-code execution at scale.

Here's the full picture.

The dilemma that didn't have a clean answer

Before 2018, two ways to isolate arbitrary user code:

Containers (Docker, LXC). Fast start, high density, shared kernel with the host. Process-level isolation through namespaces and cgroups. Strong enough for internal workloads, weak for genuinely untrusted code. "Container escape" is a legitimate attack category.

Traditional VMs (QEMU/KVM). Hardware isolation through a hypervisor. Strong security. Slow start, measured in seconds or minutes. Memory overhead in the hundreds of MB per instance.

AWS needed something for Lambda. Thousands of untrusted functions from different tenants per physical host. Millisecond start times. Strict isolation. Nothing on the shelf fit. QEMU was too heavy. Containers weren't strong enough.

Firecracker is the answer AWS built. A VMM that keeps the hardware isolation of a VM but strips out everything that isn't needed for ephemeral stateless workloads.

What got cut

Firecracker is minimalism as a security feature. The things QEMU emulates that Firecracker refuses to:

No USB, no PCI bus, no BIOS, no graphics. A running microVM sees a virtio-net device, a virtio-block device, a virtio-vsock device, a serial console, and a one-button keyboard controller used only for reboot. That's the full hardware surface.
No device passthrough. If you need a GPU, Firecracker is the wrong tool. Cloud Hypervisor or QEMU with VFIO handles that.
No live migration and no complex storage features. Snapshots were also missing for a long time, though they were added later.
No Windows guest support. Linux and OSv only.

Everything cut is attack surface removed. A minimal device model is a minimal set of bugs.

The architecture that makes sub-second boot work

Three choices explain the performance:

Rust at the foundation. Memory safety guarantees eliminate a whole class of bugs that plagued C-based hypervisors. Firecracker's CVE list is noticeably shorter than QEMU's, in part for this reason and in part because there is far less device code to get wrong.

API-driven, not CLI-driven. Firecracker exposes a REST API over a Unix socket. You PUT a machine configuration, PUT a boot source with the kernel image path, PUT a drive with the disk image path, then send an InstanceStart action. No per-VM command-line parsing, no shell. Orchestrators build against the API directly.

Jailer. A separate binary that sandboxes the Firecracker process itself using cgroups, namespaces, chroot, and seccomp-bpf syscall filters. If a guest escapes the microVM, it lands in a jailed process with minimal privileges. Two layers of isolation, both required.
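The API-driven flow is short enough to sketch with the Python standard library. Treat this as an illustration under assumptions rather than a reference client: the socket path, file paths, and VM sizing in the usage note are placeholders, and the helper names are made up. The endpoints and field names follow Firecracker's public API, which takes these calls as PUT requests.

```python
import http.client
import json
import socket
from typing import Any, Dict, List, Tuple

Request = Tuple[str, str, Dict[str, Any]]


def boot_requests(kernel_path: str, rootfs_path: str,
                  vcpus: int = 1, mem_mib: int = 128) -> List[Request]:
    """Build the ordered API calls that configure and start one microVM."""
    return [
        ("PUT", "/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("PUT", "/boot-source", {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        }),
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": False,
        }),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self._socket_path = socket_path

    def connect(self) -> None:
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self._socket_path)


def start_microvm(socket_path: str, kernel_path: str, rootfs_path: str) -> None:
    """Send the boot sequence to a running Firecracker process."""
    conn = UnixHTTPConnection(socket_path)
    try:
        for method, path, body in boot_requests(kernel_path, rootfs_path):
            conn.request(method, path, body=json.dumps(body),
                         headers={"Content-Type": "application/json"})
            response = conn.getresponse()
            response.read()  # drain the body so the connection can be reused
            if response.status >= 300:
                raise RuntimeError(f"{method} {path} failed: HTTP {response.status}")
    finally:
        conn.close()
```

Usage would look like start_microvm("/tmp/firecracker.sock", "/images/vmlinux", "/images/rootfs.ext4") against a Firecracker process already listening on that socket, with all three paths being placeholders. In production the process would be launched through the jailer first.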

The boot numbers are the headline:

Startup to running guest code: under 125 ms.
VMM memory overhead: under 5 MB per microVM.
Density on i3.metal: thousands of microVMs per physical host.

Built-in rate limiting for network and block I/O at the VMM layer means noisy-neighbor problems don't propagate across microVMs sharing a host.
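Firecracker's rate limiters are configured as token buckets, each with a size and a refill time. The sketch below shows the general token-bucket technique in a few lines; it is not Firecracker's implementation, and the class and method names are illustrative. Time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Minimal token bucket: `size` tokens refill linearly over `refill_time_ms`."""

    def __init__(self, size: int, refill_time_ms: int, now_ms: int = 0):
        self.size = size
        self.refill_time_ms = refill_time_ms
        self.tokens = float(size)  # start full
        self.last_ms = now_ms

    def try_consume(self, amount: int, now_ms: int) -> bool:
        """Take `amount` tokens (bytes or operations) if available."""
        elapsed = now_ms - self.last_ms
        self.last_ms = now_ms
        # Refill proportionally to elapsed time, capped at the bucket size.
        refill = self.size * elapsed / self.refill_time_ms
        self.tokens = min(float(self.size), self.tokens + refill)
        if amount > self.tokens:
            return False  # caller must wait; the neighbor is being throttled
        self.tokens -= amount
        return True
```

One bucket per device per microVM is what keeps a tenant that saturates its own budget from starving the others on the host.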

Where Firecracker actually runs in production

Four categories, plus the one people forget:

AWS Lambda. The default execution environment. Your function runs inside a Firecracker microVM that was ready before your request landed.
AWS Fargate. The task runtime for ECS and EKS. Firecracker under the hood, presenting a container API.
Fly.io. Entire platform built on Firecracker as the primitive.
Kata Containers. An OpenInfra Foundation project that runs standard OCI containers inside lightweight VMs for stronger isolation. Kata supports multiple VMMs; Firecracker is one of the popular backends. If your cluster has runtimeClassName: kata-fc, Firecracker is the boundary.
AI-agent sandboxes and sandbox-as-a-service platforms. E2B builds directly on Firecracker, and the emerging category of remote code-execution platforms (Daytona, Modal, and others) is converging on the same idea: a hard boundary around tool calls from agents or untrusted snippets from customer tenants. Strong boundary plus millisecond start is the combination that makes the category work.

Where Firecracker is the wrong tool

Three categories of workload where Firecracker is a poor fit:

GPU-dependent workloads. No PCI passthrough, no direct device access. ML inference and training stay on QEMU or bare metal.
Stateful databases. Disks are ephemeral by design. You can configure persistent storage, but you're working against the grain.
Windows or macOS guests. Not supported. Linux guest kernel only, with OSv as the other option for unikernel use cases.

If your use case needs any of these, look at Cloud Hypervisor (also Rust, newer, different design priorities), Kata with QEMU backend, or just QEMU directly.

The comparison that matters for platform decisions

A clean mental model for boundary strength:

Docker container - boundary: shared kernel + namespaces, start: ms, overhead: MB, untrusted-safe: NO
gVisor - boundary: userspace kernel, start: 100s of ms, overhead: tens of MB, untrusted-safe: yes (weak)
Firecracker microVM - boundary: KVM hypervisor, start: ~125 ms, overhead: 5 MB VMM, untrusted-safe: YES (strong)
QEMU VM - boundary: full hypervisor, start: seconds, overhead: 100s of MB, untrusted-safe: YES (strong)

The rows aren't interchangeable. Picking Firecracker over Docker is a conscious trade: stronger isolation for the overhead of a VMM. Picking Firecracker over QEMU is a different trade: less feature surface for faster boot and lower overhead.

For your own platform, the decision usually comes down to three questions:

Is the workload trusted or untrusted? (Trusted → Docker. Untrusted → microVM or VM.)
Does it need hardware passthrough or Windows? (Yes → QEMU. No → Firecracker.)
Does it need to start in under a second at scale? (Yes → Firecracker. No → QEMU is fine.)

Those three cover most real decisions.
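Written down as code, the three questions are a short decision function. This is a sketch of the rule of thumb above and nothing more; the function and parameter names are made up, and real platform decisions carry more inputs than three booleans.

```python
def choose_boundary(untrusted: bool,
                    needs_passthrough_or_windows: bool,
                    needs_subsecond_start: bool) -> str:
    """Apply the three questions in order and name an isolation boundary."""
    if not untrusted:
        return "Docker"  # trusted workload: shared kernel is acceptable
    if needs_passthrough_or_windows:
        return "QEMU"  # Firecracker has no device passthrough or Windows guests
    if needs_subsecond_start:
        return "Firecracker"  # untrusted and latency-sensitive at scale
    return "QEMU"  # untrusted, no special needs: QEMU is fine (Firecracker works too)
```

An untrusted function-as-a-service workload with no device needs lands on Firecracker; an untrusted GPU job lands on QEMU no matter how fast it needs to start.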

The operational pieces that aren't in the Getting Started docs

If you stand up Firecracker yourself (outside a managed platform), a few things matter:

Kernel choice for the guest. Firecracker documents a minimal Linux kernel config. Running a distro kernel inside works but wastes boot time. The "alpine-microvm" or custom-built minimal kernel is the right choice.
Jailer is not optional. Running Firecracker without the jailer is a production mistake. Every serious deployment runs jailer.
Snapshot/restore (added in later versions). Lets you boot a microVM once, snapshot its memory and device state, and restore fresh copies from that snapshot in milliseconds. A key primitive for warm-pool patterns in FaaS, AI-agent sandbox platforms, and any sandbox-as-a-service that needs sub-second cold start.
Networking model. Firecracker expects a tap device per microVM. At density, this becomes a host-networking concern, not a VMM concern. Common pattern: one physical host with thousands of tap devices bridged through a fast virtual switch.

Summary

Firecracker is what happens when a single use case (secure multi-tenant serverless) forces a complete rewrite of the VMM concept. The result is narrower than QEMU, stronger than a container, fast enough that boot latency stops dominating your worst-case tail.

If you're building a platform that runs untrusted code at scale, Firecracker belongs in your stack. If you're running standard workloads on standard clusters, Kata-with-Firecracker is an option for tenants you don't trust. And if you're watching the AI-agent sandbox and sandbox-as-a-service category, Firecracker-derived microVMs are the primitive that makes the rest of the security model work.

For the container-runtime context including the Wasm alternative at the other extreme of the boundary axis, see Tools From the Future. For K8s-native runtime options, eBPF Beyond Networking covers Tetragon as a complementary runtime-security primitive.