Issue #020 - Image Preload Operator: zero-second cold start, even for 8GB images

pre-cache DaemonSet, image warm pool, Stargz callback, sub-second startup for 8GB images

Jun 02, 2026

Your inference pod schedules onto a fresh node. The image is 8GB. The pod sits in ContainerCreating for ninety seconds while the kubelet pulls it. Issue #15 was about why those ninety seconds are the way they are. This issue is about how to skip them.

This is the closer for the image-distribution series. Issue #3 looked at Stargz, which made cold start fast by being lazy - read what the container actually touches, ignore the other 94%. Issue #1 looked at Spegel, which turned every node into a peer and let the cluster share layers over its own network instead of hammering the registry. Each bets on a different shape of the same problem. The Image Preload Operator makes a third bet, the most boring and the most effective one: have the bytes already on the node before any pod that needs them gets scheduled.

💎 Hidden Gem: Image Preload Operator

A DaemonSet that pulls images you haven't asked for yet

If you've ever run kubectl describe pod on a stuck inference workload and watched the Pulling image event sit there for over a minute, you already know the shape of the problem. The kubelet's pull is sequential, the registry is far, and your pod's startup latency is whatever number sits between the request and the first byte of the container being usable on disk.

The trick the operator pattern uses is not clever in the technical sense. It runs a DaemonSet on every node (or on a labeled subset), the DaemonSet calls into the container runtime - containerd, CRI-O, or Docker, whichever the cluster runs - and asks it to pull a configured list of images. The runtime stores those images in its local image cache, the same cache the kubelet would use anyway. When a pod for one of those images lands on the node later, the kubelet finds the image already present, sees imagePullPolicy: IfNotPresent, and skips the pull entirely. The container starts in whatever time it takes to set up cgroups and namespaces. For most workloads that's under a second.
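The whole trick hangs on one decision the kubelet makes per container. Here is a minimal sketch of that decision, assuming only the three documented imagePullPolicy values; the function name is mine, not anything from the kubelet source.

```python
def needs_pull(policy: str, image_present: bool) -> bool:
    """Return True when the kubelet would ask the runtime to pull the image."""
    if policy == "Always":
        # The registry is contacted on every start, warm cache or not.
        return True
    if policy == "Never":
        # No pull is attempted; the container fails to start if the image is missing.
        return False
    if policy == "IfNotPresent":
        # The case preloading targets: a warm node skips the pull entirely.
        return not image_present
    raise ValueError(f"unknown imagePullPolicy: {policy}")


# A warm node with IfNotPresent goes straight to container creation.
warm_node_pulls = needs_pull("IfNotPresent", image_present=True)
cold_node_pulls = needs_pull("IfNotPresent", image_present=False)
```

Worth keeping in mind when you read it: an image tagged :latest (or with no tag) defaults to Always, so the policy has to be set explicitly for the warm path to skip the registry round trip.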

The most popular implementation of this pattern is kube-fledged, which exposes the warm cache as a Kubernetes-native CRD called ImageCache. You write an ImageCache resource, the operator reconciles it into Jobs that run against the right nodes, the Jobs pull the images, the operator tracks per-node status, and a kubectl get imagecache tells you whether every node in the pool has the bytes. There are a handful of other implementations - the OpenKruise project ships a similar primitive called ImagePullJob, backed by a per-node NodeImage resource, and several teams just roll their own DaemonSet around a one-line crictl pull loop. They all have the same shape underneath.

What it's not

People keep filing it next to things it only resembles. A registry mirror is Spegel's job - the operator never sits in the pull path or proxies anything, it just kicks the kubelet's runtime into pulling early.

Stargz is the lazy-filesystem one, and the operator isn't that either. It doesn't touch how the image gets unpacked or read; the bytes land on disk exactly as they always would, and the only thing that shifts is the timing.

A baked AMI buries the image inside the node image itself, so the node has to be rebuilt whenever the image changes. The operator pulls dynamically instead. Push a new tag at noon and the next reconcile cycle lands it on every node, node image untouched.

Why the bet pays

The bet pays when image pulls are predictable. AI/ML inference is the textbook case. You run the same model server image on dozens of GPU nodes, the image is 6-12GB, and the cold start delta between "pull and run" and "already there, just run" is the difference between an autoscaler that responds in two minutes and one that responds in five seconds. The same logic holds for Spark executors and CI runner pools, or stateful databases that share a base image - anywhere the image set stays small and you know it ahead of time.

The bet doesn't pay when the image set is large and unpredictable. A multi-tenant cluster with five hundred different application images per node pool will not benefit from preloading - you'd burn the disk and most of the cached images would never be used. That's the Stargz case. Or the Spegel case, if you've got enough nodes that one will already have the image when another needs it.

The thing nobody mentions

The operator hands you a side benefit that sounds boring until you've needed it: a programmatic way to ask whether a given node has a given image. Once ImageCache.status carries per-node state, an admission policy can reject a workload whose declared image isn't warm on its target nodes yet, instead of letting it land and pull cold. Pre-flight checks before scaleup get easy the same way, and the Grafana panel that screams when warm-pool drift turns real basically writes itself.

Without the operator, that question lives in ssh-into-the-node-and-grep-crictl-images territory. The operator turns it into a kubectl get. Boring on a normal day - but I've reached for it at 3am more than once.
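To make the "is this pool warm?" question concrete, here is a small sketch of the check a pre-flight script might run. The state shape (node name mapped to a list of cached image references) is a hypothetical stand-in for whatever per-node status your operator exposes, not a documented schema.

```python
from typing import Dict, List

# Hypothetical shape: node name -> image references reported as cached on it.
NodeState = Dict[str, List[str]]


def nodes_missing(state: NodeState, image: str) -> List[str]:
    """Return the sorted names of nodes that do not report `image` as cached."""
    return sorted(node for node, images in state.items() if image not in images)


def pool_is_warm(state: NodeState, image: str) -> bool:
    """A pool is warm when it has at least one node and none is missing the image."""
    return bool(state) and not nodes_missing(state, image)


LLAMA = "registry.example.com/inference/llama-3-70b:v2.4.1"
state = {
    "node-gpu-01": [LLAMA],
    "node-gpu-02": [],  # still cold
}
cold_nodes = nodes_missing(state, LLAMA)
ready = pool_is_warm(state, LLAMA)
```

A scale-up gate would hold while cold_nodes is non-empty, and a dashboard would plot its length.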

🔬 Trace: how the warm cache actually fills

The first ImageCache we wrote

Ours came out of a model rollout that kept missing its autoscaling target: new inference pods sat in ContainerCreating long enough that the request queue backed up before any of them were ready to serve. The fix was an ImageCache, and this is close to the one we started with:

apiVersion: kubefledged.io/v1alpha2
kind: ImageCache
metadata:
  name: ml-inference-models
  namespace: kube-fledged
spec:
  cacheSpec:
    - images:
        - registry.example.com/inference/llama-3-70b:v2.4.1
        - registry.example.com/inference/mistral-large:v1.8.0
      nodeSelector:
        node-role.kubernetes.io/gpu: "true"
    - images:
        - registry.example.com/runtime/triton:24.05-py3
      nodeSelector:
        node-role.kubernetes.io/gpu: "true"
  imagePullSecrets:
    - name: registry-creds

Writing it, the only real decision was the cacheSpec list, which maps image sets to node selectors. We pointed the model images at the GPU pool and kept them off everything else, so no node would burn disk on an image it was never going to run. Auth I'd braced for and it turned out to be nothing: imagePullSecrets is the same field the pod specs already used, so the private registry just worked.

Then we applied it and watched what happened. The controller picked up the new ImageCache, spun one Job per node-image pair pinned to its node, and each Job reached into that node's CRI socket and asked the runtime to pull. The part I hadn't expected was the bookkeeping. Every node wrote its result back into .status.nodes[], so checking whether the pool was warm became one query instead of an ssh-and-grep tour of the whole fleet.
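The fan-out from one ImageCache to a pile of per-node pulls is easy to picture in code. This is a sketch of the expansion as described above, assuming plain label-equality matching for nodeSelector; the function names and the node labels are illustrative, not the controller's actual code.

```python
from typing import Dict, List, Tuple


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """A node matches when it carries every key/value pair in the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


def plan_pulls(cache_spec: List[dict], nodes: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Expand cacheSpec entries into a sorted, de-duplicated list of (node, image) pairs."""
    pairs = set()
    for entry in cache_spec:
        selector = entry.get("nodeSelector", {})  # an empty selector matches every node
        for node, labels in nodes.items():
            if matches(selector, labels):
                for image in entry["images"]:
                    pairs.add((node, image))
    return sorted(pairs)


gpu = {"node-role.kubernetes.io/gpu": "true"}
cache_spec = [
    {"images": ["registry.example.com/inference/llama-3-70b:v2.4.1",
                "registry.example.com/inference/mistral-large:v1.8.0"],
     "nodeSelector": gpu},
    {"images": ["registry.example.com/runtime/triton:24.05-py3"],
     "nodeSelector": gpu},
]
nodes = {"node-gpu-04": dict(gpu), "node-cpu-01": {}}
plan = plan_pulls(cache_spec, nodes)
```

With one GPU node and one CPU node, the plan holds three pairs, all on node-gpu-04, which is the "keep them off everything else" behavior we wanted.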

Inside the Job

When I went digging into how the Job actually pulled, there was less to it than I'd assumed. The socket is the containerd one under /run/containerd on most of our nodes, or the CRI-O equivalent on the rest, and the Job's pod mounts it as a hostPath volume and shells out to crictl pull. It's the exact code path the kubelet itself takes when a pod creates demand for an image, same socket and same content store. The only thing that changes is when it runs.
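For the roll-your-own crowd, the pull loop really is about this small. A sketch, assuming crictl is on the PATH inside the pod and that the socket sits at containerd's usual default path (the full path is my assumption; adjust it for CRI-O). It defaults to a dry run so it can be read and tested without a runtime present.

```python
import subprocess
from typing import List

# Assumed default containerd socket path; not taken from the article.
CONTAINERD_SOCKET = "unix:///run/containerd/containerd.sock"


def crictl_pull_argv(image: str, endpoint: str = CONTAINERD_SOCKET) -> List[str]:
    """Build the argv for pulling one image through the CRI socket."""
    return ["crictl", "--runtime-endpoint", endpoint, "pull", image]


def pull_all(images: List[str], endpoint: str = CONTAINERD_SOCKET,
             dry_run: bool = True) -> List[List[str]]:
    """Pull each image in turn; with dry_run=True only return the commands."""
    commands = [crictl_pull_argv(image, endpoint) for image in images]
    if not dry_run:
        for argv in commands:
            # check=True raises CalledProcessError when a pull fails.
            subprocess.run(argv, check=True)
    return commands


commands = pull_all(["registry.example.com/runtime/triton:24.05-py3"])
```

Private registries need credentials passed to the runtime as well, which this sketch leaves out.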

The question I kept circling back to was why a Job at all, instead of an init container in the workload pod. We tried the init-container version first. It pulled at pod-creation time, which was the one moment we were trying to get ahead of - we wanted the bytes on disk before the pod existed - and every replica ended up pulling on its own, with nowhere to look to see who was warm or to hold a scaleup until the cache caught up. The operator pulls before any pod exists and keeps every node's state in a single object, and that was the whole reason we moved off the init container.

The `kubectl describe` view, with and without

Without the operator, on a fresh node:

Events:
  Type    Reason     Age  From               Message
  ----    ------     ---  ----               -------
  Normal  Scheduled  92s  default-scheduler  Successfully assigned ml/inference-7c4 to node-gpu-04
  Normal  Pulling    91s  kubelet            Pulling image "registry.example.com/inference/llama-3-70b:v2.4.1"
  Normal  Pulled     14s  kubelet            Successfully pulled image "registry.example.com/inference/llama-3-70b:v2.4.1" in 1m17s (1m17s including waiting)
  Normal  Created    13s  kubelet            Created container inference
  Normal  Started    12s  kubelet            Started container inference

77 of those seconds sat inside Pulling. Every other step in the list takes about a second.

The same pod, on a node the operator had already warmed:

Events:
  Type    Reason     Age  From               Message
  ----    ------     ---  ----               -------
  Normal  Scheduled  3s   default-scheduler  Successfully assigned ml/inference-7c4 to node-gpu-04
  Normal  Pulled     2s   kubelet            Container image "registry.example.com/inference/llama-3-70b:v2.4.1" already present on machine
  Normal  Created    2s   kubelet            Created container inference
  Normal  Started    1s   kubelet            Started container inference

No Pulling event at all. The kubelet asks the runtime, the runtime says "already there," and the pod moves on. The 60-90 seconds Issue #15 spent dissecting are simply gone.
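If you want that number on a dashboard rather than in your head, the Pulled event message carries it. A small sketch that extracts the pull time from a message in the format shown in the cold-node events above; the helper names are mine, and the parser only handles the h/m/s/ms units.

```python
import re

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_go_duration(text: str) -> float:
    """Convert a Go-style duration such as '1m17s' or '850ms' into seconds."""
    # 'ms' is listed before 'm' and 's' so that '850ms' is not read as minutes.
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|h|m|s)", text)
    if not parts:
        raise ValueError(f"not a duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def pull_seconds(message: str) -> float:
    """Extract the pull time from a 'Successfully pulled image ... in <duration>' message."""
    match = re.search(r'Successfully pulled image ".*?" in (\S+)', message)
    if match is None:
        raise ValueError("no pull duration in message")
    return parse_go_duration(match.group(1))


message = ('Successfully pulled image "registry.example.com/inference/llama-3-70b:v2.4.1" '
           'in 1m17s (1m17s including waiting)')
seconds = pull_seconds(message)
```

On a warmed node the message reads "already present on machine" instead, so pull_seconds raises, which is itself a usable signal that the pull was skipped.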

Where this falls over

Sounds clean. It is, right up until it isn't.

Image GC. We lost a 12GB Llama image to this on a Friday afternoon, and it took an embarrassing while to work out why. The kubelet runs its own garbage collector against the runtime's image store, governed by imageGCHighThresholdPercent (default 85%) and imageGCLowThresholdPercent (default 80%). Once disk on a node crosses the high watermark, the kubelet evicts unused images until usage drops back under the low one, and "unused" here means "not referenced by any running container." A freshly preloaded image, before any pod has landed on it, is exactly that: referenced by nothing. The GC was built to reap it. Tight disk plus a preloaded image, and the bytes you just paid for are gone before the workload that needed them ever schedules.
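Here is a simplified sketch of that eviction decision, to show why a preloaded image is first in line. It assumes the two thresholds described above and oldest-first eviction of unused images; it ignores the kubelet's minimum image age and other details, and the image names and sizes are made up for illustration.

```python
from dataclasses import dataclass
from typing import List

GB = 10**9


@dataclass
class Image:
    name: str
    size_bytes: int
    in_use: bool       # referenced by a container on the node
    last_used: float   # timestamp; lower means older


def images_to_evict(images: List[Image], capacity: int, available: int,
                    high_pct: int = 85, low_pct: int = 80) -> List[str]:
    """Return the names of images GC would remove, oldest unused first."""
    usage_pct = 100 - (available * 100) // capacity
    if usage_pct < high_pct:
        return []  # below the high watermark: GC does nothing
    # Free enough to bring usage back down to the low watermark.
    to_free = capacity * (100 - low_pct) // 100 - available
    evicted, freed = [], 0
    for image in sorted((i for i in images if not i.in_use), key=lambda i: i.last_used):
        if freed >= to_free:
            break
        evicted.append(image.name)
        freed += image.size_bytes
    return evicted


images = [
    Image("old-debug-tools", 2 * GB, in_use=False, last_used=50.0),
    Image("llama-3-70b", 12 * GB, in_use=False, last_used=100.0),  # preloaded, no pod yet
    Image("triton", 6 * GB, in_use=True, last_used=10.0),
]
# 100GB disk with 12GB free is 88% used, over the 85% high watermark.
evicted = images_to_evict(images, capacity=100 * GB, available=12 * GB)
```

Here the GC needs 8GB back, the 2GB debug image isn't enough, and the preloaded model goes next. Flip its in_use flag to True and it survives.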

There's no clean fix upstream. The pragmatic move is a sentinel container: a tiny do-nothing pod per cached image that references the image, so the GC counts it as in-use. kube-fledged ships this out of the box, and after I watched it save a node sitting at 91% disk that would otherwise have reaped its model image, I stopped thinking about GC thresholds at all. We still lower thresholds and oversize disks, but that's insurance against a GC contract that was never written with preload in mind, not the fix.

Tag mutability. A CVE patch of ours quietly never reached production for two days, and preload was the reason. Push a new image under a tag that's already cached - rebuild nginx:1.25 overnight, say - and the nodes keep serving yesterday's bits. The kubelet sees the tag already present and asks no further questions, so the "rollout" becomes a no-op nobody thought to verify. Ours reported "all nodes cached" the entire time, while every node ran the vulnerable version. Preload by digest where you can, or wire a periodic re-pull on a cadence the security team owns; the :latest-is-evil argument from Issue #1 only gets sharper here, because preload makes the staleness sticky.
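A cheap guard is to lint the image list before it ever reaches the cluster. A minimal sketch, assuming sha256 digests only; the function names and the example references are hypothetical.

```python
import re
from typing import List

# Matches a trailing '@sha256:<64 hex chars>' on an image reference.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(ref: str) -> bool:
    """True when the reference ends in an immutable sha256 digest."""
    return _DIGEST.search(ref) is not None


def unpinned(refs: List[str]) -> List[str]:
    """Return the references that rely on a mutable tag."""
    return [ref for ref in refs if not is_digest_pinned(ref)]


fake_digest = "sha256:" + "a" * 64  # placeholder digest for illustration only
refs = [
    "nginx:1.25",
    "registry.example.com/inference/llama-3-70b@" + fake_digest,
]
mutable = unpinned(refs)
```

Run it in CI against the ImageCache manifest and fail the build when mutable is non-empty, or at least route those entries to the periodic re-pull.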

Pull storms on rollout. The first time you apply a large ImageCache, every node pulls every listed image at roughly the same moment. A hundred nodes and a 10GB image means a one-terabyte burst landing on your registry at once. Staging is the cheap mitigation: roll the ImageCache out to a subset of nodes, watch the registry breathe, then widen it. The better one is to pair the operator with Spegel, so the first node pulls from upstream and every other node grabs the layer from a peer over the cluster network.
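The arithmetic behind the storm, and what staging buys you, fits in a few lines. A sketch with hypothetical helper names; the batch size of ten is an arbitrary example, not a recommendation.

```python
from typing import List


def registry_burst_gb(nodes: int, image_gb: float) -> float:
    """Data requested from the registry when `nodes` nodes pull at the same time."""
    return nodes * image_gb


def staged_batches(node_names: List[str], batch_size: int) -> List[List[str]]:
    """Split the fleet into rollout batches of at most `batch_size` nodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [node_names[i:i + batch_size] for i in range(0, len(node_names), batch_size)]


fleet = [f"node-gpu-{n:02d}" for n in range(100)]
all_at_once = registry_burst_gb(len(fleet), 10)      # 1000 GB, the one-terabyte burst
batches = staged_batches(fleet, batch_size=10)
per_batch = registry_burst_gb(len(batches[0]), 10)   # 100 GB at a time instead
```

Staging doesn't shrink the total, it only flattens the peak. The registry still serves the full terabyte unless something like a peer mirror takes most of it off the wire.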

The "warm cache + P2P mirror" pattern is the hybrid most teams who run this seriously end up at. We'll come back to it in the showdown.

🆚 Showdown: Stargz vs Spegel vs Preload

Three bets, one problem

Cold start latency is one problem with three philosophically different bets pointed at it.

Stargz (Issue #3) bets on laziness: don't pull what the container never reads. The image mounts as a lazy filesystem and bytes arrive from the registry on demand, so a container starts in a second or two even on a cold node, even at multi-gigabyte sizes. The price is FUSE in the I/O path and a standing dependency on the registry for chunks you haven't fetched yet.

Spegel (Issue #1) goes after locality instead. If the registry is the bottleneck, turn the nodes into the registry: each one serves layers it already has to its peers over the cluster network. The first pull anywhere still hits upstream, and everything after that runs at LAN speed.

Preload (this issue) is the boring one. The pull still happens, it just happens before a pod wants the image - by the time the scheduler picks a node, the bytes are already sitting there. The bill is disk and the ongoing chore of keeping ImageCache honest against what's actually deployed, and in exchange the whole latency tail disappears.

Where each wins, sharply

We've leaned on all three in production at one point or another, and the dividing lines turned out sharper than the project READMEs let on.

Stargz is the right call when images are small and land on many different nodes for short jobs, the CI-runner and serverless-backend end of the spectrum. The image set there is wide and shallow, you can't predict what to preload, and lazy loading is the only thing that keeps up.

Spegel earns its place on large clusters where the same image set rotates across hundreds of nodes: multi-tenant platforms and big SaaS fleets, where you're already paying for inter-node bandwidth and the registry has quietly become the bottleneck. Once one peer has a layer, the marginal cost of the next node pulling it falls to almost nothing.

Preload, the one this issue is about, pays off in the predictable case: the same big images going to the same nodes over and over. That's our AI inference fleet and the GPU training pools, plus Spark jobs and the stateful databases we keep on dedicated nodes. The set is narrow and stable, a fast cold start is worth real money, and the pre-pull cost can run off-hours when nobody's watching.

The hybrid that actually ships

Most teams that run this at scale don't pick one. They pair Preload with Spegel.

The first time the operator pulls an image, one node in the cluster talks to the upstream registry, pulls the bytes, and caches them. Spegel indexes that node's layers and announces them. When the operator's DaemonSet on every other node starts its pull, Spegel intercepts the request, sees that a peer already has the layer, and serves it over the cluster network instead. The registry sees one pull instead of a hundred. The cluster gets warm everywhere in the time it takes to copy bytes between two nodes over a 10Gbps NIC.
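To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the network link is the only bottleneck, which ignores disk write speed and layer unpacking, so treat the result as a floor rather than a prediction; the function names and the efficiency knob are mine.

```python
def transfer_seconds(image_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds to move `image_gb` gigabytes over a `link_gbps` gigabit/s link."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    return image_gb * 8 / (link_gbps * efficiency)  # 8 bits per byte


def upstream_pulls(nodes: int, peer_sharing: bool) -> int:
    """How many nodes fetch from the upstream registry during a warm-up."""
    return 1 if peer_sharing and nodes > 0 else nodes


floor_seconds = transfer_seconds(8, 10)          # an 8GB image at 10Gbps line rate
without_peers = upstream_pulls(100, peer_sharing=False)
with_peers = upstream_pulls(100, peer_sharing=True)
```

Against the ninety-second cold pull this issue opened with, a floor of a few seconds per node is the gap the hybrid is chasing.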

Stargz fits in as a third layer for the long tail. Workloads that don't fit your ImageCache declaration - because they're new, or one-off, or some tenant pushed something the platform didn't know about - still start fast because Stargz makes the cold pull lazy, provided those images are published in eStargz format. Beyond that conversion step you've spent very little extra operational effort, and you've turned the cold-start tail from a multi-minute outlier into a startup of a second or two.

That's where the series lands. Issue #3 was the smartest single technique, Issue #1 the smartest distribution model, and this one is just the bluntest instrument in the drawer: have the bytes there already. Put all three together and the "first 60 seconds" problem from Issue #15 stops being a problem at all.

Links

Stargz Snapshotter
Spegel: cluster-local OCI registry mirror
Podo Stack: Issue #3 - Lazy Pull, Smart Scale (Stargz)
Podo Stack: Issue #1 - Spegel, Pixie, and why :latest is evil
Podo Stack: Issue #15 - a pod's first 60 seconds

Issue #21 picks up a parallel thread. If you can preload the bytes onto the node, the next question is what happens when the node itself is the image. Talos and the immutable-OS school of thought treat the whole host as an artifact you replace rather than configure. The arrow keeps shortening, and the cold start keeps moving upstream.

- Ilia

Podo Stack
