inference-cache

module
v0.0.0-...-ed02025 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 3, 2026 License: Apache-2.0

README

inference-cache

A Kubernetes-native cache plane for LLM inference.

Repository layout

One operator, split across two binaries plus the CRDs.

CRDs — the API

  • api/v1alpha1/ — Go types (CacheBackend; CachePolicy, CacheTenant, PromptTemplate, PDTopology, CacheIndex as they land) + generated deepcopy
  • config/ — generated CRD, RBAC, and sample manifests

inferencecache-controller (cmd/controller) — watches CRDs and provisions cache backends

  • cmd/controller/ — controller-runtime manager entrypoint
  • internal/controller/ — reconcilers
  • pkg/adapters/runtime/KVCacheRuntimeAdapters render the cache-server pod/service and inject engine/router pod config; the reconciler drives them through the in-package Registry

inferencecache-server (cmd/server) — gRPC policy server + cache-state index + metrics

  • cmd/server/ — gRPC + HTTP server entrypoint
  • pkg/server/ — gRPC service (LookupRoute, RenderTemplate, …), health, metrics
  • proto/ (+ generated stubs) — the gRPC contract
  • pkg/index/ — cache-state aggregator (CacheIndex)
  • pkg/render/ — mutable-slot prompt rendering engine (the wedge); importable library
  • pkg/adapters/engine/ — engine KV-event hook (feeds the index)

Sharedpkg/version/, hack/, dockerfiles/, .githooks/

Quick Start

make proto-gen
make build
make test

Run the server locally:

bin/server \
  --grpc-bind-address=:9090 \
  --http-bind-address=:8080 \
  --snapshot-bind-address=:8081 \
  --insecure-disable-auth   # local-dev only; production passes --allowed-controller-sa instead
curl -i http://localhost:8080/healthz   # liveness
curl -i http://localhost:8080/readyz    # readiness
curl -s http://localhost:8080/metrics   # Prometheus metrics (inferencecache_*)
curl -s http://localhost:8081/snapshot  # internal aggregate (auth-gated in production)
curl -i -XPOST http://localhost:8081/policy -d '{"version":3,"policies":[]}'  # controller push (auth-gated in production)

The server fails closed by default: omitting both --allowed-controller-sa and --insecure-disable-auth causes it to exit 2 with a stderr message. That keeps an operator who forgets the flag from accidentally shipping unauthenticated /snapshot + /policy endpoints on a real cluster.

Cluster Prerequisites

The controller serves admission webhooks — defaulting + validation for CacheBackend, plus a mutating Pod webhook that auto-injects the LMCache engine configuration into pods labeled to match a CacheBackend's spec.engineSelector — over TLS, so deploying it requires cert-manager v1.0+ in the target cluster. The default install (config/default) provisions a self-signed Issuer plus a Certificate for the webhook serving cert, and relies on cert-manager's cert-manager.io/inject-ca-from annotation to inject the CA bundle into the generated MutatingWebhookConfiguration and ValidatingWebhookConfiguration.

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml

Install

The default Kustomize overlay brings up both control-plane components:

  • inference-cache-controller-manager — the reconciler + admission webhooks.
  • inference-cache-server — the gRPC policy server (InferenceCache) and the HTTP /healthz, /readyz, /metrics probe surface, plus a dedicated controller-facing listener carrying /snapshot (controller read) and /policy (controller write), both gated by ServiceAccount bearer auth + a NetworkPolicy. Fronted by a ClusterIP Service inference-cache-server in the inference-cache-system namespace with named ports grpc:9090 (gRPC API), http:8080 (probes / metrics), and snapshot:8081 (controller-only /snapshot + /policy). The controller's CacheIndex poller scrapes http://inference-cache-server:8081/snapshot and the CachePolicy reconciler POSTs to http://inference-cache-server:8081/policy by default; both send the projected ServiceAccount token. Once both pods are Ready kubectl get cacheindex reports live cluster-wide cache state.
kubectl apply -k config/default
kubectl -n inference-cache-system wait --for=condition=Available deployment --all --timeout=180s
kubectl get cacheindex cluster-default -o yaml

Local Development Cluster

Create a kind cluster for controller development:

make dev-cluster

By default this creates or reuses a cluster named inference-cache. You can override the name and node image:

make dev-cluster KIND_CLUSTER=cache-dev KIND_NODE_IMAGE=kindest/node:v1.31.0

Common Targets

  • make build: build controller and server binaries
  • make test: run unit tests
  • make lint: run gofmt and go vet
  • make ci-lint: run golangci-lint
  • make proto-gen: regenerate protobuf Go code
  • make generate: regenerate Kubernetes deepcopy code
  • make manifests: regenerate CRD, RBAC, and webhook manifests
  • make image-build: build controller and server images

Documentation

Design docs live under docs/:

Contributor guide: CONTRIBUTING.md (layout, naming rule, push/PR gates).

License

Apache-2.0

Directories

Path Synopsis
api
v1alpha1
Package v1alpha1 contains API Schema definitions for the inferencecache v1alpha1 API group.
Package v1alpha1 contains API Schema definitions for the inferencecache v1alpha1 API group.
cmd
controller command
kvevent-subscriber command
Command kvevent-subscriber runs as a sidecar next to a vLLM engine replica: it subscribes to the engine's KV-cache events over ZMQ and reports cache state to the inferencecache-server over gRPC.
Command kvevent-subscriber runs as a sidecar next to a vLLM engine replica: it subscribes to the engine's KV-cache events over ZMQ and reports cache state to the inferencecache-server over gRPC.
server command
hack
verify-samples command
Package main is the verify-samples CI helper.
Package main is the verify-samples CI helper.
internal
webhook/pod
Package pod is the controller-owned mutating admission webhook that auto- wires user-provided inference engine pods to a matching cache backend — either a managed backend the controller provisions (LMCache today) or an External backend whose lifecycle the operator owns.
Package pod is the controller-owned mutating admission webhook that auto- wires user-provided inference engine pods to a matching cache backend — either a managed backend the controller provisions (LMCache today) or an External backend whose lifecycle the operator owns.
webhook/v1alpha1
Package v1alpha1 hosts the admission webhooks (defaulting + validating) for the operator-facing v1alpha1 CRDs: CacheBackend, CachePolicy, and CacheTenant.
Package v1alpha1 hosts the admission webhooks (defaulting + validating) for the operator-facing v1alpha1 CRDs: CacheBackend, CachePolicy, and CacheTenant.
pkg
adapters/engine
Package engine is the KV-event subscriber.
Package engine is the KV-event subscriber.
adapters/runtime
Package runtime is the controller-owned runtime-adapter seam: the plug-point that keeps engine-specific cache wiring out of the core CacheBackend reconciler.
Package runtime is the controller-owned runtime-adapter seam: the plug-point that keeps engine-specific cache wiring out of the core CacheBackend reconciler.
adapters/runtime/external
Package external is the runtime adapter for CacheBackend{type: External}: the controller does NOT provision pods for the cache, the operator points the CR at a pre-existing remote cache they manage themselves, and the adapter wires engine pods to that endpoint with the same engine wire format the managed-LMCache path uses (see pkg/adapters/runtime/internal/enginewire).
Package external is the runtime adapter for CacheBackend{type: External}: the controller does NOT provision pods for the cache, the operator points the CR at a pre-existing remote cache they manage themselves, and the adapter wires engine pods to that endpoint with the same engine wire format the managed-LMCache path uses (see pkg/adapters/runtime/internal/enginewire).
adapters/runtime/internal/enginewire
Package enginewire holds the engine-side wire format shared by every runtime adapter that fronts an LMCache-compatible cache (the in-tree vLLM+LMCache adapter and the External passthrough adapter today; future adapters that also speak the LMCache connector protocol can import it the same way).
Package enginewire holds the engine-side wire format shared by every runtime adapter that fronts an LMCache-compatible cache (the in-tree vLLM+LMCache adapter and the External passthrough adapter today; future adapters that also speak the LMCache connector protocol can import it the same way).
index
Package index is part of inferencecache-server: the cluster cache-state aggregator (the CacheIndex), populated from engine KV events and queried by LookupRoute.
Package index is part of inferencecache-server: the cluster cache-state aggregator (the CacheIndex), populated from engine KV events and queried by LookupRoute.
render
Package render is the mutable-slot prompt rendering engine (the "wedge"): it turns templated prompts into stable cache keys so a gateway's cache-aware routing matches on real prompts.
Package render is the mutable-slot prompt rendering engine (the "wedge"): it turns templated prompts into stable cache keys so a gateway's cache-aware routing matches on real prompts.
server/auth
Package auth provides HTTP middleware that gates the policy server's internal controller-facing endpoints (/snapshot and /policy) on a Kubernetes ServiceAccount bearer token.
Package auth provides HTTP middleware that gates the policy server's internal controller-facing endpoints (/snapshot and /policy) on a Kubernetes ServiceAccount bearer token.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL