Positioning & comparison

This page is the "should I use PeerCache?" guide: what it is, how it differs from other KV caches, what it deliberately gives up, and where it fits.

What PeerCache is

PeerCache is a decentralized, peer-to-peer, RDMA zero-copy L3 (HiCache) storage backend for SGLang. Its one job is cross-request, cross-node KV (prefix) cache reuse. A producing node writes KV pages into its own local pool and publishes a tiny location record into a consistent-hash directory sharded across all nodes. Any other node looks the key up in that directory and pulls the bytes with a one-sided RDMA READ straight into its own registered buffer.

  • No central master, no managed data pool. The directory is a DHT; the KV bytes stay on the node that produced them.
  • Plugs into SGLang as --hicache-storage-backend dynamic — no SGLang patch.
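
As a rough illustration of the publish and lookup flow described above, the following Python sketch models the directory as a consistent-hash ring with virtual nodes. It is not PeerCache's implementation: the names `HashRing`, `Cluster`, `publish`, and `fetch` are hypothetical, plain dictionaries stand in for the registered memory pools, and a dictionary read stands in for the one-sided RDMA READ.

```python
import bisect
import hashlib


def _h(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        points = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._hashes = [h for h, _ in points]
        self._owners = [n for _, n in points]

    def owner(self, key: str) -> str:
        """The first ring point clockwise from the key's hash owns its record."""
        idx = bisect.bisect_right(self._hashes, _h(key)) % len(self._hashes)
        return self._owners[idx]


class Cluster:
    """Toy model: KV bytes stay on the producer, only a location record moves."""

    def __init__(self, nodes):
        self.ring = HashRing(nodes)
        self.pools = {n: {} for n in nodes}      # KV pages, kept on the producer
        self.directory = {n: {} for n in nodes}  # directory shard held by each node

    def publish(self, producer: str, key: str, page: bytes) -> None:
        self.pools[producer][key] = page                      # local write
        self.directory[self.ring.owner(key)][key] = producer  # tiny location record

    def fetch(self, key: str):
        producer = self.directory[self.ring.owner(key)].get(key)
        if producer is None:
            return None                    # no record: cache miss
        return self.pools[producer][key]   # stands in for the RDMA READ


cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.publish("node-a", "prefix:sys-prompt-v1", b"kv-page-bytes")
assert cluster.fetch("prefix:sys-prompt-v1") == b"kv-page-bytes"
assert cluster.fetch("prefix:unknown") is None
```

The point of the model is that the node answering the directory lookup and the node holding the bytes are chosen independently: the first by hashing the key, the second by whoever produced the page.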

What PeerCache is not

  • Not a PD transfer engine. It does not move the per-request prefill→decode KV handoff; that latency-critical GPU→GPU path is what Mooncake / NIXL do via --disaggregation-transfer-backend. PeerCache is orthogonal to it.
  • Not a centralized store (by default). P2P mode has no master / managed data pool. Set mode=centralized to run dedicated storage servers (peercache-storage-server) that hold KV bytes and directory shards while inference nodes are clients — a Mooncake Store–like layout without a separate metadata master.

Two orthogonal axes — don't conflate them

| | KV / prefix reuse (PeerCache) | PD P→D handoff (Mooncake/NIXL) |
|---|---|---|
| Scope | Across requests / nodes | Within one request |
| Goal | Skip recomputing shared prefixes | Hand prefill's KV to decode |
| Latency | Cache-style; host staging OK | Latency-critical; GPU→GPU direct |
| SGLang knob | --hicache-storage-backend | --disaggregation-transfer-backend |

A PD cluster typically uses both: PeerCache for prefix reuse on the prefill tier, and Mooncake/NIXL for the P→D handoff.

How PeerCache compares to centralized KV caches

Compared with master-coordinated / centralized-metadata KV stores (e.g. Mooncake Store, LMCache in distributed mode):

| Dimension | Centralized stores | PeerCache |
|---|---|---|
| Metadata | Central master / lookup service | Consistent-hash DHT, sharded over all nodes |
| Single point of failure | Master is a SPOF / bottleneck | No central metadata node |
| Metadata throughput | Bounded by the master | Scales with cluster size (each node holds ~1/N of the directory) |
| Data placement | Often copied into a managed pool | Stays on the producing node |
| Write path | Pool insert + coordination | Local memcpy + one small location record |
| Read path | Through the store/engine | One-sided RDMA READ, zero copy |
| Services to run | Master + workers | Embedded discovery only (no separate master) |
| Scaling | Re-scale the coordinator | Add a node → ring re-shards automatically |

Advantages

  • No metadata single point of failure or bottleneck. In a master-coordinated design, the metadata for every PUT/GET goes through the master. PeerCache shards the directory across all nodes, so metadata throughput grows with the cluster and there is no central hotspot.
  • Light write path with data locality. set() is a local memcpy plus a tiny directory record — no copy into a central pool.
  • Fewer moving parts to operate. Discovery is embedded and multi-master: every host runs it, with the discovery_addr head pinned and up to max_masters active. There is no separate master service to deploy, scale, or keep highly available, and no single metadata node to lose.
  • Horizontal scaling without a coordinator. New nodes grow both capacity and metadata throughput; membership changes re-shard the directory automatically.
  • Decentralized failure domain. Losing a node affects only its directory shard, not the whole metadata service, and directory_replicas (default 2) keeps that shard's location records available on another node.
  • Lean and SGLang-native. A compact C++ data plane + Python control plane, dropped in via the dynamic HiCache backend.
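
The scaling claims above (each node carries roughly 1/N of the directory, and adding a node re-shards only a small slice) are properties of consistent hashing in general, and can be checked with a small Python experiment. This is a sketch under stated assumptions, not PeerCache's ring: the SHA-256 hash, the 128 virtual nodes per host, and the names `build_ring`, `owner`, and `moved_fraction` are all chosen for illustration.

```python
import bisect
import hashlib
from collections import Counter


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash derived from SHA-256."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def build_ring(nodes, vnodes=128):
    """Return parallel lists (sorted point hashes, owning node per point)."""
    points = sorted((stable_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    return [h for h, _ in points], [n for _, n in points]


def owner(ring, key: str) -> str:
    """Owner is the first ring point clockwise from the key's hash."""
    hashes, owners = ring
    return owners[bisect.bisect_right(hashes, stable_hash(key)) % len(hashes)]


def moved_fraction(old_nodes, new_nodes, keys) -> float:
    """Fraction of keys whose directory owner changes between two memberships."""
    old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)
    moved = sum(owner(old_ring, k) != owner(new_ring, k) for k in keys)
    return moved / len(keys)


keys = [f"prefix:{i}" for i in range(20_000)]
nodes = [f"node-{i}" for i in range(8)]

# Share of directory records per node: each should sit near 1/8.
ring8 = build_ring(nodes)
shares = {n: c / len(keys) for n, c in Counter(owner(ring8, k) for k in keys).items()}

# Growing from 8 to 9 nodes should move roughly 1/9 of the records,
# and every moved record should land on the new node.
frac = moved_fraction(nodes, nodes + ["node-8"], keys)
print(f"max share per node: {max(shares.values()):.3f}, moved on join: {frac:.3f}")
```

A naive `hash(key) % N` placement would instead remap most keys whenever N changes, which is why a ring is the usual choice for a directory whose membership changes at runtime.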

What we deliberately give up

Being honest about the trade-offs of a fully decentralized design:

  • Maturity & ecosystem. Mooncake / LMCache are battle-tested at scale with richer eviction/tiering/observability and broad integrations. PeerCache is leaner and younger.
  • Global placement decisions. A central master can make smarter global eviction, placement, and load-balancing decisions; PeerCache decides locally and by hash.
  • Producer hotspots & data redundancy. KV bytes stay on the producing node, so a hot key can make that node a read hotspot, and data itself is not replicated by default — if the producer is down, that page is unavailable (the directory is replicated and a disk tier exists, but the KV bytes are not). A central pool spreads load and replicates data more easily.
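
The availability asymmetry in the last point (replicated directory, single-copy data) can be made concrete with a toy Python model. Everything here is hypothetical: the `Node` class, the `publish` and `fetch` helpers, and the status strings are invented for illustration, the replica holders are passed in by hand rather than chosen by a hash ring, and the disk tier mentioned above is not modeled.

```python
class Node:
    """Toy node: a liveness flag, a local KV pool, and a directory shard."""

    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.pool = {}       # KV bytes produced locally (single copy)
        self.directory = {}  # location records this node holds


def publish(producer, directory_holders, key, page):
    """Store bytes on the producer only; copy the location record to each holder."""
    producer.pool[key] = page
    for holder in directory_holders:
        holder.directory[key] = producer.name


def fetch(nodes_by_name, directory_holders, key):
    """Try each live directory replica, then read from the producer if it is up."""
    for holder in directory_holders:
        if holder.up and key in holder.directory:
            producer = nodes_by_name[holder.directory[key]]
            if not producer.up:
                return "located-but-unavailable", None
            return "hit", producer.pool[key]
    return "miss", None


a, b, c = Node("a"), Node("b"), Node("c")
nodes = {n.name: n for n in (a, b, c)}
publish(a, [b, c], "prefix:doc-42", b"kv-page")

b.up = False  # one directory replica is lost: the record is still found via c
assert fetch(nodes, [b, c], "prefix:doc-42") == ("hit", b"kv-page")

a.up = False  # the producer is lost: the record resolves, but the bytes are gone
assert fetch(nodes, [b, c], "prefix:doc-42") == ("located-but-unavailable", None)
```

For a prefix cache this failure mode costs a recompute rather than correctness, since the consumer can treat an unavailable page the same way as a miss.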

When to use PeerCache (and when not)

Best fit

  • Aggregated (non-PD) clusters with high prefix sharing — system prompts, few-shot, multi-turn chat history, RAG documents, agent contexts. PeerCache is the complete shared-cache layer here: no transfer engine, just plug it in.
  • Teams that want Mooncake-Store-like reuse without running a central master (P2P mode), or who prefer explicit storage servers (mode=centralized).

Complementary

  • PD-disaggregated clusters: add PeerCache on the prefill tier for cross-node prefix reuse (the tier where SGLang's HiCache runs in a PD deployment), while Mooncake/NIXL still handle the P→D handoff.

Prefer the alternatives when

  • You need a mature, feature-rich store with global scheduling and strong data redundancy, or
  • You are already running Mooncake for PD and want its Store for reuse too, or
  • Your workload has little prefix sharing (every request unique) — then any prefix cache, PeerCache included, adds little.

Decision guide

| Your situation | Recommendation |
|---|---|
| Want cross-node KV reuse, least complexity, no master | PeerCache P2P (aggregated mode) |
| Want dedicated KV pool servers, inference stays thin | PeerCache centralized (peercache-storage-server + mode=centralized) |
| Need P/D physical split for scaling / SLO | Mooncake/NIXL for handoff + PeerCache on prefill for reuse |
| Need global placement, rich features, strong data HA | A mature centralized store |
| Unique prompts, no shared prefixes | A KV reuse cache (any) won't help much |