Positioning & comparison¶
This page is the "should I use PeerCache?" guide: what it is, how it differs from other KV caches, what it deliberately gives up, and where it fits.
What PeerCache is¶
PeerCache is a decentralized, peer-to-peer, RDMA zero-copy L3 (HiCache) storage backend for SGLang. Its one job is cross-request, cross-node KV (prefix) cache reuse: a producing node publishes KV pages into its own local pool plus a tiny location record into a consistent-hash directory sharded across all nodes; any node looks the key up and pulls the bytes with a one-sided RDMA READ straight into its registered buffer.
- No central master, no managed data pool. The directory is a DHT; the KV bytes stay on the node that produced them.
- Plugs into SGLang as `--hicache-storage-backend dynamic`, with no SGLang patch.
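The publish and lookup flow described above can be sketched in a few lines of Python. This is a toy, in-process model and not PeerCache's implementation: the `Cluster`, `Node`, and `Location` names, the record fields, and the virtual-node count are all assumptions made for illustration, and a plain byte-slice copy stands in for the one-sided RDMA READ.

```python
import bisect
import hashlib
from dataclasses import dataclass


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


@dataclass(frozen=True)
class Location:
    """Hypothetical location record: where a KV page lives, not its bytes."""
    node: str
    offset: int
    length: int


class Node:
    def __init__(self, name: str):
        self.name = name
        self.pool = bytearray()  # local KV pool (stand-in for registered memory)
        self.directory = {}      # this node's shard of the directory


class Cluster:
    def __init__(self, names, vnodes: int = 64):
        self.nodes = {n: Node(n) for n in names}
        # Each node owns several ring points to even out shard sizes.
        ring = sorted((_hash(f"{n}#{i}"), n) for n in names for i in range(vnodes))
        self.points = [p for p, _ in ring]
        self.owners = [n for _, n in ring]

    def directory_owner(self, key: str) -> Node:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self.points, _hash(key)) % len(self.points)
        return self.nodes[self.owners[idx]]

    def publish(self, producer: str, key: str, page: bytes) -> Location:
        node = self.nodes[producer]
        loc = Location(producer, len(node.pool), len(page))
        node.pool += page                               # bytes stay on the producer
        self.directory_owner(key).directory[key] = loc  # only the small record moves
        return loc

    def fetch(self, key: str):
        loc = self.directory_owner(key).directory.get(key)
        if loc is None:
            return None                                 # cache miss
        pool = self.nodes[loc.node].pool
        # Stand-in for the one-sided RDMA READ from the producer's pool.
        return bytes(pool[loc.offset:loc.offset + loc.length])


cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.publish("node-a", "prefix:system-prompt-v1", b"kv-page-bytes")
print(cluster.fetch("prefix:system-prompt-v1"))  # b'kv-page-bytes'
print(cluster.fetch("prefix:unknown"))           # None
```

Note that the directory shard holding the record and the node holding the bytes are chosen independently: the first by hashing the key, the second by whichever node produced the page.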
What PeerCache is not¶
- Not a PD transfer engine. It does not move the per-request prefill→decode KV handoff; that latency-critical GPU→GPU path is what Mooncake / NIXL do via `--disaggregation-transfer-backend`. PeerCache is orthogonal to it.
- Not a centralized store (by default). P2P mode has no master and no managed data pool. Set `mode=centralized` to run dedicated storage servers (`peercache-storage-server`) that hold KV bytes and directory shards while inference nodes act as clients: a Mooncake Store–like layout without a separate metadata master.
Two orthogonal axes — don't conflate them¶
| | KV / prefix reuse (PeerCache) | PD P→D handoff (Mooncake/NIXL) |
|---|---|---|
| Scope | Across requests / nodes | Within one request |
| Goal | Skip recomputing shared prefixes | Hand prefill's KV to decode |
| Latency | Cache-style; host staging OK | Latency-critical; GPU→GPU direct |
| SGLang knob | `--hicache-storage-backend` | `--disaggregation-transfer-backend` |
A PD cluster typically uses both: PeerCache for prefix reuse on the prefill tier, and Mooncake/NIXL for the P→D handoff.
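To make the two axes concrete, the sketch below maps a node role to the flag for each axis. It is illustrative only: the `sglang_flags` helper and the role labels are hypothetical, the spelling of the transfer-backend values is an assumption, and a real launch command needs more flags than the two shown here.

```python
def sglang_flags(role: str, pd_backend: str = "mooncake") -> list[str]:
    """Return only the two flags discussed on this page for a node role.

    role: "aggregated", "prefill", or "decode" (illustrative labels).
    pd_backend: value for --disaggregation-transfer-backend; the exact
    spelling ("mooncake", "nixl") is an assumption in this sketch.
    """
    if role not in {"aggregated", "prefill", "decode"}:
        raise ValueError(f"unknown role: {role}")
    flags: list[str] = []
    # Axis 1: cross-request prefix reuse. In a PD cluster this sits on
    # the prefill tier, as described above.
    if role in {"aggregated", "prefill"}:
        flags += ["--hicache-storage-backend", "dynamic"]
    # Axis 2: per-request prefill-to-decode handoff, PD clusters only.
    if role in {"prefill", "decode"}:
        flags += ["--disaggregation-transfer-backend", pd_backend]
    return flags


for role in ("aggregated", "prefill", "decode"):
    print(role, " ".join(sglang_flags(role)))
```

Only the prefill role carries both flags, which is the "uses both" case; an aggregated node carries only the reuse flag.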
How PeerCache compares to centralized KV caches¶
Compared with master-coordinated / centralized-metadata KV stores (e.g. Mooncake Store, LMCache in distributed mode):
| Dimension | Centralized stores | PeerCache |
|---|---|---|
| Metadata | Central master / lookup service | Consistent-hash DHT, sharded over all nodes |
| Single point of failure | Master is a SPOF / bottleneck | No central metadata node |
| Metadata throughput | Bounded by the master | Scales with cluster size (each node serves ~1/N of the directory) |
| Data placement | Often copied into a managed pool | Stays on the producing node |
| Write path | Pool insert + coordination | Local memcpy + one small location record |
| Read path | Through the store/engine | One-sided RDMA READ, zero copy |
| Services to run | Master + workers | Embedded discovery only (no separate master) |
| Scaling | Re-scale the coordinator | Add a node → ring re-shards automatically |
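The "~1/N per node" and "ring re-shards automatically" rows follow from how consistent hashing behaves in general. The sketch below is a generic consistent-hash ring, not PeerCache's code; the virtual-node count, key names, and cluster sizes are assumptions. It measures how directory keys spread over nodes and how many keys change owner when a ninth node joins.

```python
import bisect
import hashlib
from collections import Counter


def _hash(value: str) -> int:
    """Stable 64-bit hash so results are reproducible across runs."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def build_ring(nodes, vnodes=128):
    """Return (sorted ring points, owning node for each point)."""
    ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    return [p for p, _ in ring], [n for _, n in ring]


def owner(ring, key):
    points, owners = ring
    return owners[bisect.bisect_right(points, _hash(key)) % len(points)]


keys = [f"prefix-{i}" for i in range(20000)]
before = build_ring([f"node-{i}" for i in range(8)])
after = build_ring([f"node-{i}" for i in range(9)])

# Share of the directory each node serves after the join (ideal: 1/9).
shares = Counter(owner(after, k) for k in keys)
min_share = min(shares.values()) / len(keys)
max_share = max(shares.values()) / len(keys)

# Keys whose directory owner changed because node-8 joined.
moved = [k for k in keys if owner(before, k) != owner(after, k)]
moved_fraction = len(moved) / len(keys)
only_to_new_node = all(owner(after, k) == "node-8" for k in moved)

print(f"per-node share: {min_share:.1%} to {max_share:.1%}")
print(f"keys remapped by the join: {moved_fraction:.1%}")
print(f"all remapped keys went to the new node: {only_to_new_node}")
```

Roughly one ninth of the keys move, and they move only to the new node; the other eight shards are otherwise untouched. A coordinator-based design would instead have to rebalance through the master.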
Advantages¶
- No metadata single point of failure or bottleneck. Every PUT/GET in a centralized design hits the master; PeerCache shards the directory, so metadata throughput grows with the cluster and there is no central hotspot.
- Light write path with data locality. `set()` is a local memcpy plus a tiny directory record, with no copy into a central pool.
- Fewer moving parts to operate. Discovery is embedded and multi-master: every host runs it, with the `discovery_addr` head pinned and up to `max_masters` active. There is no master to deploy, scale, or keep highly available, and no single metadata node to lose.
- Horizontal scaling without a coordinator. New nodes grow both capacity and metadata throughput; membership changes re-shard the directory automatically.
- Decentralized failure domain. Losing a node loses only its shard, not the whole metadata service; `directory_replicas` (default 2) keeps a replica.
- Lean and SGLang-native. A compact C++ data plane + Python control plane, dropped in via the `dynamic` HiCache backend.
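The failure-domain point can be illustrated with a small replicated directory. This sketch assumes `directory_replicas=2` means two copies of each record in total, placed on the first two distinct nodes clockwise from the key's hash; PeerCache's actual placement rule is not specified on this page, and the `Directory` class and record fields are hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class Directory:
    """Toy replicated directory over a consistent-hash ring."""

    def __init__(self, nodes, directory_replicas=2, vnodes=64):
        self.replicas = directory_replicas
        self.shards = {n: {} for n in nodes}  # live nodes and their shards
        ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in ring]
        self.ring_nodes = [n for _, n in ring]

    def owners(self, key):
        """First `replicas` distinct live nodes clockwise from the key."""
        start = bisect.bisect_right(self.points, _hash(key))
        found = []
        for step in range(len(self.points)):
            node = self.ring_nodes[(start + step) % len(self.points)]
            if node in self.shards and node not in found:
                found.append(node)
                if len(found) == self.replicas:
                    break
        return found

    def put(self, key, record):
        for node in self.owners(key):
            self.shards[node][key] = record

    def get(self, key):
        for node in self.owners(key):
            if key in self.shards[node]:
                return self.shards[node][key]
        return None

    def fail(self, node):
        self.shards.pop(node)  # the node and its directory shard are gone


names = [f"node-{i}" for i in range(4)]
keys = [f"prefix-{i}" for i in range(1000)]
record = {"node": "node-0", "offset": 0, "length": 4096}

directory = Directory(names)  # two copies of every record
for k in keys:
    directory.put(k, record)
directory.fail("node-1")
reachable = sum(directory.get(k) is not None for k in keys)
print(reachable)  # 1000: every record still has a surviving copy
```

This covers the directory only. The KV bytes themselves are a separate matter: a record that survives can still point at a producer that is down.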
What we deliberately give up¶
Being honest about the trade-offs of a fully decentralized design:
- Maturity & ecosystem. Mooncake / LMCache are battle-tested at scale with richer eviction/tiering/observability and broad integrations. PeerCache is leaner and younger.
- Global placement decisions. A central master can make smarter global eviction, placement, and load-balancing decisions; PeerCache decides locally and by hash, without a global view.
- Producer hotspots & data redundancy. KV bytes stay on the producing node, so a hot key can make that node a read hotspot, and data itself is not replicated by default — if the producer is down, that page is unavailable (the directory is replicated and a disk tier exists, but the KV bytes are not). A central pool spreads load and replicates data more easily.
When to use PeerCache (and when not)¶
Best fit
- Aggregated (non-PD) clusters with high prefix sharing — system prompts, few-shot, multi-turn chat history, RAG documents, agent contexts. PeerCache is the complete shared-cache layer here: no transfer engine, just plug it in.
- Teams that want Mooncake-Store-like reuse without running a central master (P2P mode), or who prefer explicit storage servers (`mode=centralized`).
Complementary
- PD-disaggregated clusters: add PeerCache on the prefill tier for cross-node prefix reuse (where SGLang's HiCache lives in PD), while Mooncake/NIXL still handle the P→D handoff.
Prefer the alternatives when
- You need a mature, feature-rich store with global scheduling and strong data redundancy, or
- You are already running Mooncake for PD and want its Store for reuse too, or
- Your workload has little prefix sharing (every request unique) — then any prefix cache, PeerCache included, adds little.
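One rough way to check whether a workload has enough prefix sharing is to replay a sample of tokenized prompts and count how many tokens fall in pages that an earlier request already produced. The sketch below is an offline estimate under stated assumptions: the `shared_prefix_ratio` helper is hypothetical, the page size of 64 tokens is arbitrary, and pages are keyed by a hash chained over the whole prefix, which may differ from how HiCache keys pages.

```python
import hashlib


def shared_prefix_ratio(requests, page_size=64):
    """Fraction of prompt tokens in full pages whose entire prefix was
    already seen in an earlier request (an upper bound on reuse)."""
    seen = set()
    total = reused = 0
    for tokens in requests:
        total += len(tokens)
        digest = b""
        # Walk full pages only; a trailing partial page is never reused.
        for start in range(0, len(tokens) - page_size + 1, page_size):
            page = tokens[start:start + page_size]
            # Chain the hash so a page matches only if its whole prefix matches.
            digest = hashlib.sha256(digest + repr(page).encode()).digest()
            if digest in seen:
                reused += page_size
            else:
                seen.add(digest)
    return reused / total if total else 0.0


system_prompt = list(range(256))  # 4 pages shared by every request
shared = [system_prompt + [1000 + i] * 128 for i in range(4)]
unique = [[i] * 384 for i in range(4)]

print(shared_prefix_ratio(shared))  # 0.5
print(shared_prefix_ratio(unique))  # 0.0
```

A ratio near zero matches the "every request unique" case above, where a prefix cache has little to offer. The estimate ignores eviction and capacity, so real hit rates will be lower.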
Decision guide¶
| Your situation | Recommendation |
|---|---|
| Want cross-node KV reuse, least complexity, no master | PeerCache P2P (aggregated mode) |
| Want dedicated KV pool servers, inference stays thin | PeerCache centralized (`peercache-storage-server` + `mode=centralized`) |
| Need P/D physical split for scaling / SLO | Mooncake/NIXL for handoff + PeerCache on prefill for reuse |
| Need global placement, rich features, strong data HA | A mature centralized store |
| Unique prompts, no shared prefixes | No KV reuse cache, PeerCache included, will help much |