Skip to content

Getting Started

PeerCache is the cross-node KV-cache transport for PD-disaggregated SGLang inference (separate prefill and decode workers). Prefill nodes publish KV pages; decode nodes read them back over RDMA. See Architecture for the data flow and copy counts.

Requirements

  • Python 3.9+
  • For the RDMA data plane: Linux with rdma-core / MLNX_OFED development headers (libibverbs, librdmacm), CMake ≥ 3.18, and a C++17 compiler.
  • For functional testing without RDMA: nothing extra — the pure-Python TCP fallback transport is used automatically.

Install

# Linux with RDMA NICs
pip install peercache            # once published to PyPI
# or from source
pip install git+https://github.com/flymysql/PeerCache.git

# Without RDMA (control plane + TCP fallback only, e.g. on a laptop / CI)
pip install -e . --config-settings=cmake.define.PEERCACHE_NO_RDMA=ON

PeerCache must be importable from the SGLang process:

python -c "import peercache; print(peercache.__version__)"

1. Pick the discovery head (embedded multi-master)

There is no separate meta process to launch, and no single point of failure. Pick one node's IP as the bootstrap head and set discovery_addr to it on every node. From there discovery is replicated automatically:

  • Every host runs the discovery service in-process.
  • The head is pinned as the primary master; as nodes join, the next hosts are promoted in hostname order to fill up to max_masters (default 3).
  • A non-head master that dies is replaced automatically; a cluster with fewer than max_masters hosts has all of them as masters. The registry is soft state, so a promoted/restarted master repopulates within one heartbeat.

So the only decision here is: which node's IP is the bootstrap head in discovery_addr. You may also give a comma-separated list of seeds ("ip1:31998,ip2:31998") so brand-new nodes can still bootstrap if the head is down.

Optional: to run a dedicated discovery host (e.g. a node that does not serve SGLang), start one with peercache-meta --bind 0.0.0.0:31998 and point discovery_addr at it.

2. Launch SGLang with the PeerCache backend

PeerCache plugs in through SGLang's dynamic backend mechanism — no SGLang source changes required. Use the same discovery_addr on all nodes.

# On the chosen discovery node, NODE0_IP is its own IP -> it hosts meta in-process.
# On every other node, the same NODE0_IP just points them at NODE0.
python -m sglang.launch_server \
  --model-path <model> \
  --enable-hierarchical-cache \
  --hicache-storage-backend dynamic \
  --hicache-storage-backend-extra-config '{
    "backend_name": "peercache",
    "module_path":  "peercache.store",
    "class_name":   "PeerCacheStore",
    "discovery_addr": "NODE0_IP:31998",
    "protocol": "rdma",
    "device_name": "mlx5_0",
    "ib_port": 1,
    "gid_index": 3,
    "global_segment_size": "8gb",
    "disk_enabled": true,
    "disk_path": "/data/peercache/",
    "disk_size": "100gb"
  }'

The disk tier (L4) is on by default. disk_path must be writable on every node (each uses a node_id subdirectory); point it at a large, fast local disk (NVMe). Set "disk_enabled": false to keep everything in the memory pool only. Total capacity per node ≈ global_segment_size (RAM) + disk_size (disk).

3. Centralized mode (optional — dedicated KV cache servers)

By default PeerCache is P2P: each SGLang node publishes KV into its own local pool and peers RDMA-read from the producer. Dedicated storage servers (peercache-storage-server) can run in the same cluster — use mode=hybrid (P2P + storage) or mode=centralized (storage only on inference nodes). Writes to storage use RDMA WRITE zero-copy (TCP fallback uses RPC copy when protocol=tcp).

Topology

  1. Launch one or more storage servers (no SGLang required):
peercache-storage-server \
  --discovery-addr NODE0_IP:31998 \
  --global-segment-size 64gb \
  --disk-path /data/peercache/ \
  --protocol rdma \
  --device-names mlx5_bond_1,mlx5_bond_2
  1. Launch SGLang inference workers (pick one mode):
# centralized: storage-only writes (no local published pool)
# hybrid:       storage writes + local P2P pool (same cluster as plain P2P nodes)
python -m sglang.launch_server \
  ... \
  --hicache-storage-backend-extra-config '{
    "backend_name": "peercache",
    "module_path":  "peercache.store",
    "class_name":   "PeerCacheStore",
    "discovery_addr": "NODE0_IP:31998",
    "mode": "hybrid",
    "protocol": "rdma",
    "device_name": "mlx5_0"
  }'

How it differs from P2P

P2P (default) Hybrid Centralized
KV bytes Stay on the producing inference node Storage servers (+ local pool copy) Live on storage servers
Storage servers Optional (ignored for writes) Same cluster — writes go to storage Required
Directory Sharded across all nodes Sharded across all nodes (unified lookup) Sharded across all nodes
Write path Local memcpy + directory PUT local: same as P2P · storage: RDMA WRITE · both: both RDMA WRITE to storage
Read path One-sided RDMA READ (unchanged) RDMA READ from storage or local pool RDMA READ from storage
Extra process None peercache-storage-server (optional) peercache-storage-server

Inference nodes do not allocate a local published pool in centralized mode; size global_segment_size on storage servers instead. Discovery is unchanged — storage and inference nodes share the same discovery_addr.

Deployment topology (PD disaggregation)

A typical PD-disaggregated cluster:

flowchart LR
    subgraph prefill [Prefill pool]
      PF0["node-0 (discovery head + master)"]
      PF1[node-1]
    end
    subgraph decode [Decode pool]
      DC0[node-2]
      DC1[node-3]
    end
    PF1 & DC0 & DC1 -. register/heartbeat .-> PF0
    DC0 & DC1 ==>|RDMA READ KV| PF0 & PF1

Rules of thumb:

  • Run the same PeerCache backend config on every prefill and decode node, with the same discovery_addr everywhere.
  • Pick one node's IP for discovery_addr (any reachable node — often a prefill node). It is the pinned head; every host runs a discovery master and up to max_masters are active, so there is no single meta to lose. Nothing else to launch.
  • Size global_segment_size to how much published KV each node should keep resident (it is sliced across tp_size); larger pool = higher hit rate, more host RAM pinned.
  • Use protocol: rdma in production; protocol: tcp only for functional testing.
  • All nodes must be able to reach each other on the RDMA and control ports (rdma_port / control_port, auto-assigned by default) plus the discovery port.

extra_config reference

Required keys (the dynamic-backend factory needs the first three):

key default meaning
backend_name must be peercache (required by the dynamic factory)
module_path peercache.store (required)
class_name PeerCacheStore (required)
discovery_addr bootstrap head host:port (or a comma-separated seed list), same on all nodes; the head is pinned as the primary discovery master, every host runs a master (required)

Deployment mode:

key default meaning
mode p2p p2p (default), hybrid (P2P + storage in one cluster), or centralized (inference nodes are storage clients)
write_policy local Hybrid only: local (default — P2P publish only), storage (RDMA WRITE to storage servers), or both (dual write: storage + local copy)
role auto inference (SGLang client), storage (KV server — use peercache-storage-server), or auto (infer from process)

RDMA / transport:

key default meaning
protocol rdma rdma (production) or tcp (fallback transport for testing)
device_name "" RDMA device, e.g. mlx5_0; empty = first active device
ib_port 1 HCA port
gid_index 3 GID index (RoCE v2 is typically 3)
max_channels_per_peer 16 max concurrent data-plane channels per peer (RDMA QP+CQ, or TCP sockets in fallback). Bounds parallel readers to one peer; extra threads briefly wait for a free channel

Capacity / placement:

key default meaning
global_segment_size 4gb published-pool (memory) size per node (accepts int or "8gb"/"512mb"; sliced across tp_size)
vnodes 160 virtual nodes per node on the consistent-hash ring
directory_replicas 2 replicate directory entries to N owners so a single node loss doesn't drop entries before the re-shard completes
directory_read_cache_ttl 0 cache resolved resident read locations for N seconds to skip the per-batch directory lookup on hot, static working sets (0 = off; invalidated on a read miss)
max_masters 3 number of discovery masters; the head plus the next hosts (hostname order) are masters, auto-promoted on failure (smaller clusters use all hosts)

Disk persistence tier (L4):

key default meaning
disk_enabled true spill evicted pages to disk and promote them back on read (degrades gracefully if disk_path can't be created)
disk_path /data/peercache/ data directory (each node uses a node_id subdir)
disk_size 100gb disk capacity per node (LRU-bounded; accepts int or "100gb")

Monitoring (metrics + dashboard):

key default meaning
metrics_enabled true run the metrics server (Prometheus /metrics + dashboard)
metrics_port 31997 metrics/dashboard HTTP port (disabled if already bound, e.g. co-located ranks)
metrics_bind_host 0.0.0.0 metrics server bind interface
metrics_dashboard true also serve the built-in HTML dashboard at /

Networking / identity (rarely need changing):

key default meaning
meta_bind_host 0.0.0.0 interface the embedded discovery master binds (every host runs one on the meta port)
local_hostname auto IP advertised to peers; auto-resolved as the local IP that can reach discovery_addr
rdma_bind_host 0.0.0.0 bind interface for the RDMA data plane
rdma_port 0 RDMA bootstrap port; 0 = auto-assign
control_bind_host 0.0.0.0 bind interface for the control RPC server
control_port 0 control RPC port; 0 = auto-assign
node_id auto stable node identifier; auto-generated from local_hostname + random suffix
heartbeat_interval 2.0 seconds between membership heartbeats
member_ttl 6.0 seconds before a silent node is pruned by a master

Persistence and monitoring

With the disk tier on (default), each node spills published pages to disk_path/<node_id>/ and promotes them back on read, so capacity is effectively memory (global_segment_size) + disk (disk_size). See Architecture → Disk persistence tier.

Each node also serves metrics by default:

# Prometheus scrape target
curl http://NODE_IP:31997/metrics
# Built-in dashboard in a browser
open http://NODE_IP:31997/

Point Prometheus at NODE_IP:31997 (or scrape every node) to chart hit rate, throughput, latency p50/p99, and pool/disk usage. See Architecture → Monitoring.

TCP fallback (no RDMA)

Set "protocol": "tcp" to validate the full discovery + directory + pool design without RDMA hardware. Data is still read remotely into the destination buffer, just over TCP instead of one-sided RDMA. Use this for functional testing only.

Run the tests

pip install pytest
pytest -q          # uses the TCP fallback; no RDMA hardware required