Getting Started¶
PeerCache is the cross-node KV-cache transport for PD-disaggregated SGLang inference (separate prefill and decode workers). Prefill nodes publish KV pages; decode nodes read them back over RDMA. See Architecture for the data flow and copy counts.
Requirements¶
- Python 3.9+
- For the RDMA data plane: Linux with
rdma-core/ MLNX_OFED development headers (libibverbs,librdmacm), CMake ≥ 3.18, and a C++17 compiler. - For functional testing without RDMA: nothing extra — the pure-Python TCP fallback transport is used automatically.
Install¶
# Linux with RDMA NICs
pip install peercache # once published to PyPI
# or from source
pip install git+https://github.com/flymysql/PeerCache.git
# Without RDMA (control plane + TCP fallback only, e.g. on a laptop / CI)
pip install -e . --config-settings=cmake.define.PEERCACHE_NO_RDMA=ON
PeerCache must be importable from the SGLang process:
1. Pick the discovery head (embedded multi-master)¶
There is no separate meta process to launch, and no single point of
failure. Pick one node's IP as the bootstrap head and set discovery_addr
to it on every node. From there discovery is replicated automatically:
- Every host runs the discovery service in-process.
- The head is pinned as the primary master; as nodes join, the next hosts are
promoted in hostname order to fill up to
max_masters(default 3). - A non-head master that dies is replaced automatically; a cluster with fewer than
max_mastershosts has all of them as masters. The registry is soft state, so a promoted/restarted master repopulates within one heartbeat.
So the only decision here is: which node's IP is the bootstrap head in
discovery_addr. You may also give a comma-separated list of seeds
("ip1:31998,ip2:31998") so brand-new nodes can still bootstrap if the head is
down.
Optional: to run a dedicated discovery host (e.g. a node that does not serve SGLang), start one with
peercache-meta --bind 0.0.0.0:31998and pointdiscovery_addrat it.
2. Launch SGLang with the PeerCache backend¶
PeerCache plugs in through SGLang's dynamic backend mechanism — no SGLang
source changes required. Use the same discovery_addr on all nodes.
# On the chosen discovery node, NODE0_IP is its own IP -> it hosts meta in-process.
# On every other node, the same NODE0_IP just points them at NODE0.
python -m sglang.launch_server \
--model-path <model> \
--enable-hierarchical-cache \
--hicache-storage-backend dynamic \
--hicache-storage-backend-extra-config '{
"backend_name": "peercache",
"module_path": "peercache.store",
"class_name": "PeerCacheStore",
"discovery_addr": "NODE0_IP:31998",
"protocol": "rdma",
"device_name": "mlx5_0",
"ib_port": 1,
"gid_index": 3,
"global_segment_size": "8gb",
"disk_enabled": true,
"disk_path": "/data/peercache/",
"disk_size": "100gb"
}'
The disk tier (L4) is on by default.
disk_pathmust be writable on every node (each uses anode_idsubdirectory); point it at a large, fast local disk (NVMe). Set"disk_enabled": falseto keep everything in the memory pool only. Total capacity per node ≈global_segment_size(RAM) +disk_size(disk).
3. Centralized mode (optional — dedicated KV cache servers)¶
By default PeerCache is P2P: each SGLang node publishes KV into its own local
pool and peers RDMA-read from the producer. Dedicated storage servers
(peercache-storage-server) can run in the same cluster — use
mode=hybrid (P2P + storage) or mode=centralized (storage only on inference
nodes). Writes to storage use RDMA WRITE zero-copy (TCP fallback uses RPC
copy when protocol=tcp).
Topology
- Launch one or more storage servers (no SGLang required):
peercache-storage-server \
--discovery-addr NODE0_IP:31998 \
--global-segment-size 64gb \
--disk-path /data/peercache/ \
--protocol rdma \
--device-names mlx5_bond_1,mlx5_bond_2
- Launch SGLang inference workers (pick one mode):
# centralized: storage-only writes (no local published pool)
# hybrid: storage writes + local P2P pool (same cluster as plain P2P nodes)
python -m sglang.launch_server \
... \
--hicache-storage-backend-extra-config '{
"backend_name": "peercache",
"module_path": "peercache.store",
"class_name": "PeerCacheStore",
"discovery_addr": "NODE0_IP:31998",
"mode": "hybrid",
"protocol": "rdma",
"device_name": "mlx5_0"
}'
How it differs from P2P
| P2P (default) | Hybrid | Centralized | |
|---|---|---|---|
| KV bytes | Stay on the producing inference node | Storage servers (+ local pool copy) | Live on storage servers |
| Storage servers | Optional (ignored for writes) | Same cluster — writes go to storage | Required |
| Directory | Sharded across all nodes | Sharded across all nodes (unified lookup) | Sharded across all nodes |
| Write path | Local memcpy + directory PUT | local: same as P2P · storage: RDMA WRITE · both: both |
RDMA WRITE to storage |
| Read path | One-sided RDMA READ (unchanged) | RDMA READ from storage or local pool | RDMA READ from storage |
| Extra process | None | peercache-storage-server (optional) |
peercache-storage-server |
Inference nodes do not allocate a local published pool in centralized mode;
size global_segment_size on storage servers instead. Discovery is unchanged —
storage and inference nodes share the same discovery_addr.
Deployment topology (PD disaggregation)¶
A typical PD-disaggregated cluster:
flowchart LR
subgraph prefill [Prefill pool]
PF0["node-0 (discovery head + master)"]
PF1[node-1]
end
subgraph decode [Decode pool]
DC0[node-2]
DC1[node-3]
end
PF1 & DC0 & DC1 -. register/heartbeat .-> PF0
DC0 & DC1 ==>|RDMA READ KV| PF0 & PF1
Rules of thumb:
- Run the same PeerCache backend config on every prefill and decode node, with
the same
discovery_addreverywhere. - Pick one node's IP for
discovery_addr(any reachable node — often a prefill node). It is the pinned head; every host runs a discovery master and up tomax_mastersare active, so there is no single meta to lose. Nothing else to launch. - Size
global_segment_sizeto how much published KV each node should keep resident (it is sliced acrosstp_size); larger pool = higher hit rate, more host RAM pinned. - Use
protocol: rdmain production;protocol: tcponly for functional testing. - All nodes must be able to reach each other on the RDMA and control ports
(
rdma_port/control_port, auto-assigned by default) plus the discovery port.
extra_config reference¶
Required keys (the dynamic-backend factory needs the first three):
| key | default | meaning |
|---|---|---|
backend_name |
— | must be peercache (required by the dynamic factory) |
module_path |
— | peercache.store (required) |
class_name |
— | PeerCacheStore (required) |
discovery_addr |
— | bootstrap head host:port (or a comma-separated seed list), same on all nodes; the head is pinned as the primary discovery master, every host runs a master (required) |
Deployment mode:
| key | default | meaning |
|---|---|---|
mode |
p2p |
p2p (default), hybrid (P2P + storage in one cluster), or centralized (inference nodes are storage clients) |
write_policy |
local |
Hybrid only: local (default — P2P publish only), storage (RDMA WRITE to storage servers), or both (dual write: storage + local copy) |
role |
auto |
inference (SGLang client), storage (KV server — use peercache-storage-server), or auto (infer from process) |
RDMA / transport:
| key | default | meaning |
|---|---|---|
protocol |
rdma |
rdma (production) or tcp (fallback transport for testing) |
device_name |
"" |
RDMA device, e.g. mlx5_0; empty = first active device |
ib_port |
1 |
HCA port |
gid_index |
3 |
GID index (RoCE v2 is typically 3) |
max_channels_per_peer |
16 |
max concurrent data-plane channels per peer (RDMA QP+CQ, or TCP sockets in fallback). Bounds parallel readers to one peer; extra threads briefly wait for a free channel |
Capacity / placement:
| key | default | meaning |
|---|---|---|
global_segment_size |
4gb |
published-pool (memory) size per node (accepts int or "8gb"/"512mb"; sliced across tp_size) |
vnodes |
160 |
virtual nodes per node on the consistent-hash ring |
directory_replicas |
2 |
replicate directory entries to N owners so a single node loss doesn't drop entries before the re-shard completes |
directory_read_cache_ttl |
0 |
cache resolved resident read locations for N seconds to skip the per-batch directory lookup on hot, static working sets (0 = off; invalidated on a read miss) |
max_masters |
3 |
number of discovery masters; the head plus the next hosts (hostname order) are masters, auto-promoted on failure (smaller clusters use all hosts) |
Disk persistence tier (L4):
| key | default | meaning |
|---|---|---|
disk_enabled |
true |
spill evicted pages to disk and promote them back on read (degrades gracefully if disk_path can't be created) |
disk_path |
/data/peercache/ |
data directory (each node uses a node_id subdir) |
disk_size |
100gb |
disk capacity per node (LRU-bounded; accepts int or "100gb") |
Monitoring (metrics + dashboard):
| key | default | meaning |
|---|---|---|
metrics_enabled |
true |
run the metrics server (Prometheus /metrics + dashboard) |
metrics_port |
31997 |
metrics/dashboard HTTP port (disabled if already bound, e.g. co-located ranks) |
metrics_bind_host |
0.0.0.0 |
metrics server bind interface |
metrics_dashboard |
true |
also serve the built-in HTML dashboard at / |
Networking / identity (rarely need changing):
| key | default | meaning |
|---|---|---|
meta_bind_host |
0.0.0.0 |
interface the embedded discovery master binds (every host runs one on the meta port) |
local_hostname |
auto | IP advertised to peers; auto-resolved as the local IP that can reach discovery_addr |
rdma_bind_host |
0.0.0.0 |
bind interface for the RDMA data plane |
rdma_port |
0 |
RDMA bootstrap port; 0 = auto-assign |
control_bind_host |
0.0.0.0 |
bind interface for the control RPC server |
control_port |
0 |
control RPC port; 0 = auto-assign |
node_id |
auto | stable node identifier; auto-generated from local_hostname + random suffix |
heartbeat_interval |
2.0 |
seconds between membership heartbeats |
member_ttl |
6.0 |
seconds before a silent node is pruned by a master |
Persistence and monitoring¶
With the disk tier on (default), each node spills published pages to
disk_path/<node_id>/ and promotes them back on read, so capacity is effectively
memory (global_segment_size) + disk (disk_size). See
Architecture → Disk persistence tier.
Each node also serves metrics by default:
# Prometheus scrape target
curl http://NODE_IP:31997/metrics
# Built-in dashboard in a browser
open http://NODE_IP:31997/
Point Prometheus at NODE_IP:31997 (or scrape every node) to chart hit rate,
throughput, latency p50/p99, and pool/disk usage. See
Architecture → Monitoring.
TCP fallback (no RDMA)¶
Set "protocol": "tcp" to validate the full discovery + directory + pool design
without RDMA hardware. Data is still read remotely into the destination buffer,
just over TCP instead of one-sided RDMA. Use this for functional testing only.