Getting Started¶

PeerCache is the cross-node KV-cache transport for PD-disaggregated SGLang inference (separate prefill and decode workers). Prefill nodes publish KV pages; decode nodes read them back over RDMA. See Architecture for the data flow and copy counts.

Requirements¶

Python 3.9+
For the RDMA data plane: Linux with rdma-core / MLNX_OFED development headers (libibverbs, librdmacm), CMake ≥ 3.18, and a C++17 compiler.
For functional testing without RDMA: nothing extra — the pure-Python TCP fallback transport is used automatically.

Install¶

# Linux with RDMA NICs
pip install peercache            # once published to PyPI
# or from source
pip install git+https://github.com/flymysql/PeerCache.git

# Without RDMA (control plane + TCP fallback only, e.g. on a laptop / CI)
pip install -e . --config-settings=cmake.define.PEERCACHE_NO_RDMA=ON

PeerCache must be importable from the SGLang process:

python -c "import peercache; print(peercache.__version__)"

1. Pick the discovery head (embedded multi-master)¶

There is no separate meta process to launch, and no single point of failure. Pick one node's IP as the bootstrap head and set discovery_addr to it on every node. From there discovery is replicated automatically:

Every host runs the discovery service in-process.
The head is pinned as the primary master; as nodes join, the next hosts are promoted in hostname order to fill up to max_masters (default 3).
A non-head master that dies is replaced automatically; a cluster with fewer than max_masters hosts has all of them as masters. The registry is soft state, so a promoted/restarted master repopulates within one heartbeat.

So the only decision here is: which node's IP is the bootstrap head in discovery_addr. You may also give a comma-separated list of seeds ("ip1:31998,ip2:31998") so brand-new nodes can still bootstrap if the head is down.

Optional: to run a dedicated discovery host (e.g. a node that does not serve SGLang), start one with peercache-meta --bind 0.0.0.0:31998 and point discovery_addr at it.

2. Launch SGLang with the PeerCache backend¶

PeerCache plugs in through SGLang's dynamic backend mechanism — no SGLang source changes required. Use the same discovery_addr on all nodes.

# On the chosen discovery node, NODE0_IP is its own IP -> it hosts meta in-process.
# On every other node, the same NODE0_IP just points them at NODE0.
python -m sglang.launch_server \
  --model-path <model> \
  --enable-hierarchical-cache \
  --hicache-storage-backend dynamic \
  --hicache-storage-backend-extra-config '{
    "backend_name": "peercache",
    "module_path":  "peercache.store",
    "class_name":   "PeerCacheStore",
    "discovery_addr": "NODE0_IP:31998",
    "protocol": "rdma",
    "device_name": "mlx5_0",
    "ib_port": 1,
    "gid_index": 3,
    "global_segment_size": "8gb",
    "disk_enabled": true,
    "disk_path": "/data/peercache/",
    "disk_size": "100gb"
  }'

The disk tier (L4) is on by default. disk_path must be writable on every node (each uses a node_id subdirectory); point it at a large, fast local disk (NVMe). Set "disk_enabled": false to keep everything in the memory pool only. Total capacity per node ≈ global_segment_size (RAM) + disk_size (disk).

3. Centralized mode (optional — dedicated KV cache servers)¶

By default PeerCache is P2P: each SGLang node publishes KV into its own local pool and peers RDMA-read from the producer. Dedicated storage servers (peercache-storage-server) can run in the same cluster — use mode=hybrid (P2P + storage) or mode=centralized (storage only on inference nodes). Writes to storage use RDMA WRITE zero-copy (TCP fallback uses RPC copy when protocol=tcp).

Topology

Launch one or more storage servers (no SGLang required):

peercache-storage-server \
  --discovery-addr NODE0_IP:31998 \
  --global-segment-size 64gb \
  --disk-path /data/peercache/ \
  --protocol rdma \
  --device-names mlx5_bond_1,mlx5_bond_2

Launch SGLang inference workers (pick one mode):

# centralized: storage-only writes (no local published pool)
# hybrid:       storage writes + local P2P pool (same cluster as plain P2P nodes)
python -m sglang.launch_server \
  ... \
  --hicache-storage-backend-extra-config '{
    "backend_name": "peercache",
    "module_path":  "peercache.store",
    "class_name":   "PeerCacheStore",
    "discovery_addr": "NODE0_IP:31998",
    "mode": "hybrid",
    "protocol": "rdma",
    "device_name": "mlx5_0"
  }'

How it differs from P2P

	P2P (default)	Hybrid	Centralized
KV bytes	Stay on the producing inference node	Storage servers (+ local pool copy)	Live on storage servers
Storage servers	Optional (ignored for writes)	Same cluster — writes go to storage	Required
Directory	Sharded across all nodes	Sharded across all nodes (unified lookup)	Sharded across all nodes
Write path	Local memcpy + directory PUT	`local`: same as P2P · `storage`: RDMA WRITE · `both`: both	RDMA WRITE to storage
Read path	One-sided RDMA READ (unchanged)	RDMA READ from storage or local pool	RDMA READ from storage
Extra process	None	`peercache-storage-server` (optional)	`peercache-storage-server`

Inference nodes do not allocate a local published pool in centralized mode; size global_segment_size on storage servers instead. Discovery is unchanged — storage and inference nodes share the same discovery_addr.

Deployment topology (PD disaggregation)¶

A typical PD-disaggregated cluster:

flowchart LR
    subgraph prefill [Prefill pool]
      PF0["node-0 (discovery head + master)"]
      PF1[node-1]
    end
    subgraph decode [Decode pool]
      DC0[node-2]
      DC1[node-3]
    end
    PF1 & DC0 & DC1 -. register/heartbeat .-> PF0
    DC0 & DC1 ==>|RDMA READ KV| PF0 & PF1

Rules of thumb:

Run the same PeerCache backend config on every prefill and decode node, with the same discovery_addr everywhere.
Pick one node's IP for discovery_addr (any reachable node — often a prefill node). It is the pinned head; every host runs a discovery master and up to max_masters are active, so there is no single meta to lose. Nothing else to launch.
Size global_segment_size to how much published KV each node should keep resident (it is sliced across tp_size); larger pool = higher hit rate, more host RAM pinned.
Use protocol: rdma in production; protocol: tcp only for functional testing.
All nodes must be able to reach each other on the RDMA and control ports (rdma_port / control_port, auto-assigned by default) plus the discovery port.

extra_config reference¶

Required keys (the dynamic-backend factory needs the first three):

key	default	meaning
`backend_name`	—	must be `peercache` (required by the dynamic factory)
`module_path`	—	`peercache.store` (required)
`class_name`	—	`PeerCacheStore` (required)
`discovery_addr`	—	bootstrap head `host:port` (or a comma-separated seed list), same on all nodes; the head is pinned as the primary discovery master, every host runs a master (required)

Deployment mode:

key	default	meaning
`mode`	`p2p`	`p2p` (default), `hybrid` (P2P + storage in one cluster), or `centralized` (inference nodes are storage clients)
`write_policy`	`local`	Hybrid only: `local` (default — P2P publish only), `storage` (RDMA WRITE to storage servers), or `both` (dual write: storage + local copy)
`role`	`auto`	`inference` (SGLang client), `storage` (KV server — use `peercache-storage-server`), or `auto` (infer from process)

RDMA / transport:

key	default	meaning
`protocol`	`rdma`	`rdma` (production) or `tcp` (fallback transport for testing)
`device_name`	`""`	RDMA device, e.g. `mlx5_0`; empty = first active device
`ib_port`	`1`	HCA port
`gid_index`	`3`	GID index (RoCE v2 is typically 3)
`max_channels_per_peer`	`16`	max concurrent data-plane channels per peer (RDMA QP+CQ, or TCP sockets in fallback). Bounds parallel readers to one peer; extra threads briefly wait for a free channel

Capacity / placement:

key	default	meaning
`global_segment_size`	`4gb`	published-pool (memory) size per node (accepts `int` or `"8gb"`/`"512mb"`; sliced across `tp_size`)
`vnodes`	`160`	virtual nodes per node on the consistent-hash ring
`directory_replicas`	`2`	replicate directory entries to N owners so a single node loss doesn't drop entries before the re-shard completes
`directory_read_cache_ttl`	`0`	cache resolved resident read locations for N seconds to skip the per-batch directory lookup on hot, static working sets (`0` = off; invalidated on a read miss)
`max_masters`	`3`	number of discovery masters; the head plus the next hosts (hostname order) are masters, auto-promoted on failure (smaller clusters use all hosts)

Disk persistence tier (L4):

key	default	meaning
`disk_enabled`	`true`	spill evicted pages to disk and promote them back on read (degrades gracefully if `disk_path` can't be created)
`disk_path`	`/data/peercache/`	data directory (each node uses a `node_id` subdir)
`disk_size`	`100gb`	disk capacity per node (LRU-bounded; accepts `int` or `"100gb"`)

Monitoring (metrics + dashboard):

key	default	meaning
`metrics_enabled`	`true`	run the metrics server (Prometheus `/metrics` + dashboard)
`metrics_port`	`31997`	metrics/dashboard HTTP port (disabled if already bound, e.g. co-located ranks)
`metrics_bind_host`	`0.0.0.0`	metrics server bind interface
`metrics_dashboard`	`true`	also serve the built-in HTML dashboard at `/`

Networking / identity (rarely need changing):

key	default	meaning
`meta_bind_host`	`0.0.0.0`	interface the embedded discovery master binds (every host runs one on the meta port)
`local_hostname`	auto	IP advertised to peers; auto-resolved as the local IP that can reach `discovery_addr`
`rdma_bind_host`	`0.0.0.0`	bind interface for the RDMA data plane
`rdma_port`	`0`	RDMA bootstrap port; `0` = auto-assign
`control_bind_host`	`0.0.0.0`	bind interface for the control RPC server
`control_port`	`0`	control RPC port; `0` = auto-assign
`node_id`	auto	stable node identifier; auto-generated from `local_hostname` + random suffix
`heartbeat_interval`	`2.0`	seconds between membership heartbeats
`member_ttl`	`6.0`	seconds before a silent node is pruned by a master

Persistence and monitoring¶

With the disk tier on (default), each node spills published pages to disk_path/<node_id>/ and promotes them back on read, so capacity is effectively memory (global_segment_size) + disk (disk_size). See Architecture → Disk persistence tier.

Each node also serves metrics by default:

# Prometheus scrape target
curl http://NODE_IP:31997/metrics
# Built-in dashboard in a browser
open http://NODE_IP:31997/

Point Prometheus at NODE_IP:31997 (or scrape every node) to chart hit rate, throughput, latency p50/p99, and pool/disk usage. See Architecture → Monitoring.

TCP fallback (no RDMA)¶

Set "protocol": "tcp" to validate the full discovery + directory + pool design without RDMA hardware. Data is still read remotely into the destination buffer, just over TCP instead of one-sided RDMA. Use this for functional testing only.

Run the tests¶

pip install pytest
pytest -q          # uses the TCP fallback; no RDMA hardware required