Demo: multi-node prefix-cache reuse¶

A step-by-step walkthrough that brings up 4 SGLang nodes sharing one prefix / KV cache through PeerCache (aggregated, non-PD) and then proves the cross-node cache hits with metrics.

What this demo shows

Several inference nodes computing a shared prefix only once and reusing its KV across the cluster over RDMA — no central master, no PD transfer engine. You will watch PeerCache's write_requests, read_hits and read_remote_hits counters climb.

Topology¶

Role	Host	Notes
Node 1	`<NODE1_IP>`	the pinned discovery head (every host runs a master) and the router
Node 2	`<NODE2_IP>`
Node 3	`<NODE3_IP>`
Node 4	`<NODE4_IP>`

flowchart LR
    C[client / bench] --> R["router :8000<br/>(round-robin)"]
    R --> S1["SGLang :30000<br/>Node 1"]
    R --> S2["SGLang :30000<br/>Node 2"]
    R --> S3["SGLang :30000<br/>Node 3"]
    R --> S4["SGLang :30000<br/>Node 4"]
    S1 <== "one-sided RDMA READ<br/>(shared prefix KV)" ==> S2
    S3 <== "PeerCache" ==> S4

Prerequisites¶

4 hosts with an RDMA NIC (RoCE/IB), mutually reachable. Note your device name(s), e.g. mlx5_0 or mlx5_bond_1..8 — and use the same order on every node (rails pair by index).
SGLang installed (with --enable-hierarchical-cache support).
A model and a tokenizer available at the same path on every node.
TCP reachability between nodes on the discovery port (31998), the server port (30000), the router port (8000) and the metrics port (31997).

Step 1 — install PeerCache on all 4 nodes¶

pip install -U peercache            # RDMA build (needs libibverbs/librdmacm)
python -c "from peercache import _peercache as m; assert m.HAS_RDMA, 'expected RDMA build'"
ulimit -l unlimited                 # allow pinning RDMA memory

Step 2 — set the shared variables on every node¶

Set these in each node's shell. Only SELF differs per node.

MODEL=/path/to/your/model                 # same path on every node
DEVS=mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8
DISC=<NODE1_IP>:31998                      # SAME on all nodes; Node 1 is the pinned discovery head
SELF=<this node's IP>                      # e.g. <NODE1_IP> on node 1, <NODE2_IP> on node 2 ...
TP=1                                       # raise if the model needs >1 GPU

PC='{"backend_name":"peercache","module_path":"peercache.store","class_name":"PeerCacheStore","discovery_addr":"'$DISC'","protocol":"rdma","device_names":"'$DEVS'","local_hostname":"'$SELF'","global_segment_size":"16gb","disk_path":"/data/peercache/","disk_size":"100gb"}'

Single NIC?

Replace "device_names":"'$DEVS'" with "device_name":"mlx5_0" (and skip DEVS). Multi-rail (device_names) stripes across NICs for more bandwidth.

Step 3 — start one SGLang server per node¶

Run this on all four nodes (start Node 1 first — it is the discovery head that the others bootstrap from).

pkill -9 -f sglang; sleep 2                       # free any stale GPU memory
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
  --model-path $MODEL --tp-size $TP --trust-remote-code \
  --host 0.0.0.0 --port 30000 \
  --enable-hierarchical-cache \
  --hicache-write-policy write_through \
  --hicache-ratio 1.2 \
  --hicache-storage-backend dynamic \
  --hicache-storage-backend-extra-config "$PC"

Why these flags matter for the demo:

--hicache-write-policy write_through — publish KV to PeerCache as it is produced, instead of only on host-cache eviction. Without this the L3 cache often stays empty (write_requests=0) when the host tier is large.
--hicache-ratio 1.2 — keep the host (L2) tier small so pages actually flow down to the PeerCache (L3) tier.

In each server log you should see PeerCache come up:

This host runs an embedded PeerCache discovery master on 0.0.0.0:31998 (seeds=['<NODE1_IP>:31998'], max_masters=3).   (every node)
PeerCacheStore up: node=<ip>-xxxx rdma=<ip>:<port> control=<ip>:<port> discovery=<NODE1_IP>:31998
PeerCacheStore registered MRs: recv=... bytes, pool=17179869184 bytes

Must NOT say 'using TCP fallback'

If you see RDMA transport unavailable ... using TCP fallback, RDMA didn't initialise — fix the device name / GID before continuing (TCP is functional but not a performance path). Also confirm the registered MRs ... pool=... line appears; if it's missing, PeerCache never received the host pool.

Step 4 — start the router on Node 1¶

python -m sglang_router.launch_server \
  --worker-urls http://<NODE1_IP>:30000 http://<NODE2_IP>:30000 \
                http://<NODE3_IP>:30000 http://<NODE4_IP>:30000 \
  --host 0.0.0.0 --port 8000 \
  --policy round_robin

Why round-robin for the demo

round_robin spreads requests that share a prefix across different nodes, so node B must fetch node A's prefix KV through PeerCache — exactly the cross-node path we want to observe (read_remote_hits). In production you'd usually pick cache_aware for the best latency (it keeps reuse local); PeerCache then mainly serves spillover and rebalancing.

sglang_router flags vary by version — check python -m sglang_router.launch_server --help.

Step 5 — drive a prefix-heavy workload¶

The default ShareGPT prompts barely share prefixes, so use SGLang's generated-shared-prefix workload: groups of requests that share a long system prompt — the ideal prefix-cache test.

python -m sglang.bench_serving --backend sglang \
  --host <NODE1_IP> --port 8000 --model $MODEL \
  --dataset-name generated-shared-prefix \
  --gsp-num-groups 64 --gsp-prompts-per-group 16 \
  --gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
  --num-prompts 1024 --request-rate 8

(With your own dataset, e.g. ShareGPT, use --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json; just expect lower hit rates because the prefixes overlap less.)

Step 6 — confirm the cache hits¶

Scrape every node's metrics and look at the PeerCache counters:

for ip in <NODE1_IP> <NODE2_IP> <NODE3_IP> <NODE4_IP>; do
  echo "== $ip =="
  curl -s http://$ip:31997/metrics | grep -E \
    'peercache_(members|write_requests|read_requests|read_hits|read_remote_hits|bytes_read|pool_keys)\b'
done

What "working" looks like:

Counter	Meaning	Expected
`members`	nodes in the ring	4
`write_requests` / `pool_keys`	KV pages published to L3	> 0 and growing
`read_requests` / `read_hits`	L3 lookups that hit	> 0
`read_remote_hits`	hits served from another node over RDMA	> 0 ← the cross-node win
`bytes_read`	bytes pulled over RDMA	> 0
`rdma_read_timeouts` / `rdma_channel_discards`	data-plane errors	0
`rdma_read_wc_errors` / `rdma_last_wc_status`	READs that completed with an error	0 / 0

A second pass of the same workload should show a higher hit rate (the prefixes are now cached cluster-wide).

Verifying the benefit (A/B)¶

To quantify what PeerCache buys you, run the same workload twice:

Baseline — start the servers without the three --hicache-* flags (plain local radix cache only), run Step 5, record TTFT / throughput.
With PeerCache — the Step 3 command above, run Step 5 again.

Compare median/P99 TTFT and input-token throughput: shared prefixes that hit the cluster cache skip prefill recompute, lowering TTFT and raising throughput.

Troubleshooting¶

Symptom	Cause / fix
Log shows `using TCP fallback`	RDMA didn't init — wrong `device_name` / `gid_index`; check `ibv_devinfo`, `show_gids`.
All PeerCache counters stay `0`	No L3 traffic. Add `--hicache-write-policy write_through`, lower `--hicache-ratio`, and use a shared-prefix workload (Step 5).
`pool_capacity_bytes = 0` / no `registered MRs` line	The host pool wasn't registered with the backend — confirm `--enable-hierarchical-cache` and the `dynamic` backend config; check the startup log.
`read_remote_hits` stays `0` (but `read_hits` > 0)	Reuse is staying local — switch the router to `--policy round_robin` so prefixes spread across nodes.
`timed out waiting for the producer` / ring < 4	Discovery unreachable — open TCP `31998` between nodes; ensure all use the same `discovery_addr`.
CUDA OOM at load	Stale process on the GPU — `pkill -9 -f sglang; nvidia-smi`; or raise `--tp-size`.
`rdma_read_timeouts` climbing	Fabric/GID/loopback issue — verify RoCEv2 GID and cross-node RDMA (`ib_read_bw`).
`read_failures` high, `rdma_read_wc_errors` > 0	Cross-node READs complete with an error. Check `rdma_last_wc_status`: 10 (remote access error) = bad rkey/MR/bounds; 12/13 (RNR/retry-exceeded) = GID/MTU/path — fix `gid_index` (RoCEv2), verify `ib_read_bw` between the two nodes. The exact `ibv_wc_status_str` is also printed to the server log.
`read_failures` high but `rdma_read_wc_errors` = 0 and `rdma_read_timeouts` = 0	The READ never reached the wire. Check `rdma_local_reg_misses` (local destination outside a registered MR — the read buffer isn't part of the registered host KV pool), `rdma_post_failures`, `rdma_lease_failures`.

Recap¶

install peercache → set vars (SELF per node) → start 4 servers (Node 1 first)
→ start round-robin router on Node 1 → run generated-shared-prefix load
→ watch read_remote_hits / write_requests on :31997/metrics

That cross-node read_remote_hits is PeerCache's core value: compute a prefix once, reuse it everywhere over RDMA. See Positioning & comparison for when this pays off.