Skip to content

Demo: multi-node prefix-cache reuse

A step-by-step walkthrough that brings up 4 SGLang nodes sharing one prefix / KV cache through PeerCache (aggregated, non-PD) and then proves the cross-node cache hits with metrics.

What this demo shows

Several inference nodes computing a shared prefix only once and reusing its KV across the cluster over RDMA — no central master, no PD transfer engine. You will watch PeerCache's write_requests, read_hits and read_remote_hits counters climb.

Topology

Role Host Notes
Node 1 <NODE1_IP> the pinned discovery head (every host runs a master) and the router
Node 2 <NODE2_IP>
Node 3 <NODE3_IP>
Node 4 <NODE4_IP>
flowchart LR
    C[client / bench] --> R["router :8000<br/>(round-robin)"]
    R --> S1["SGLang :30000<br/>Node 1"]
    R --> S2["SGLang :30000<br/>Node 2"]
    R --> S3["SGLang :30000<br/>Node 3"]
    R --> S4["SGLang :30000<br/>Node 4"]
    S1 <== "one-sided RDMA READ<br/>(shared prefix KV)" ==> S2
    S3 <== "PeerCache" ==> S4

Prerequisites

  • 4 hosts with an RDMA NIC (RoCE/IB), mutually reachable. Note your device name(s), e.g. mlx5_0 or mlx5_bond_1..8 — and use the same order on every node (rails pair by index).
  • SGLang installed (with --enable-hierarchical-cache support).
  • A model and a tokenizer available at the same path on every node.
  • TCP reachability between nodes on the discovery port (31998), the server port (30000), the router port (8000) and the metrics port (31997).

Step 1 — install PeerCache on all 4 nodes

pip install -U peercache            # RDMA build (needs libibverbs/librdmacm)
python -c "from peercache import _peercache as m; assert m.HAS_RDMA, 'expected RDMA build'"
ulimit -l unlimited                 # allow pinning RDMA memory

Step 2 — set the shared variables on every node

Set these in each node's shell. Only SELF differs per node.

MODEL=/path/to/your/model                 # same path on every node
DEVS=mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8
DISC=<NODE1_IP>:31998                      # SAME on all nodes; Node 1 is the pinned discovery head
SELF=<this node's IP>                      # e.g. <NODE1_IP> on node 1, <NODE2_IP> on node 2 ...
TP=1                                       # raise if the model needs >1 GPU

PC='{"backend_name":"peercache","module_path":"peercache.store","class_name":"PeerCacheStore","discovery_addr":"'$DISC'","protocol":"rdma","device_names":"'$DEVS'","local_hostname":"'$SELF'","global_segment_size":"16gb","disk_path":"/data/peercache/","disk_size":"100gb"}'

Single NIC?

Replace "device_names":"'$DEVS'" with "device_name":"mlx5_0" (and skip DEVS). Multi-rail (device_names) stripes across NICs for more bandwidth.

Step 3 — start one SGLang server per node

Run this on all four nodes (start Node 1 first — it is the discovery head that the others bootstrap from).

pkill -9 -f sglang; sleep 2                       # free any stale GPU memory
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
  --model-path $MODEL --tp-size $TP --trust-remote-code \
  --host 0.0.0.0 --port 30000 \
  --enable-hierarchical-cache \
  --hicache-write-policy write_through \
  --hicache-ratio 1.2 \
  --hicache-storage-backend dynamic \
  --hicache-storage-backend-extra-config "$PC"

Why these flags matter for the demo:

  • --hicache-write-policy write_throughpublish KV to PeerCache as it is produced, instead of only on host-cache eviction. Without this the L3 cache often stays empty (write_requests=0) when the host tier is large.
  • --hicache-ratio 1.2 — keep the host (L2) tier small so pages actually flow down to the PeerCache (L3) tier.

In each server log you should see PeerCache come up:

This host runs an embedded PeerCache discovery master on 0.0.0.0:31998 (seeds=['<NODE1_IP>:31998'], max_masters=3).   (every node)
PeerCacheStore up: node=<ip>-xxxx rdma=<ip>:<port> control=<ip>:<port> discovery=<NODE1_IP>:31998
PeerCacheStore registered MRs: recv=... bytes, pool=17179869184 bytes

Must NOT say 'using TCP fallback'

If you see RDMA transport unavailable ... using TCP fallback, RDMA didn't initialise — fix the device name / GID before continuing (TCP is functional but not a performance path). Also confirm the registered MRs ... pool=... line appears; if it's missing, PeerCache never received the host pool.

Step 4 — start the router on Node 1

python -m sglang_router.launch_server \
  --worker-urls http://<NODE1_IP>:30000 http://<NODE2_IP>:30000 \
                http://<NODE3_IP>:30000 http://<NODE4_IP>:30000 \
  --host 0.0.0.0 --port 8000 \
  --policy round_robin

Why round-robin for the demo

round_robin spreads requests that share a prefix across different nodes, so node B must fetch node A's prefix KV through PeerCache — exactly the cross-node path we want to observe (read_remote_hits). In production you'd usually pick cache_aware for the best latency (it keeps reuse local); PeerCache then mainly serves spillover and rebalancing.

sglang_router flags vary by version — check python -m sglang_router.launch_server --help.

Step 5 — drive a prefix-heavy workload

The default ShareGPT prompts barely share prefixes, so use SGLang's generated-shared-prefix workload: groups of requests that share a long system prompt — the ideal prefix-cache test.

python -m sglang.bench_serving --backend sglang \
  --host <NODE1_IP> --port 8000 --model $MODEL \
  --dataset-name generated-shared-prefix \
  --gsp-num-groups 64 --gsp-prompts-per-group 16 \
  --gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
  --num-prompts 1024 --request-rate 8

(With your own dataset, e.g. ShareGPT, use --dataset-name sharegpt --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json; just expect lower hit rates because the prefixes overlap less.)

Step 6 — confirm the cache hits

Scrape every node's metrics and look at the PeerCache counters:

for ip in <NODE1_IP> <NODE2_IP> <NODE3_IP> <NODE4_IP>; do
  echo "== $ip =="
  curl -s http://$ip:31997/metrics | grep -E \
    'peercache_(members|write_requests|read_requests|read_hits|read_remote_hits|bytes_read|pool_keys)\b'
done

What "working" looks like:

Counter Meaning Expected
members nodes in the ring 4
write_requests / pool_keys KV pages published to L3 > 0 and growing
read_requests / read_hits L3 lookups that hit > 0
read_remote_hits hits served from another node over RDMA > 0 ← the cross-node win
bytes_read bytes pulled over RDMA > 0
rdma_read_timeouts / rdma_channel_discards data-plane errors 0
rdma_read_wc_errors / rdma_last_wc_status READs that completed with an error 0 / 0

A second pass of the same workload should show a higher hit rate (the prefixes are now cached cluster-wide).

Verifying the benefit (A/B)

To quantify what PeerCache buys you, run the same workload twice:

  1. Baseline — start the servers without the three --hicache-* flags (plain local radix cache only), run Step 5, record TTFT / throughput.
  2. With PeerCache — the Step 3 command above, run Step 5 again.

Compare median/P99 TTFT and input-token throughput: shared prefixes that hit the cluster cache skip prefill recompute, lowering TTFT and raising throughput.

Troubleshooting

Symptom Cause / fix
Log shows using TCP fallback RDMA didn't init — wrong device_name / gid_index; check ibv_devinfo, show_gids.
All PeerCache counters stay 0 No L3 traffic. Add --hicache-write-policy write_through, lower --hicache-ratio, and use a shared-prefix workload (Step 5).
pool_capacity_bytes = 0 / no registered MRs line The host pool wasn't registered with the backend — confirm --enable-hierarchical-cache and the dynamic backend config; check the startup log.
read_remote_hits stays 0 (but read_hits > 0) Reuse is staying local — switch the router to --policy round_robin so prefixes spread across nodes.
timed out waiting for the producer / ring < 4 Discovery unreachable — open TCP 31998 between nodes; ensure all use the same discovery_addr.
CUDA OOM at load Stale process on the GPU — pkill -9 -f sglang; nvidia-smi; or raise --tp-size.
rdma_read_timeouts climbing Fabric/GID/loopback issue — verify RoCEv2 GID and cross-node RDMA (ib_read_bw).
read_failures high, rdma_read_wc_errors > 0 Cross-node READs complete with an error. Check rdma_last_wc_status: 10 (remote access error) = bad rkey/MR/bounds; 12/13 (RNR/retry-exceeded) = GID/MTU/path — fix gid_index (RoCEv2), verify ib_read_bw between the two nodes. The exact ibv_wc_status_str is also printed to the server log.
read_failures high but rdma_read_wc_errors = 0 and rdma_read_timeouts = 0 The READ never reached the wire. Check rdma_local_reg_misses (local destination outside a registered MR — the read buffer isn't part of the registered host KV pool), rdma_post_failures, rdma_lease_failures.

Recap

install peercache → set vars (SELF per node) → start 4 servers (Node 1 first)
→ start round-robin router on Node 1 → run generated-shared-prefix load
→ watch read_remote_hits / write_requests on :31997/metrics

That cross-node read_remote_hits is PeerCache's core value: compute a prefix once, reuse it everywhere over RDMA. See Positioning & comparison for when this pays off.