Changelog
This project adheres to Semantic Versioning.
[0.8.2] - 2026-06-04
Added
- Hybrid `write_policy`: `local` (default), `storage`, or `both`.
[0.8.1] - 2026-06-04
Added
- RDMA WRITE zero-copy path to storage servers (`data_prepare_writes` → `batch_write_multi` → `data_commit_writes`); RPC ingest fallback retained. `mode=hybrid` for P2P + storage servers in one cluster.
Changed
- Unified directory ring across all nodes; storage placement uses a storage ring.
[0.8.0] - 2026-06-04
Added
- Centralized mode (`mode=centralized`): dedicated KV cache servers via `peercache-storage-server`; inference nodes use `"mode": "centralized", "role": "inference"`. Writes via `data_ingest` RPC; reads via RDMA READ. New config: `mode`, `role`; `NodeInfo.role`; `storage_nodes` metric.
0.7.1 - 2026-06-02
Changed
- Multi-master discovery now pins the configured head (the `discovery_addr` host) as the primary master whenever it is alive, giving a stable, well-known bootstrap anchor, and fills the remaining master slots in hostname order as nodes join. If the head is down, the live hosts still fill all slots.
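The selection rule described in 0.7.0 and refined in 0.7.1 can be sketched as a pure function. This is an illustration only, not the project's code: `select_masters` and its parameters are hypothetical names, and plain lexicographic string sorting stands in for "hostname order", which the changelog does not define precisely.

```python
from typing import Iterable, List, Optional


def select_masters(live_hosts: Iterable[str],
                   head: Optional[str] = None,
                   max_masters: int = 3) -> List[str]:
    """Return the active masters for the current membership (hypothetical helper)."""
    live = sorted(set(live_hosts))  # assumption: lexicographic hostname order
    masters: List[str] = []
    # 0.7.1 rule: the configured head is the primary whenever it is alive.
    if head is not None and head in live:
        masters.append(head)
    # Fill the remaining slots with the lowest-hostname live hosts.
    for host in live:
        if len(masters) >= max_masters:
            break
        if host not in masters:
            masters.append(host)
    return masters
```

Because the result depends only on the membership view and the configured head, hosts that see the same membership should arrive at the same master set without any extra coordination.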
0.7.0 - 2026-06-02
Added
- Multi-master discovery: no single meta SPOF. Every host runs a discovery server on the cluster-wide meta port; the active masters are the `max_masters` (default 3) lowest-hostname live hosts, derived from membership. A dead master is replaced automatically, and a cluster with fewer than `max_masters` hosts has all of them as masters. Clients register/heartbeat to all current masters plus the configured bootstrap seeds (`discovery_addr` may be a comma-separated list) and merge the membership; the soft-state registry repopulates within one heartbeat. New `max_masters` config and `DiscoveryClient.master_hosts()`. Backward compatible with a single `discovery_addr`.
0.6.9 - 2026-06-02
Fixed
- Cross-node reads from SGLang's generic `batch_get` now transfer. The local READ destination SGLang passes can sit outside the registered host KV pool, so `lkey_for(addr)` returned 0 and the work request was silently never posted (`read_failures` climbed with no completion error and no timeout). `RdmaContext` now lazily registers and caches an MR (`LOCAL_WRITE`) for an unregistered destination range; SGLang reuses a bounded set of host pages, so the cache converges after first touch. New `rdma_lazy_local_mrs` gauge.
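The lazy-registration idea above can be sketched in Python as a register-on-miss cache. Everything here is hypothetical: `LazyMrCache` is an invented name, and the `register` callable is a stand-in for the real memory-registration call in the C++ data plane, whose actual interface the changelog does not show.

```python
from typing import Callable, List, Tuple


class LazyMrCache:
    """Register-on-miss cache of local memory regions (illustrative only)."""

    def __init__(self, register: Callable[[int, int], int]) -> None:
        # `register(addr, length) -> lkey` is a hypothetical stand-in for
        # the real registration call.
        self._register = register
        self._ranges: List[Tuple[int, int, int]] = []  # (start, end, lkey)

    def lkey_for(self, addr: int, length: int) -> int:
        end = addr + length
        # Reuse any cached region that fully covers the requested range.
        for start, stop, lkey in self._ranges:
            if start <= addr and end <= stop:
                return lkey
        # Miss: register the range once and remember it.
        lkey = self._register(addr, length)
        self._ranges.append((addr, end, lkey))
        return lkey

    def __len__(self) -> int:
        return len(self._ranges)
```

With a bounded set of reused destination pages, the number of cached regions stops growing once every page has been touched, which is the convergence the entry describes.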
0.6.8 - 2026-06-02
Added
- Pre-wire read-failure counters to tell "failed on the wire" from "never posted": `rdma_local_reg_misses`, `rdma_post_failures`, `rdma_lease_failures`.
0.6.7 - 2026-06-02
Added
- RDMA READ completion-error visibility. `drain()` records the failing `ibv_wc_status` and logs `ibv_wc_status_str` (rate-limited); new `rdma_read_wc_errors` / `rdma_last_wc_status` gauges distinguish a remote-access error (bad rkey/MR, status 10) from retry-exceeded (GID/MTU/path, 12/13).
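As a reading aid for the `rdma_last_wc_status` gauge, the status codes named above can be decoded with a small table. The numeric values mirror the `ibv_wc_status` enum in libibverbs; the `classify_wc_status` helper and its category strings are hypothetical and not part of PeerCache.

```python
from enum import IntEnum


class WcStatus(IntEnum):
    """Subset of libibverbs' ibv_wc_status values relevant to this entry."""
    SUCCESS = 0
    REM_ACCESS_ERR = 10      # remote access error: bad rkey or MR
    RETRY_EXC_ERR = 12       # transport retry counter exceeded
    RNR_RETRY_EXC_ERR = 13   # receiver-not-ready retry counter exceeded


def classify_wc_status(status: int) -> str:
    """Map a raw status code to a coarse category (hypothetical helper)."""
    if status == WcStatus.SUCCESS:
        return "ok"
    if status == WcStatus.REM_ACCESS_ERR:
        return "remote-access"
    if status in (WcStatus.RETRY_EXC_ERR, WcStatus.RNR_RETRY_EXC_ERR):
        return "retry-exceeded"
    return "other"
```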
0.6.6 - 2026-06-02
Changed
- Heartbeat logging throttled to ~10s (membership/known-state changes still log immediately); the heartbeat cadence itself is unchanged.
0.6.5 - 2026-06-02
Fixed
- `batch_exists` probed the wrong keyspace, so reads never fired. SGLang's generic path writes one blob per raw key via `batch_set`, but `batch_exists` looked keys up as the suffixed K/V component keys used by the zero-copy v1/v2 path, so the prefetch probe missed every page (`exists_pages_found` stayed 0 while writes climbed) and SGLang never issued a `get`. `batch_exists` / `exists` now resolve keys through the active keyspace and self-heal on read-only nodes.
0.6.4 - 2026-06-02
Added
- `exists` / L3-prefetch observability: `exists_requests` and `exists_pages_found` make the SGLang prefetch path visible end to end.
Changed
- Directory lookup reused across `exists` → `get`. `batch_exists` primes the resident hit locations into a one-shot, short-TTL handoff cache that the imminent `batch_get` consumes, skipping the redundant second directory RPC. New `directory_lookups_saved` counter.
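A one-shot, short-TTL handoff cache of the kind described above might look like the following sketch. `HandoffCache`, `prime`, `take`, and the 0.5 s default TTL are all assumptions made for illustration; the changelog does not state the real names or the TTL value, and this version is not thread-safe.

```python
import time
from typing import Callable, Dict, Hashable, Optional, Tuple


class HandoffCache:
    """One-shot cache: each primed entry can be consumed at most once."""

    def __init__(self, ttl_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s          # assumed TTL, not taken from the project
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, object]] = {}
        self.lookups_saved = 0       # loosely mirrors directory_lookups_saved

    def prime(self, key: Hashable, location: object) -> None:
        """Called on the exists path with a resident hit location."""
        self._entries[key] = (self._clock() + self._ttl_s, location)

    def take(self, key: Hashable) -> Optional[object]:
        """Called on the get path; removes the entry whether or not it is fresh."""
        entry = self._entries.pop(key, None)
        if entry is None:
            return None
        expires_at, location = entry
        if self._clock() >= expires_at:
            return None              # stale: caller falls back to the directory
        self.lookups_saved += 1
        return location
```

Popping on every `take` keeps the cache from serving a location twice, so a stale hint can cost at most one failed read before the caller goes back to the directory.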
0.6.3 - 2026-06-02
Changed
- Discovery registration polls the meta indefinitely instead of failing on a timeout: a node started before the meta no longer crashes the host process; it waits, logging periodically, and proceeds once the meta is up.
- Expanded discovery logging for operability (node identity, master at startup, register/heartbeat/prune/membership-change events).
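The wait-for-meta behaviour can be sketched as a retry loop with periodic logging. The function name, the `try_register` callable, and both intervals are hypothetical; the changelog states only that registration polls indefinitely and logs periodically.

```python
import logging
import time
from typing import Callable, Optional

log = logging.getLogger("discovery-sketch")


def register_until_up(try_register: Callable[[], bool],
                      poll_interval_s: float = 1.0,
                      log_every_s: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> int:
    """Poll until registration succeeds; return the number of attempts."""
    attempts = 0
    last_log: Optional[float] = None
    while True:
        attempts += 1
        try:
            if try_register():
                return attempts
        except OSError:
            # Connection refused, unreachable host, etc.: meta is not up yet.
            pass
        now = clock()
        if last_log is None or now - last_log >= log_every_s:
            log.warning("meta not reachable yet (attempt %d), still waiting", attempts)
            last_log = now
        sleep(poll_interval_s)
```

The injectable `sleep` and `clock` arguments exist only so the loop can be exercised without real delays.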
0.6.2 - 2026-06-02
Fixed
- Generic value-based `set` / `batch_set` / `get` / `batch_get` now work. SGLang's HiCache page-backup calls `batch_set(hash_values, data)` and reads back via `batch_get(keys, dst_tensors)`; PeerCache previously only implemented the zero-copy form and crashed with an `AssertionError`. These now accept tensor-like objects, bytes, numpy arrays, or raw integer pointers.
- The v2 registration path (`register_mem_host_pool_v2`) never created the published pool (so `pool_capacity_bytes` stayed 0). v1 and v2 now share `_ensure_published_pool()` / `_register_recv()`.
Added
- "Multi-node Demo" and "Positioning & comparison" docs pages (EN/中文).
0.6.1 - 2026-06-01
Fixed
- SGLang dynamic-backend registration on newer SGLang ("Backend class PeerCacheStore must inherit from HiCacheStorage"): `HiCacheStorage` is now imported on its own and the optional names degrade independently, so `PeerCacheStore` always subclasses SGLang's real base.
Changed
- Refreshed the full-machine performance baseline: 8-NIC multi-process aggregate 273 → 413 GB/s (≈ 3.3 Tbps); added the GPUDirect result (49.5 GB/s) and the per-NIC range (25–89 GB/s).
0.6.0 - 2026-06-01
Added
- GPUDirect RDMA: the receive buffer may live in GPU memory (dmabuf via `ibv_reg_dmabuf_mr`, else a plain MR of the device VA with `nvidia-peermem`); `peercache-bench drive --gpu` measures it.
- Config validation with actionable errors; data-plane gauges (`read_failures`, `rdma_rails`, `rdma_read_timeouts`, `rdma_channel_discards`); directory wire-format version on `DataLocation`.
- Performance baseline docs page (EN/中文) with charts.
Changed
- Idempotent shutdown that deregisters first. Directory survives membership changes (each producer re-publishes its pages on a ring change); directory replication now defaults to 2 (`directory_replicas`). New `directory_republishes` metric.
0.5.1 - 2026-05-31
Fixed
- `peercache-bench serve` re-publishes on ring-membership changes, so back-to-back `drive` runs against one long-lived `serve` work without a restart.
0.5.0 - 2026-05-31
Added
- Single-process multi-rail (multi-NIC) reads. One `PeerCacheStore` process opens one rail per device (`device_names="mlx5_0,…"`) and stripes each batch of one-sided READs across all rails in one GIL-released call (`TransferEngine::batch_read_multi`), approaching the aggregate bandwidth of all NICs. `DataLocation` carries per-rail `rail_endpoints[]` / `rail_rkeys[]` (rail 0 stays wire-compatible). `--devices` added to `peercache-bench serve` / `drive`.
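Striping a batch across rails can be illustrated with a round-robin split. Round-robin is an assumption: the changelog does not say which striping policy `batch_read_multi` uses, and `stripe_across_rails` is a hypothetical helper rather than a PeerCache API.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def stripe_across_rails(reads: Sequence[T], n_rails: int) -> List[List[T]]:
    """Split one batch of read descriptors into per-rail sub-batches."""
    if n_rails < 1:
        raise ValueError("n_rails must be at least 1")
    # Rail r takes reads r, r + n_rails, r + 2 * n_rails, ...
    return [list(reads[r::n_rails]) for r in range(n_rails)]
```

For example, five reads over two rails yield sub-batches of three and two reads, so no rail carries more than one read beyond any other.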
Changed
- `TransferEngine` is multi-rail internally; `register_mr` returns one handle per rail; new `local_endpoints()` / `n_rails()`.
0.4.0 - 2026-05-31
Added
- Two-host (distributed) benchmark (`peercache-bench serve` / `drive`) for a genuine cross-host one-sided RDMA READ; `drive --processes N` to escape the GIL.
- Benchmark logging (`--log-level` / `--log-file`); `directory_read_cache_ttl` (default off); `max_channels_per_peer` config; `PEERCACHE_RDMA_OP_TIMEOUT_MS`.
Changed
- Vectorised read hot path (`TransferEngine::batch_read_v`, GIL-released).
- RDMA tuning: RC QPs use the port's negotiated active MTU; `drain()` reaps completions in batches of 16.
Fixed
- Benchmark stall on large pages: `HostKVPool.fill_slot` replaced a per-byte Python loop with a single `memmove` from a template.
- Indefinite hang on a stalled RDMA read: `drain()` gains a timeout (default 5 s) and discards timed-out channels; TCP QP-bootstrap sockets gain timeouts.
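The `fill_slot` fix replaces a per-byte loop with one bulk copy. A minimal sketch of that pattern using `ctypes.memmove` follows; the standalone function, its signature, and the bounds check are assumptions for illustration and do not reproduce the real `HostKVPool.fill_slot`.

```python
import ctypes


def fill_slot_from_template(pool: ctypes.Array, slot_offset: int,
                            template: bytes) -> None:
    """Copy `template` into `pool` at `slot_offset` with a single memmove."""
    if slot_offset < 0 or slot_offset + len(template) > ctypes.sizeof(pool):
        raise ValueError("slot does not fit inside the pool")
    dst = ctypes.addressof(pool) + slot_offset
    # One C-level copy instead of len(template) Python-level byte stores.
    ctypes.memmove(dst, template, len(template))
```

The copy runs entirely in C, so its cost no longer grows with one interpreter step per byte, which is what stalled the benchmark on large pages.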
0.3.0 - 2026-05-31
Added
- Concurrent multi-threaded reads/writes: a per-peer channel pool, each channel an RC QP with its own CQ (capped by `max_channels_per_peer`, default 16); matching TCP socket pool and per-call control-plane RPC pool.
- Benchmark suite (`peercache-bench`): drives the `HiCacheStorage` interface exactly as SGLang HiCache does, reporting throughput and latency tails across a thread-model sweep. New Benchmarks docs page.
Changed
- Shared client state is lock-guarded; broken pooled connections are closed.
- Default ports moved to the `31997-31999` band (metrics `31997`, discovery `31998`); `rdma_port` / `control_port` stay auto-assigned.
0.2.0 - 2026-05-31
Added
- Disk persistence tier (L4): published pages spill to disk (`disk_path`, default `/data/peercache/`, capped by `disk_size`, default `100GB`). Evicted pages stay in the directory as non-resident and are promoted back on a later read (locally, or on the owner via a `data_promote` RPC); `exists` hits kick a best-effort prefetch.
- Metrics + monitoring: Prometheus `/metrics` endpoint and an embedded HTML dashboard (default port `31997`).
Changed
- `DataLocation` gains a `resident` flag for disk-resident pages.
0.1.1 - 2026-05-31
Changed
- Embedded meta: removed the requirement for a separate meta process (the node whose IP equals `discovery_addr` auto-hosted it in-process; superseded by multi-master discovery in 0.7.0).
Added
- Bilingual (English / 中文) documentation with a language switcher.
0.1.0 - 2026-05-31
Initial release.
Added
- Decentralized architecture: service discovery plus a consistent-hash distributed directory (DHT) sharded across nodes, with no centralized master or metadata service.
- C++ RDMA data plane (`libibverbs`, RC QPs, one-sided `IBV_WR_RDMA_READ`) via `pybind11`; two-MR model; `PeerCacheStore` (`HiCacheStorage`) with v1/v2 and single-key/batch APIs; zero-touch SGLang `dynamic` backend integration; TCP fallback transport; MkDocs site and CI/docs/release workflows.
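For readers unfamiliar with consistent-hash directories, the following sketch shows how a page key can be mapped to owner nodes on a ring. It is a generic textbook construction, not PeerCache's implementation: the class name, the SHA-256 hash, the virtual-node count, and the `replicas` parameter are all assumptions.

```python
import bisect
import hashlib
from typing import Iterable, List


def _hash64(value: str) -> int:
    """Stable 64-bit hash (assumed; the project's hash is not documented here)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 64) -> None:
        unique = sorted(set(nodes))
        self._n_nodes = len(unique)
        self._points = sorted(
            (_hash64(f"{node}#{i}"), node) for node in unique for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def owners(self, key: str, replicas: int = 2) -> List[str]:
        """Walk clockwise from the key's position, collecting distinct nodes."""
        if not self._points:
            return []
        want = min(replicas, self._n_nodes)
        i = bisect.bisect_right(self._keys, _hash64(key))
        found: List[str] = []
        while len(found) < want:
            node = self._points[i % len(self._points)][1]
            if node not in found:
                found.append(node)
            i += 1
        return found
```

The useful property is locality of change: when a node joins or leaves, only keys whose clockwise walk touched that node get new owners, so most directory entries stay where they are.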