
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter



Abstract: Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates heavy KVCache traffic that keeps prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable queueing, and poor utilization. We present Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, enabling independent scaling of prefill and decode capacity across loosely coupled clusters. In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% and 32% higher serving throughput than homogeneous PD and naive heterogeneous baselines, respectively, while consuming only modest cross-datacenter bandwidth.

Comments: 16 pages, 5 figures, 6 tables

Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:2604.15039 [cs.DC] (or arXiv:2604.15039v1 [cs.DC] for this version)
https://doi.org/10.48550/arXiv.2604.15039 (arXiv-issued DOI via DataCite, pending registration)

Submission history
From: Ruoyu Qin
[v1] Thu, 16 Apr 2026 14:07:41 UTC (244 KB)
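
The abstract describes selective offloading and bandwidth-aware scheduling only at a high level. As a rough illustration of the idea, and not the paper's actual policy, the sketch below routes a request to a remote prefill cluster only when its uncached prompt is long and the estimated KVCache transfer time over the inter-cluster link stays within a budget. Every name, threshold, and the linear bytes-per-token cost model here is a hypothetical assumption made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total prompt length in tokens
    cached_prefix_tokens: int  # tokens already covered by a local prefix cache


def should_offload(
    req: Request,
    *,
    min_uncached_tokens: int,   # hypothetical threshold for "long-context"
    kv_bytes_per_token: int,    # assumed linear KVCache cost (a simplification)
    link_bytes_per_s: float,    # current inter-cluster bandwidth estimate
    link_backlog_bytes: float,  # bytes already queued on the link
    max_transfer_s: float,      # transfer-time budget for one request
) -> bool:
    """Return True if prefill should run on the remote prefill cluster."""
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    if uncached < min_uncached_tokens:
        # Short or mostly cached prompts are cheaper to prefill locally.
        return False

    # Assume the full prompt's KVCache is shipped back for decode, and that
    # it waits behind whatever is already queued on the link.
    kv_bytes = req.prompt_tokens * kv_bytes_per_token
    transfer_s = (link_backlog_bytes + kv_bytes) / link_bytes_per_s
    return transfer_s <= max_transfer_s


# Illustrative settings only; none of these numbers come from the paper.
SETTINGS = dict(
    min_uncached_tokens=8192,
    kv_bytes_per_token=10_000,
    link_bytes_per_s=1.25e9,  # roughly 10 Gbit/s
    max_transfer_s=2.0,
)

long_request = Request(prompt_tokens=100_000, cached_prefix_tokens=0)
offload_when_idle = should_offload(long_request, link_backlog_bytes=0, **SETTINGS)
offload_when_congested = should_offload(long_request, link_backlog_bytes=5e9, **SETTINGS)
```

With these made-up settings, the same long request is offloaded when the link is idle and kept local when the link already has a large backlog, which is the kind of congestion sensitivity the abstract attributes to bandwidth-aware scheduling. A real system would also need the cache-aware placement the abstract mentions, which this sketch reduces to a single `cached_prefix_tokens` field.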