2606.05875v1 Jun 04, 2026 cs.AI

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

Jia Zhu

Lei Chen

Haoyang Li

Wangze Ni

Peng Cheng

Jiabao Jin

Kui Ren

Jianxin Yan

Zhenxi Li

Zhitao Shen

Xuemin Lin

Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.
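
To make the selection step in the abstract concrete, here is a minimal toy sketch of how a cache-fusion selector might pick which context tokens to recompute. It is not the QCFuse algorithm. It assumes that some per-token relevance score (for example, query-to-context attention mass) is already available for each layer, sums those scores over a small set of "critical" layers, and keeps the highest-scoring tokens within a recomputation budget. The function name `select_recompute_tokens`, its parameters, and the score values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def select_recompute_tokens(
    scores_by_layer: Dict[int, Sequence[float]],
    critical_layers: Sequence[int],
    budget: int,
) -> List[int]:
    """Pick up to `budget` context-token indices to recompute.

    Hypothetical helper: scores are summed over `critical_layers` only,
    so the other layers are never inspected. Tokens that are not selected
    would keep their precomputed KV-cache entries.
    """
    if not critical_layers:
        raise ValueError("at least one critical layer is required")

    num_tokens = len(scores_by_layer[critical_layers[0]])
    totals = [0.0] * num_tokens
    for layer in critical_layers:
        row = scores_by_layer[layer]
        if len(row) != num_tokens:
            raise ValueError("all layers must score the same tokens")
        for idx, score in enumerate(row):
            totals[idx] += score

    # Rank by descending total score; break ties by token position.
    ranked = sorted(range(num_tokens), key=lambda i: (-totals[i], i))
    keep = max(0, min(budget, num_tokens))
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Made-up scores for four context tokens at three layers.
    scores = {
        0: [0.1, 0.9, 0.2, 0.3],
        5: [0.7, 0.1, 0.1, 0.1],
        9: [0.0, 0.2, 0.8, 0.1],
    }
    # Only layers 5 and 9 are treated as critical; layer 0 is ignored.
    print(select_recompute_tokens(scores, critical_layers=[5, 9], budget=2))
```

In this toy setting, token 1 scores highest at layer 0 but is not selected, because layer 0 is outside the assumed critical set. How the critical layers and the scores are actually obtained is the paper's contribution and is not modeled here.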
