2606.05875v1 Jun 04, 2026 cs.AI

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

Jia Zhu
Lei Chen
Haoyang Li
Wangze Ni
Peng Cheng
Jiabao Jin
Kui Ren
Jianxin Yan
Zhenxi Li
Zhitao Shen
Xuemin Lin

Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.
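
The general pattern the abstract describes (reuse cached per-chunk KV entries, pick a small set of context tokens that matter for the current query, and recompute only those) can be illustrated with a toy sketch. This is not the QCFuse selector: the dot-product scoring, the `budget_ratio` parameter, and the names `select_recompute_tokens` and `fuse` are all assumptions made for illustration, and real systems operate on per-layer attention tensors rather than plain lists.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_recompute_tokens(
    query_vec: Sequence[float],
    token_keys: Sequence[Sequence[float]],
    budget_ratio: float,
) -> List[int]:
    """Pick the context tokens most relevant to the query.

    Hypothetical scoring rule: relevance is the dot product between one
    query vector and each cached token key. Returns token indices in
    ascending order.
    """
    if not 0.0 <= budget_ratio <= 1.0:
        raise ValueError("budget_ratio must be in [0, 1]")
    if not token_keys:
        return []
    scores = [dot(query_vec, key) for key in token_keys]
    # Recompute at least one token so the query can influence the cache.
    budget = max(1, round(budget_ratio * len(token_keys)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def fuse(
    cached_kv: Sequence[str],
    recompute_fn: Callable[[int], str],
    selected: Sequence[int],
) -> List[str]:
    """Reuse cached entries, replacing only the selected positions."""
    fused = list(cached_kv)
    for i in selected:
        fused[i] = recompute_fn(i)
    return fused


if __name__ == "__main__":
    keys = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5], [0.8, 0.0]]
    picked = select_recompute_tokens([1.0, 0.0], keys, budget_ratio=0.5)
    print(picked)
    print(fuse(["c0", "c1", "c2", "c3"], lambda i: f"r{i}", picked))
```

In this toy setting the selection cost grows with the number of tokens scored, which is the tension the abstract points to: a selector that must see every token at every layer is accurate but slow, while one that scores against a compact summary is cheaper but needs care to stay query-aware.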
