2601.12904v1 Jan 19, 2026 cs.CL

From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation

Jing Tang (Citations: 169; h-index: 5)
Jiahao Wang (Citations: 16; h-index: 2)
Weiyu Xie (Citations: 17; h-index: 2)
Mingxing Zhang (Citations: 291; h-index: 3)
Jianwei Dong (Citations: 16; h-index: 2)
Yuening Zhu (Citations: 16; h-index: 2)
Chen Lin (Citations: 16; h-index: 2)
Yaochen Han (Citations: 2; h-index: 1)
Zhiyuan Ai (Citations: 16; h-index: 2)
Xianglin Chen (Citations: 16; h-index: 2)
Yongwei Wu (Citations: 331; h-index: 8)
Cong Jiang (Citations: 5; h-index: 2)
Bo Zhang (Citations: 16; h-index: 2)

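Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, which reduces hallucinations but also lengthens the prompt. Longer prompts raise computational cost and increase Time to First Token (TTFT). To mitigate this, existing solutions reuse the preprocessed KV cache of each retrieved text chunk to speed up RAG. However, the lack of contextual information across chunks degrades generation quality and greatly diminishes the potential benefit of KV cache reuse. The key challenge is how to reuse the precomputed KV cache of chunks while maintaining generation quality. This paper proposes FusionRAG, a new inference framework that optimizes both the preprocessing and reprocessing stages of RAG. In the offline preprocessing stage, it embeds information from related text chunks into each chunk; in the online reprocessing stage, it recomputes the KV cache for the tokens the model focuses on. As a result, FusionRAG strikes a better balance between generation quality and efficiency. In experiments, FusionRAG substantially improves generation quality over previous state-of-the-art solutions at the same recomputation ratio. By recomputing fewer than 15% of all tokens, FusionRAG achieves up to 70% higher normalized F1 scores than the baselines and reduces TTFT by 2.66x to 9.39x compared with Full Attention.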

Original Abstract

Retrieval-Augmented Generation enhances Large Language Models by integrating external knowledge, which reduces hallucinations but increases prompt length. This increase leads to higher computational costs and longer Time to First Token (TTFT). To mitigate this issue, existing solutions aim to reuse the preprocessed KV cache of each retrieved chunk to accelerate RAG. However, the lack of cross-chunk contextual information leads to a significant drop in generation quality, leaving the potential benefits of KV cache reuse largely unfulfilled. The challenge lies in how to reuse the precomputed KV cache of chunks while preserving generation quality. We propose FusionRAG, a novel inference framework that optimizes both the preprocessing and reprocessing stages of RAG. In the offline preprocessing stage, we embed information from other related text chunks into each chunk, while in the online reprocessing stage, we recompute the KV cache for tokens that the model focuses on. As a result, we achieve a better trade-off between generation quality and efficiency. According to our experiments, FusionRAG significantly improves generation quality at the same recomputation ratio compared to previous state-of-the-art solutions. By recomputing fewer than 15% of the tokens, FusionRAG achieves up to 70% higher normalized F1 scores than baselines and reduces TTFT by 2.66x-9.39x compared to Full Attention.
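
The online reprocessing step described in the abstract (recomputing the KV cache only for tokens the model focuses on, under a fixed recomputation ratio) can be illustrated with a small sketch. This is not FusionRAG's actual selection procedure, which the abstract does not specify. It assumes that each cached token already has a single importance score (for example, an aggregated attention weight) and simply picks the top-scoring positions within the budget. The function name `select_tokens_to_recompute`, the `attention_scores` input, and the example values are all hypothetical.

```python
from typing import List, Sequence


def select_tokens_to_recompute(
    attention_scores: Sequence[float], recompute_ratio: float = 0.15
) -> List[int]:
    """Return positions of the highest-scoring tokens, in ascending order.

    attention_scores: one hypothetical importance score per cached token
        (an assumption; the abstract does not say how focus is measured).
    recompute_ratio: fraction of tokens whose KV entries get recomputed.
    """
    if not 0.0 <= recompute_ratio <= 1.0:
        raise ValueError("recompute_ratio must be in [0, 1]")
    # Number of tokens we are allowed to recompute under the budget.
    budget = int(len(attention_scores) * recompute_ratio)
    # Rank positions by score, highest first; ties keep the earlier position.
    ranked = sorted(range(len(attention_scores)), key=lambda i: -attention_scores[i])
    # Return the chosen positions in document order.
    return sorted(ranked[:budget])


# Illustrative scores for a 10-token chunk (made-up values).
scores = [0.01, 0.30, 0.02, 0.05, 0.25, 0.01, 0.03, 0.02, 0.20, 0.11]
selected = select_tokens_to_recompute(scores, recompute_ratio=0.3)
print(selected)  # positions whose cached KV entries would be refreshed
```

In a real serving stack, the KV entries at the selected positions would be recomputed with full cross-chunk attention, while all other positions keep their precomputed cache. The ratio then controls the trade-off between generation quality and TTFT that the abstract reports.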

2 Citations

1 Influential

4 Altmetric

24.0 Score

