2601.08743v1 Jan 13, 2026 cs.CL

TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL

Jinbo Su

Yuxuan Hu

Cuiping Li

Hong Chen

Renmin University of China

Jia Li

Lintao Ma

Jing Zhang

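In Text-to-SQL tasks, existing LLM-based methods often include extensive database schemas in the prompt, which lengthens the context and increases prefill latency. Because user queries typically concentrate on a recurring set of tables, there is an opportunity to share KV caches across queries. However, current inference engines (e.g., SGLang, vLLM) create redundant prefix cache copies when the table order differs between user queries. To address this inefficiency, we propose precomputing table representations as KV caches in advance and looking up the required caches online. The core of our approach is to compute the table caches while preserving the primary-foreign key relationships between tables. We also build a Table Trie structure to support efficient KV cache lookup during inference. To improve cache performance, we introduce a cache management system that comprises a query reranking strategy to raise the cache hit rate and a computation loading pipeline that parallelizes model inference and cache loading. Experiments show that the proposed TableCache achieves up to a 3.62x speedup in Time to First Token (TTFT) with negligible performance degradation.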
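The table-order problem described above can be illustrated with a small sketch. It is not taken from the paper: the table names are hypothetical, each name stands in for one table's schema text in the prompt, and alphabetical sorting is used only as a stand-in for a fixed ordering (the paper's own ordering is guided by primary-foreign key relationships, which this sketch does not model). It shows that a prefix cache can reuse only the leading blocks two prompts share.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the leading schema blocks that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical table names; each stands for one table's schema text.
query_1 = ["orders", "customers", "products"]
query_2 = ["customers", "orders", "products"]

# Same tables, different order: the prompts diverge at the first block,
# so a prefix cache has nothing to reuse.
reuse_raw = shared_prefix_len(query_1, query_2)

# With a fixed order (alphabetical here, purely for illustration),
# the two prompts share every schema block.
reuse_canonical = shared_prefix_len(sorted(query_1), sorted(query_2))

print(reuse_raw, reuse_canonical)
```

In this toy case the raw prompts share no leading blocks, while the consistently ordered prompts share all three.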

Original Abstract

In Text-to-SQL tasks, existing LLM-based methods often include extensive database schemas in prompts, leading to long context lengths and increased prefilling latency. While user queries typically focus on recurrent table sets, offering an opportunity for KV cache sharing across queries, current inference engines, such as SGLang and vLLM, generate redundant prefix cache copies when processing user queries with varying table orders. To address this inefficiency, we propose precomputing table representations as KV caches offline and querying the required ones online. A key aspect of our approach is the computation of table caches while preserving primary foreign key relationships between tables. Additionally, we construct a Table Trie structure to facilitate efficient KV cache lookups during inference. To enhance cache performance, we introduce a cache management system with a query reranking strategy to improve cache hit rates and a computation loading pipeline for parallelizing model inference and cache loading. Experimental results show that our proposed TableCache achieves up to a 3.62x speedup in Time to First Token (TTFT) with negligible performance degradation.
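The abstract does not specify how the Table Trie is laid out, so the following is only a minimal sketch of one plausible reading: a trie keyed by table-name sequences, where a node may hold a handle to a precomputed KV cache and lookup returns the longest cached prefix. The class names, the `cache_id` string handles, and the table names are all hypothetical, and no real KV tensors are involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # Placeholder for a handle to a precomputed KV cache (assumption).
    cache_id: Optional[str] = None


class TableTrie:
    """Toy trie over table-name sequences; not the paper's implementation."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tables: List[str], cache_id: str) -> None:
        """Register a cache handle for an ordered sequence of tables."""
        node = self.root
        for name in tables:
            node = node.children.setdefault(name, TrieNode())
        node.cache_id = cache_id

    def longest_match(self, tables: List[str]) -> Tuple[int, Optional[str]]:
        """Return (matched table count, cache handle) for the longest cached prefix."""
        node = self.root
        best: Tuple[int, Optional[str]] = (0, None)
        for depth, name in enumerate(tables, start=1):
            child = node.children.get(name)
            if child is None:
                break
            node = child
            if node.cache_id is not None:
                best = (depth, node.cache_id)
        return best


trie = TableTrie()
trie.insert(["customers"], "kv_customers")
trie.insert(["customers", "orders"], "kv_customers_orders")

hit = trie.longest_match(["customers", "orders", "products"])
miss = trie.longest_match(["products"])
print(hit, miss)
```

Under these assumptions, a request for `customers`, `orders`, `products` would reuse the cached two-table prefix and only need fresh computation for `products`, while a request starting with an uncached table finds no match.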
