2604.21231v1 Apr 23, 2026 cs.NI

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

Zhengru Fang (Citations: 473, h-index: 11)
Hongyao Liu (Citations: 10, h-index: 2)
L. Zhai (Citations: 4, h-index: 1)
Junyi Wang (Citations: 8, h-index: 2)
Jingshu Chen (Citations: 0, h-index: 0)
Jun Huang (Citations: 11, h-index: 2)

Abstract

Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x to 5.1x with negligible impact on response quality, while lowering per-request energy consumption by 1.5x to 3.3x, demonstrating its robustness and practicality for real-world on-device deployment.
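The per-chunk stream-or-compute decision described in the abstract can be illustrated with a toy scheduler. This is only a sketch under simplifying assumptions and is not SparKV's actual algorithm: the `Chunk` fields, the `plan_loading` function, and the greedy rule are all hypothetical, each chunk's cost is assumed to be known in advance, chunks are assumed to be independent of one another, and the network and the on-device compute are assumed to run fully in parallel.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    kv_bytes: int           # size of the precomputed KV data for this chunk (assumed known)
    prefill_seconds: float  # estimated on-device time to compute this chunk's KV (assumed known)


def plan_loading(chunks, bandwidth_bytes_per_s):
    """Greedily assign each chunk to the path that would finish it sooner.

    Returns the per-chunk plan and the estimated time until all KV data
    is ready, taken as the later of the two paths' finish times.
    """
    net_busy = 0.0  # time at which the network path becomes free
    cpu_busy = 0.0  # time at which the on-device compute path becomes free
    plan = []
    for chunk in chunks:
        stream_done = net_busy + chunk.kv_bytes / bandwidth_bytes_per_s
        compute_done = cpu_busy + chunk.prefill_seconds
        if stream_done <= compute_done:
            plan.append("stream")
            net_busy = stream_done
        else:
            plan.append("compute")
            cpu_busy = compute_done
    # The two paths overlap, so the total is the slower path, not the sum.
    return plan, max(net_busy, cpu_busy)


if __name__ == "__main__":
    chunks = [Chunk(kv_bytes=100, prefill_seconds=1.0) for _ in range(4)]
    print(plan_loading(chunks, bandwidth_bytes_per_s=100.0))
    print(plan_loading(chunks, bandwidth_bytes_per_s=1.0))
```

In this toy model, a runtime adjustment of the kind the abstract mentions could be approximated by calling `plan_loading` again on the remaining chunks whenever the measured bandwidth changes. How SparKV actually refines its schedules is not specified in the abstract.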
