2602.13165v1 Feb 13, 2026 cs.IR

Asynchronous Verified Semantic Caching for Tiered LLM Architectures

Haozhe Wang

Citations: 612

h-index: 6

Asmita Singh

Citations: 18

h-index: 3

Laxmi Naga Santosh Attaluri

Citations: 2

h-index: 1

T. Chiam

Citations: 2

h-index: 1

Weihua Zhu

Citations: 164

h-index: 7

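Large language models (LLMs) sit on the critical path of search, assistance, and agentic workflows, so semantic caching is essential for reducing inference cost and latency. Production deployments typically use a tiered design that combines a static cache with a dynamic cache: the static cache stores curated, vetted responses mined from logs, and the dynamic cache is populated online. Both tiers are usually governed by a single embedding similarity threshold, so a conservative threshold misses safe reuse opportunities while an aggressive one risks serving semantically incorrect responses. This paper introduces Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When a prompt's nearest static entry falls just below the static threshold, Krites asynchronously invokes an LLM judge to check whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, so future repeats and paraphrases can reuse the curated static answer, which expands static coverage over time. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers by up to 3.9 times relative to tuned baselines, with unchanged critical-path latency.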

Original Abstract

Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline-vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When the nearest static neighbor of the prompt falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, allowing future repeats and paraphrases to reuse curated static answers and expanding static reach over time. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers (direct static hits plus verified promotions) by up to 3.9 times for conversational traffic and search-style queries relative to tuned baselines, with unchanged critical path latency.
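The policy described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the class name `KritesSketch`, the `gray_margin` band that stands in for "just below the static threshold", the threshold defaults, the brute-force cosine search, and the `embed`, `backend`, and `judge` callables are all assumptions made for the example. The pending-job dictionary stands in for whatever background worker a real deployment would use.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Vector = Tuple[float, ...]  # embeddings must be hashable tuples in this sketch


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    embedding: Vector
    response: str
    curated: bool  # True if the response comes from the static tier


def nearest(entries, query: Vector):
    """Return (entry, similarity) for the most similar entry, or (None, -1.0)."""
    best, best_sim = None, -1.0
    for entry in entries:
        sim = cosine(entry.embedding, query)
        if sim > best_sim:
            best, best_sim = entry, sim
    return best, best_sim


@dataclass
class KritesSketch:
    static: List[Entry]
    embed: Callable[[str], Vector]      # hypothetical embedding function
    backend: Callable[[str], str]       # hypothetical LLM call on a cache miss
    judge: Callable[[str, str], bool]   # hypothetical judge: (prompt, response) -> ok?
    tau_static: float = 0.95            # assumed value
    tau_dynamic: float = 0.95           # assumed value
    gray_margin: float = 0.10           # assumed width of the "just below" band
    dynamic: Dict[Vector, Entry] = field(default_factory=dict)
    pending: Dict[Vector, Tuple[str, Entry]] = field(default_factory=dict)

    def serve(self, prompt: str) -> Tuple[str, str]:
        """Critical path: plain threshold lookups, never waits on the judge."""
        q = self.embed(prompt)
        s_hit, s_sim = nearest(self.static, q)
        if s_hit is not None and s_sim >= self.tau_static:
            return s_hit.response, "static"

        d_hit, d_sim = nearest(self.dynamic.values(), q)
        d_ok = d_hit is not None and d_sim >= self.tau_dynamic

        # Near miss on the static tier: queue a judge job and move on.
        in_gray_zone = (
            s_hit is not None and s_sim >= self.tau_static - self.gray_margin
        )
        if in_gray_zone and not (d_ok and d_hit.curated):
            self.pending[q] = (prompt, s_hit)

        if d_ok:
            return d_hit.response, "promoted" if d_hit.curated else "dynamic"

        answer = self.backend(prompt)
        self.dynamic[q] = Entry(q, answer, curated=False)
        return answer, "backend"

    def run_pending(self) -> int:
        """Off the critical path: judge queued near misses and promote approvals."""
        promoted = 0
        while self.pending:
            q, (prompt, s_hit) = self.pending.popitem()
            if self.judge(prompt, s_hit.response):
                # Upsert: the curated static answer replaces any entry for this prompt.
                self.dynamic[q] = Entry(q, s_hit.response, curated=True)
                promoted += 1
        return promoted
```

In this sketch, `serve` returns the same answer a plain threshold policy would return at the moment of the request. Only later requests benefit, once `run_pending` has promoted an approved static answer into the dynamic tier. That is one way to read the abstract's claim that critical-path latency is unchanged.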

2 Citations

0 Influential

3.5 Altmetric

19.5 Score
