2604.22881v1 Apr 24, 2026 cs.LG

MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

Xin Wang, Chi Ma, Shao-nan Chen, Pu Wang, Qiaorui Chen, Jiayu Sun, Shijie Liu, Zehuan Wang, Lei Yu, Chuan Liu, Fei Jiang, Wei Lin, Hao Wang, Jiawei Jiang, Xiao Yan, Meng Zhou, J. Qiu

Original Abstract

Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1× speedup while maintaining near-perfect hit ratios (>98.5%).
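The central idea of the abstract is to keep per-user KV state in a small GPU tier backed by a larger host-RAM tier. A toy two-tier cache illustrates it. This is a minimal sketch under stated assumptions and not MTServe's implementation: both tiers are plain Python `OrderedDict`s, capacities count entries rather than bytes, and plain LRU stands in for the paper's locality-driven replacement policy. The class and method names (`TwoTierKVCache`, `get`, `put`, `hit_ratio`) are hypothetical.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Hypothetical GPU/host two-tier cache keyed by user ID (illustrative only).

    Assumptions: each tier is an OrderedDict kept in LRU order (oldest first);
    capacities count entries, not bytes; moving an entry between tiers is a
    dict move, not a real host<->device copy.
    """

    def __init__(self, gpu_capacity: int, host_capacity: int):
        self.gpu = OrderedDict()   # small, fast tier (stands in for GPU memory)
        self.host = OrderedDict()  # larger backup tier (stands in for host RAM)
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity
        self.hits = 0
        self.lookups = 0

    def _insert_gpu(self, user_id, kv):
        self.gpu[user_id] = kv
        self.gpu.move_to_end(user_id)  # mark as most recently used
        # Demote least-recently-used entries from the GPU tier to the host tier.
        while len(self.gpu) > self.gpu_capacity:
            old_id, old_kv = self.gpu.popitem(last=False)
            self.host[old_id] = old_kv
            self.host.move_to_end(old_id)
            # Drop the oldest host entries once the host tier is also full.
            while len(self.host) > self.host_capacity:
                self.host.popitem(last=False)

    def get(self, user_id):
        """Return cached KV state or None; a host-tier hit is promoted to the GPU tier."""
        self.lookups += 1
        if user_id in self.gpu:
            self.hits += 1
            self.gpu.move_to_end(user_id)
            return self.gpu[user_id]
        if user_id in self.host:
            self.hits += 1
            kv = self.host.pop(user_id)
            self._insert_gpu(user_id, kv)
            return kv
        return None  # miss: the caller must re-encode the user history

    def put(self, user_id, kv):
        """Store freshly computed KV state for a user in the GPU tier."""
        self.host.pop(user_id, None)  # keep each user in at most one tier
        self._insert_gpu(user_id, kv)

    def hit_ratio(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0


cache = TwoTierKVCache(gpu_capacity=2, host_capacity=4)
for uid in ["a", "b", "c"]:
    if cache.get(uid) is None:
        cache.put(uid, f"kv-{uid}")  # placeholder for real KV tensors
print(cache.get("a"), cache.hit_ratio())
```

In this sketch a host-tier hit still counts toward the hit ratio, which shows how a larger backup tier can keep hit ratios high while the GPU tier stays small. The toy model ignores the real cost of a host hit (a host-to-device transfer), which is the I/O gap the abstract says MTServe's hybrid storage layout and asynchronous transfer pipeline are meant to bridge.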
