2602.03921v1 Feb 03, 2026 cs.LG

SpecMD: A Comprehensive Study On Speculative Expert Prefetching

Duc N. M. Hoang

Citations: 31

h-index: 4

Ajay Jaiswal

Citations: 2

h-index: 1

Mohammad Samragh

Citations: 109

h-index: 5

Minsik Cho

Citations: 24

h-index: 3

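Mixture-of-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model's parameters is used in each inference pass. Turning this sparsity into real performance, however, requires an expert caching mechanism. Prior work has proposed hardware-centric caching policies, but how these policies interact with one another, and how they are affected by different hardware specifications, has remained poorly understood. To close this gap, we developed SpecMD, a standardized framework for benchmarking arbitrary caching policies across a range of hardware configurations. Using SpecMD, we thoroughly benchmarked several MoE caching strategies, reproducing and extending existing approaches in realistic, constrained settings. Our experiments show that MoE expert access patterns do not match temporal locality assumptions (e.g., LRU, LFU). Based on this observation, we propose Least-Stale, a new eviction policy that exploits MoE's predictable expert access patterns to reduce collision misses by up to 85×. With these gains, we achieve a hit rate above 88% on OLMoE with a VRAM cache of only 5%, or 0.6 GB, and reduce time-to-first-token (TTFT) by up to 34.7%.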

Original Abstract

Mixture-of-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model's parameters is used during each inference. However, to translate this sparsity into practical performance, an expert caching mechanism is required. Previous works have proposed hardware-centric caching policies, but how these caching policies interact with each other and with different hardware specifications remains poorly understood. To address this gap, we develop SpecMD, a standardized framework for benchmarking ad-hoc cache policies on various hardware configurations. Using SpecMD, we perform an exhaustive benchmarking of several MoE caching strategies, reproducing and extending prior approaches in controlled settings with realistic constraints. Our experiments reveal that MoE expert access is not consistent with temporal locality assumptions (e.g., LRU, LFU). Motivated by this observation, we propose Least-Stale, a novel eviction policy that exploits MoE's predictable expert access patterns to reduce collision misses by up to 85× over LRU. With such gains, we achieve over 88% hit rates with up to 34.7% Time-to-first-token (TTFT) reduction on OLMoE at only 5% or 0.6 GB of VRAM cache capacity.
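
To make the hit-rate metric concrete, the following is a minimal sketch of how an expert cache with LRU eviction could be replayed against an access trace. It is not SpecMD and it does not implement Least-Stale, whose details the abstract does not give. The function names, the synthetic trace generator, the uniform-random routing, and the layer/expert counts are all assumptions chosen for illustration only.

```python
from collections import OrderedDict
import random


def lru_hit_rate(trace, capacity):
    """Replay a sequence of expert keys through an LRU cache; return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[key] = True
    return hits / len(trace) if trace else 0.0


def synthetic_trace(num_tokens, num_layers, experts_per_layer, top_k, seed=0):
    """Hypothetical trace: each token visits the layers in order and picks
    top_k experts per layer uniformly at random (real routers are not uniform)."""
    rng = random.Random(seed)
    trace = []
    for _ in range(num_tokens):
        for layer in range(num_layers):
            for expert in rng.sample(range(experts_per_layer), top_k):
                trace.append((layer, expert))
    return trace


if __name__ == "__main__":
    # Illustrative sizes only (assumed, not taken from the paper).
    layers, experts, top_k = 16, 64, 8
    trace = synthetic_trace(200, layers, experts, top_k)
    total_experts = layers * experts
    for fraction in (0.05, 0.25, 1.0):
        capacity = max(1, int(total_experts * fraction))
        rate = lru_hit_rate(trace, capacity)
        print(f"cache={fraction:.0%} of experts, LRU hit rate={rate:.3f}")
```

Because every token walks through the layers in the same order, the trace is roughly cyclic: by the time a layer is revisited, a small LRU cache has typically already evicted that layer's experts. This is one intuition for why recency-based policies can fit MoE access poorly, though the toy trace above says nothing about the paper's measured numbers.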
