2602.01762v1 Feb 02, 2026 cs.AI

PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models

Xuliang Wang

Maochan Zhen

Yuetao Chen

Fang Liu

Xin Zheng

Xing Liu

Hong Xu

Ming Li

Abstract

Large Language Models (LLMs), constrained by their auto-regressive nature, suffer from slow decoding. Speculative decoding methods have emerged as a promising solution to accelerate LLM decoding, attracting attention from both systems and AI research communities. Recently, the pursuit of better draft quality has driven a trend toward parametrically larger draft models, which inevitably introduces substantial computational overhead. While existing work attempts to balance the trade-off between prediction accuracy and compute latency, we address this fundamental dilemma through architectural innovation. We propose PRISM, which disaggregates the computation of each predictive step across different parameter sets, refactoring the computational pathways of draft models to successfully decouple model capacity from inference cost. Through extensive experiments, we demonstrate that PRISM outperforms all existing draft architectures, achieving exceptional acceptance lengths while maintaining minimal draft latency for superior end-to-end speedup. We also re-examine scaling laws with PRISM, revealing that PRISM scales more effectively with expanding data volumes than other draft architectures. Through rigorous and fair comparison, we show that PRISM boosts the decoding throughput of an already highly optimized inference engine by more than 2.6x.
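The draft-then-verify loop that the abstract refers to can be illustrated with a minimal toy sketch. This is not the PRISM implementation: it uses greedy token matching rather than speculative sampling, it calls the target sequentially where a real engine would verify all draft tokens in one parallel forward pass, and every name in it (`draft_tokens`, `verify_greedy`, `step_models`, the toy models) is hypothetical. The one idea borrowed from the abstract is that each draft step may use its own parameter set, modeled here as one callable per step, so adding steps adds capacity without making any single step more expensive.

```python
from typing import Callable, List, Sequence

Token = int
NextToken = Callable[[Sequence[Token]], Token]  # maps a context to the next token


def draft_tokens(prefix: Sequence[Token], step_models: Sequence[NextToken]) -> List[Token]:
    """Propose one token per draft step; step k uses its own model (parameter set) k."""
    proposed: List[Token] = []
    for model in step_models:
        proposed.append(model(list(prefix) + proposed))
    return proposed


def verify_greedy(prefix: Sequence[Token], proposed: Sequence[Token], target: NextToken) -> List[Token]:
    """Return the tokens emitted this round.

    Draft tokens are kept while they match the target's greedy choice.
    On the first mismatch the target's token is emitted instead and the round ends.
    If every draft token matches, the target contributes one extra token.
    """
    emitted: List[Token] = []
    for tok in proposed:
        expected = target(list(prefix) + emitted)
        if tok != expected:
            emitted.append(expected)  # correction supplied by the target model
            return emitted
        emitted.append(tok)
    emitted.append(target(list(prefix) + emitted))  # extra token after full acceptance
    return emitted


# Toy models (assumptions for illustration only): the target counts upward mod 10.
def target_model(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def good_step(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10  # agrees with the target


def bad_step(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 2) % 10  # disagrees with the target


proposed = draft_tokens([0], [good_step, good_step, bad_step])
emitted = verify_greedy([0], proposed, target_model)
print(proposed)  # [1, 2, 4]
print(emitted)   # [1, 2, 3]
```

In this toy run the first two draft tokens are accepted and the third is replaced by the target's choice, so the round emits three tokens for a single verification step. The more draft tokens the target accepts per round, and the cheaper each draft step is, the larger the end-to-end speedup, which is the trade-off the abstract describes.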
