2603.12617v1 Mar 13, 2026 cs.LG

When Drafts Evolve: Speculative Decoding Meets Online Learning

Yichao Fu

Citations: 395

h-index: 4

Yuanpan Qian

Citations: 1

h-index: 1

Hao Wu

Citations: 22

h-index: 3

Hao Zhang

Citations: 26

h-index: 1

Pengfei Zhao

Citations: 1

h-index: 1

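Speculative decoding is a widely used method for accelerating large language model inference: a lightweight draft model quickly generates candidate tokens, and a larger target model verifies them in parallel. Because of its limited capacity, however, the draft model often fails to approximate the target distribution accurately, which shortens the accepted token length and reduces the speedup. The key point is that speculative decoding provides, at no additional cost, verification feedback that quantifies the gap between the draft and target models. This process naturally forms an iterative "draft generates, feedback arrives, draft improves" loop, which precisely mirrors the online learning paradigm. Motivated by this connection, we propose OnlineSpec, a unified framework that systematically uses this interactive feedback to continuously improve the draft model. Building on dynamic regret minimization, we establish a formal link between online learning performance and the acceleration rate of the speculative system, and we develop new algorithms using modern online learning techniques: optimistic online learning, which reuses past gradients as predictive update hints, and online ensemble learning, which dynamically maintains multiple draft models. Our algorithms are theoretically grounded and achieve up to 24% speedup across seven benchmarks and three foundation models.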
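
For readers unfamiliar with the two quantities the abstract connects, the block below writes them in their standard textbook forms. It is only a sketch: the abstract does not give the paper's exact loss or comparator sequence, so the symbols p (target distribution), q (draft distribution), f_t (per-round loss), theta_t (draft parameters), and u_t (comparators) are assumptions introduced here for illustration.

```latex
% Standard speculative sampling: a drafted token x ~ q is accepted with
% probability min(1, p(x)/q(x)), so the per-token acceptance rate alpha is
% one minus the total variation distance between target p and draft q.
\[
  \Pr[\text{accept } x] = \min\!\left(1, \frac{p(x)}{q(x)}\right),
  \qquad
  \alpha = \sum_{x} q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right)
         = \sum_{x} \min\bigl(p(x), q(x)\bigr)
         = 1 - \mathrm{TV}(p, q).
\]
% Dynamic regret: cumulative loss of the draft parameters theta_1..theta_T
% measured against a time-varying comparator sequence u_1..u_T.
\[
  \mathrm{D\text{-}Reg}_T(u_1, \dots, u_T)
  = \sum_{t=1}^{T} f_t(\theta_t) - \sum_{t=1}^{T} f_t(u_t).
\]
```

Under these definitions, a draft whose loss tracks the comparators more closely has a distribution nearer the target, and so a higher acceptance rate; the precise statement of that link is the paper's contribution and is not reproduced here.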

Original Abstract

Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidly generates candidate tokens that are then verified in parallel by a larger target model. However, due to limited model capacity, drafts often struggle to approximate the target distribution, resulting in shorter acceptance lengths and diminished speedup. A key yet under-explored observation is that speculative decoding inherently provides verification feedback that quantifies the deviation between the draft and target models at no additional cost. This process naturally forms an iterative "draft commits-feedback provides-draft adapts" evolving loop, which precisely matches the online learning paradigm. Motivated by this connection, we propose OnlineSpec, a unified framework that systematically leverages interactive feedback to continuously evolve draft models. Grounded in dynamic regret minimization, we establish a formal link between online learning performance and speculative system's acceleration rate, and develop novel algorithms via modern online learning techniques, including optimistic online learning that adaptively reuses historical gradients as predictive update hints, and online ensemble learning that dynamically maintains multiple draft models. Our algorithms are equipped with theoretical justifications and improved acceleration rates, achieving up to 24% speedup over seven benchmarks and three foundation models.
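
The "draft commits, feedback provides, draft adapts" loop can be illustrated with a toy example. The sketch below is not the OnlineSpec algorithm: it assumes a context-free target distribution over a four-token vocabulary, a draft parameterized by softmax logits, and a textbook optimistic gradient update that reuses the previous gradient as a hint. All function names (`acceptance_rate`, `expected_tokens_per_round`, `run_online_draft`) and hyperparameters are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def acceptance_rate(p, q):
    """Per-token acceptance probability of standard speculative sampling.

    Equals sum_x min(p(x), q(x)), i.e. 1 - TV(p, q).
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def expected_tokens_per_round(alpha, gamma):
    """Expected tokens emitted per verification round.

    Assumes each of the gamma drafted tokens is accepted independently
    with probability alpha (a simplifying assumption).
    """
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def run_online_draft(p, steps=200, lr=0.5, optimistic=True):
    """Toy online adaptation of a draft distribution toward target p.

    Returns the acceptance rate observed at every round.
    """
    base = [0.0] * len(p)   # base iterate (starts as a uniform draft)
    hint = [0.0] * len(p)   # previous gradient, used as the optimistic hint
    history = []
    for _ in range(steps):
        # Draft commits: play the base iterate shifted by the hint.
        if optimistic:
            logits = [z - lr * h for z, h in zip(base, hint)]
        else:
            logits = list(base)
        q = softmax(logits)
        history.append(acceptance_rate(p, q))
        # Feedback: verification exposes the target probabilities p, and the
        # gradient of cross-entropy CE(p, q) w.r.t. the logits is q - p.
        grad = [qi - pi for qi, pi in zip(q, p)]
        # Draft adapts: gradient step on the base iterate; remember the hint.
        base = [z - lr * g for z, g in zip(base, grad)]
        hint = grad
    return history


if __name__ == "__main__":
    target = [0.7, 0.2, 0.05, 0.05]
    rates = run_online_draft(target)
    print(f"acceptance rate: first round {rates[0]:.3f}, last round {rates[-1]:.3f}")
    print("expected tokens per round (gamma=4):",
          expected_tokens_per_round(rates[0], 4), "->",
          expected_tokens_per_round(rates[-1], 4))
```

In this static toy setting the hint brings little benefit; optimistic updates are usually motivated by losses that drift predictably over time, which a real serving workload with changing prompts is more likely to exhibit.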
