2603.27027v1 Mar 27, 2026 cs.CL

TAPS: Task Aware Proposal Distributions for Speculative Sampling

M. Zbib (Citations: 92, h-index: 6)
M. Bazzi (Citations: 2, h-index: 1)
Ammar Mohanna (Citations: 25, h-index: 2)
Hasan Hammoud (Citations: 3,387, h-index: 15)
Bernard Ghanem (Citations: 20, h-index: 2)

Original Abstract

Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
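
The two routing signals compared in the abstract, confidence and entropy, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the drafter labels `math` and `chat`, and the choice to route per drafting step on the top-1 probability are hypothetical, and the abstract does not specify how the authors compute or aggregate these signals. Acceptance length is taken here as the mean number of tokens accepted per verification step; the abstract does not state whether the paper also counts the extra token supplied by the target model.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(probs: Sequence[float]) -> float:
    """Confidence signal, assumed here to be the top-1 probability."""
    return max(probs)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_by_confidence(drafter_logits: Dict[str, Sequence[float]]) -> str:
    """Pick the drafter whose next-token distribution is most confident.

    `drafter_logits` maps a hypothetical drafter name to its logits
    for the same drafting position.
    """
    scores = {name: confidence(softmax(logits))
              for name, logits in drafter_logits.items()}
    return max(scores, key=scores.get)


def mean_acceptance_length(accepted_per_step: Sequence[int]) -> float:
    """Average number of draft tokens accepted per verification step (assumed definition)."""
    return sum(accepted_per_step) / len(accepted_per_step)


if __name__ == "__main__":
    # Toy logits: the "math" drafter is peaked, the "chat" drafter is uniform.
    logits = {"math": [4.0, 0.0, 0.0], "chat": [1.0, 1.0, 1.0]}
    print(route_by_confidence(logits))
    print(mean_acceptance_length([3, 1, 2]))
```

In this toy setup the peaked distribution has both higher confidence and lower entropy, so the two signals agree. The abstract's claim is about cases where they diverge in usefulness: rejected tokens tend to have higher entropy, yet confidence separates benchmarks more clearly.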
