2604.02047v1 Apr 02, 2026 cs.CL

GOOSE: An Anisotropic Speculation Structure That Speeds Up Inference Without Training

Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Tao Jin

Citations: 12

h-index: 2

P. Nguyen

Citations: 4

h-index: 1

Naoya Inoue

Citations: 42

h-index: 4

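Speculative decoding accelerates large language model inference by drafting multiple candidate tokens in advance and verifying them in a single forward pass. The candidates are organized as a tree: a deeper tree can accept more tokens per step, but under a fixed verification budget, adding depth means giving up breadth (fallback alternatives). Existing training-free methods draft candidates from a single token source and decide the tree shape without distinguishing candidate quality. We observe a large gap in acceptance rate between two common training-free token sources, n-gram matches copied from the input context and statistical predictions obtained from earlier forward passes (a median gap of about 6x, ranging from 2x to 18x, across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain and unreliable tokens should spread out as wide branches, which breaks through the depth limit of balanced trees. We implement this structure in GOOSE, a training-free framework that builds an adaptive spine tree: a deep chain of high-acceptance context-matched tokens, with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens GOOSE accepts per step is at least as large as with either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves a 1.9-4.3x lossless speedup and outperforms balanced-tree baselines by 12-33% under the same budget.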
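To make the first token source concrete (n-gram matches copied from the input context), here is a minimal Python sketch of a context-copy drafter. It finds the most recent earlier occurrence of the last `n` tokens and copies what followed it. The function name `ngram_context_draft` and the parameters `n` and `max_len` are hypothetical and chosen for illustration; this is not GOOSE's implementation, only one simple way such a drafter could work.

```python
from typing import List, Sequence


def ngram_context_draft(tokens: Sequence[int], n: int = 3, max_len: int = 8) -> List[int]:
    """Draft up to `max_len` tokens by copying from the existing context.

    Looks for the most recent earlier occurrence of the final `n` tokens
    and returns the tokens that followed that occurrence.
    Returns an empty list when the context is too short or nothing matches.
    """
    if n <= 0 or len(tokens) <= n:
        return []
    suffix = list(tokens[-n:])
    # Scan right to left so the most recent match wins; the starting index
    # excludes the suffix's own position at the end of the sequence.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == suffix:
            return list(tokens[start + n:start + n + max_len])
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
    print(ngram_context_draft(context, n=3, max_len=3))
```

In a spine tree as described in the summary, tokens drafted this way would be the natural candidates for the deep chain, because they are the high-acceptance source.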

Original Abstract

Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
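The depth-versus-breadth trade-off under a fixed verification budget can be illustrated with a toy calculation. The sketch below is not the paper's analysis. It assumes a simplified spine tree in which each spine token is accepted independently with probability `p_spine`, each spine position has `width` sibling alternatives that are leaves, each alternative is accepted independently with probability `q_branch`, and the alternatives are consulted only when the spine token at that position is rejected. The function names, the independence assumptions, and the acceptance rates in the demo are all hypothetical; the rates are made-up values with a 6x ratio, not measurements from the paper. The bonus token that verification normally contributes is not counted.

```python
from typing import Tuple


def expected_accepted(depth: int, width: int, p_spine: float, q_branch: float) -> float:
    """Expected drafted tokens accepted per step for the toy spine tree."""
    # Probability that at least one of `width` sibling alternatives is accepted
    # (independence between alternatives is a simplifying assumption).
    fallback = 1.0 - (1.0 - q_branch) ** width
    total = 0.0
    reach = 1.0  # probability that all earlier spine tokens were accepted
    for _ in range(depth):
        # Gain one token if the spine token is accepted, or if it is
        # rejected and a sibling alternative is accepted instead.
        total += reach * (p_spine + (1.0 - p_spine) * fallback)
        reach *= p_spine
    return total


def best_spine(budget: int, p_spine: float, q_branch: float) -> Tuple[int, int, float]:
    """Search spine depths under a node budget; returns (depth, width, score)."""
    best = (0, 0, 0.0)
    for depth in range(1, budget + 1):
        width = budget // depth - 1  # ensures depth * (1 + width) <= budget
        score = expected_accepted(depth, width, p_spine, q_branch)
        if score > best[2]:
            best = (depth, width, score)
    return best


if __name__ == "__main__":
    budget = 32
    p, q = 0.6, 0.1  # illustrative rates only
    depth, width, score = best_spine(budget, p, q)
    chain_only = expected_accepted(budget, 0, p, q)
    print(f"best split: depth={depth}, width={width}, expected={score:.3f}")
    print(f"chain only: depth={budget}, width=0, expected={chain_only:.3f}")
```

Because the search includes the pure chain (`depth == budget`, `width == 0`), the best split in this toy model can never score below a chain built from the reliable source alone. In this model, raising `p_spine` relative to `q_branch` moves the best split toward a deeper spine.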

0 Citations

0 Influential

2 Altmetric

10.0 Score

