2603.12038v1 Mar 12, 2026 cs.LG

Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

Xingyu Xie (Citations: 26, h-index: 4)
Zhaocheng Yu (Citations: 2, h-index: 1)
Yue Liao (Citations: 217, h-index: 5)
Tao Wang (Citations: 0, h-index: 0)
Kim-Chuan Toh (Citations: 37, h-index: 4)
Shuicheng Yan (Citations: 289, h-index: 6)

Abstract

Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.
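
The slow/fast split described in the abstract can be illustrated with a toy, single-head sketch in plain Python. This is not the authors' implementation: the abstract does not specify how the Selector ranks positions, how semantic boundaries are detected, or how the sparse memory is laid out. The sketch therefore assumes a top-k-by-attention-weight selector (`select_top_k`), a sentence-final-punctuation trigger (`is_boundary_token`), and a small recent-token window (`window`); all of these names and choices are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Single-head attention restricted to `positions`; returns (output, weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, keys[p])) / scale for p in positions]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * values[p][d] for w, p in zip(weights, positions)) for d in range(dim)]
    return out, weights


def select_top_k(positions, weights, k):
    """Hypothetical selector: keep the k positions with the largest attention weight."""
    ranked = sorted(zip(weights, positions), reverse=True)
    return sorted(p for _, p in ranked[:k])


def is_boundary_token(token):
    """Assumption: sentence-final punctuation stands in for a semantic boundary."""
    return token.endswith((".", "?", "!"))


def slow_fast_step(query, keys, values, memory, is_boundary, k, window):
    """Run one decoding step and return (output, memory, kind)."""
    n = len(keys)
    if is_boundary or memory is None:
        # Slow step: dense attention over the full history, then refresh the memory.
        positions = list(range(n))
        out, weights = attend(query, keys, values, positions)
        return out, select_top_k(positions, weights, k), "slow"
    # Fast step: reuse the selected memory plus a short window of recent tokens.
    recent = range(max(0, n - window), n)
    positions = sorted(set(memory) | set(recent))
    out, _ = attend(query, keys, values, positions)
    return out, memory, "fast"


if __name__ == "__main__":
    tokens = ["The", "cache", "grows.", "Fast", "steps", "reuse", "it."]
    keys, values, memory = [], [], None
    for i, tok in enumerate(tokens):
        vec = [math.sin(i + 1), math.cos(i + 1)]  # toy stand-in for key/value/query
        keys.append(vec)
        values.append(vec)
        # Trigger a slow step right after a sentence-ending token.
        boundary = i > 0 and is_boundary_token(tokens[i - 1])
        _, memory, kind = slow_fast_step(vec, keys, values, memory, boundary, k=2, window=2)
        print(tok, kind, memory)
```

In this sketch a fast step touches at most `k + window` cached positions regardless of history length, while a slow step touches all of them. That is the intuition behind the throughput gain the abstract reports, though the actual speedup depends on details the abstract does not give.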
