2601.03969v1 Jan 07, 2026 cs.AI

Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models

Wei Wu (Citations: 2, h-index: 1)
Liyi Chen (Citations: 300, h-index: 7)
Congxi Xiao (Citations: 255, h-index: 5)
Qimeng Wang (Citations: 93, h-index: 4)
Chengqiang Lu (Citations: 96, h-index: 3)
Yan Gao (Citations: 38, h-index: 4)
Yi Wu (Citations: 31, h-index: 4)
Yao Hu (Citations: 22, h-index: 3)
Tian Wang (Citations: 103, h-index: 3)
Hui Xiong (Citations: 30, h-index: 1)

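Large reasoning models strengthened by reinforcement learning with verifiable rewards have achieved substantial performance gains by extending their chain-of-thought. However, this paradigm incurs considerable deployment costs, because models often produce excessively verbose answers even to simple questions. Existing efficient-reasoning methods that rely on explicit length penalties often cause optimization conflicts and do not adequately address the generative mechanisms that drive overthinking. This paper identifies a phenomenon called "length shift," in which a model generates more and more unnecessary reasoning on trivial inputs over the course of training. To address it, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. The method targets only the extremely long responses within rollout groups in which every rollout reached the correct answer, while preserving long-horizon reasoning ability on complex problems. To complement this intervention and ensure stable convergence, we additionally integrate auxiliary KL regularization and predictive dynamic sampling. Experiments across multiple model scales show that our approach significantly extends the efficiency-performance Pareto frontier. In particular, on the AIME-24 benchmark, our method reduces inference token usage by 78% while improving accuracy over the initial policy, and it outperforms existing state-of-the-art efficient-reasoning methods.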

Original Abstract

Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on the AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.
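The abstract describes DOT only at a high level: it acts on the extreme tail of response lengths, and only inside rollout groups where every rollout is correct. A minimal Python sketch of that idea follows. The abstract does not state the outlier rule, so the cutoff used here (group mean plus `k` standard deviations) is an assumption for illustration, as are the names `Rollout`, `outlier_threshold`, and `truncate_outliers`. How truncated rollouts are rewarded afterwards, and the auxiliary KL regularization and predictive dynamic sampling components, are not modeled.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Rollout:
    tokens: List[int]        # generated token ids
    correct: bool            # verifier outcome for this rollout
    truncated: bool = False  # set when this sketch cuts the response


def outlier_threshold(lengths: List[int], k: float = 1.0) -> float:
    # ASSUMPTION: mean + k * population std stands in for the paper's
    # (unspecified here) definition of the "extreme tail".
    return mean(lengths) + k * pstdev(lengths)


def truncate_outliers(group: List[Rollout], k: float = 1.0) -> List[Rollout]:
    # Intervene only when every rollout in the group is correct, so
    # groups for harder prompts (mixed outcomes) are left untouched.
    if not group or not all(r.correct for r in group):
        return group
    lengths = [len(r.tokens) for r in group]
    cutoff = int(outlier_threshold(lengths, k))
    result = []
    for r in group:
        if len(r.tokens) > cutoff:
            # Cut only the length outliers; shorter rollouts pass through.
            result.append(Rollout(r.tokens[:cutoff], r.correct, truncated=True))
        else:
            result.append(r)
    return result
```

In this sketch a group with any incorrect rollout is returned unchanged, which mirrors the abstract's claim that long-horizon reasoning on complex problems is preserved; the value of `k` is a free parameter of the illustration, not a setting from the paper.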

Citations: 1, Influential citations: 0, Altmetric: 3.5, Score: 18.5
