2603.17310v1 Mar 18, 2026 cs.AI

InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

Longyin Zhang

Cheng Wei

Jung-jae Kim

Sheng Chen

Nancy F. Chen

Original Abstract

Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computational cost. While existing reinforcement learning approaches address this by optimizing final response length, they neglect the quality of intermediate reasoning steps, leaving models vulnerable to reward hacking. We argue that verbosity is not merely a length problem, but a symptom of poor intermediate reasoning quality. To investigate this, we conduct an empirical study tracking the conditional entropy of the answer distribution across reasoning steps. We find that high-quality reasoning traces exhibit two consistent properties: low uncertainty convergence and monotonic progress. These findings suggest that high-quality reasoning traces are informationally dense, that is, each step contributes meaningful entropy reduction relative to the total reasoning length. Motivated by this, we propose InfoDensity, a reward framework for RL training that combines an AUC-based reward and a monotonicity reward as a unified measure of reasoning quality, weighted by a length scaling term that favors achieving equivalent quality more concisely. Experiments on mathematical reasoning benchmarks demonstrate that InfoDensity matches or surpasses state-of-the-art baselines in accuracy while significantly reducing token usage, achieving a strong accuracy-efficiency trade-off.
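The abstract names three ingredients (an AUC-based reward, a monotonicity reward, and a length scaling term) but does not give their formulas. The sketch below is one possible reading, not the authors' implementation: it assumes a per-step list of conditional entropies of the answer distribution is already available, and every function name, the normalization by `max_entropy`, the mixing weight `alpha`, and the `ref_tokens / num_tokens` scaling are hypothetical choices made only for illustration.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def auc_reward(entropies: Sequence[float], max_entropy: float) -> float:
    """Assumed form: 1 minus the normalized trapezoidal area under the
    entropy curve, so traces that reach low uncertainty early score higher."""
    if len(entropies) < 2 or max_entropy <= 0.0:
        return 0.0
    area = sum((a + b) / 2.0 for a, b in zip(entropies, entropies[1:]))
    normalized = area / (max_entropy * (len(entropies) - 1))
    return 1.0 - min(1.0, normalized)


def monotonicity_reward(entropies: Sequence[float], tol: float = 1e-9) -> float:
    """Assumed form: fraction of steps whose entropy does not increase."""
    steps = list(zip(entropies, entropies[1:]))
    if not steps:
        return 0.0
    return sum(1 for a, b in steps if b <= a + tol) / len(steps)


def length_scale(num_tokens: int, ref_tokens: int) -> float:
    """Hypothetical scaling: 1.0 up to ref_tokens, then ref_tokens / num_tokens."""
    if num_tokens <= 0:
        return 0.0
    return min(1.0, ref_tokens / num_tokens)


def info_density_reward(
    entropies: Sequence[float],
    max_entropy: float,
    num_tokens: int,
    ref_tokens: int,
    alpha: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> float:
    """Combine the two quality terms, then weight by the length term."""
    quality = (
        alpha * auc_reward(entropies, max_entropy)
        + (1.0 - alpha) * monotonicity_reward(entropies)
    )
    return quality * length_scale(num_tokens, ref_tokens)


if __name__ == "__main__":
    # Toy trace over four candidate answers: uncertainty halves, then vanishes.
    h_max = shannon_entropy([0.25] * 4)
    trace = [h_max, shannon_entropy([0.5, 0.5]), 0.0]
    print(info_density_reward(trace, h_max, num_tokens=100, ref_tokens=200))
    print(info_density_reward(trace, h_max, num_tokens=400, ref_tokens=200))
```

Under these assumptions, two traces with the same entropy curve get the same quality score, and the longer one is penalized only through the length term, which is the behavior the abstract describes as favoring equivalent quality achieved more concisely.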
