2603.10359v1 Mar 11, 2026 cs.AI

HEAL: Hindsight Entropy-Assisted Learning for Reasoning Distillation

Authors:
Shiguo Lian (Citations: 309, h-index: 6)
Zhaoxiang Liu (Citations: 140, h-index: 5)
Kai Wang (Citations: 252, h-index: 5)
Wenjing Zhang (Citations: 208, h-index: 5)
Jie-fu Huang (Citations: 147, h-index: 2)
Yi Shen (Citations: 18, h-index: 2)
Ping Chen (Citations: 6, h-index: 1)
Ning Wang (Citations: 170, h-index: 4)
Jiangze Yan (Citations: 143, h-index: 1)
Shuming Shi (Citations: 148, h-index: 2)

Abstract

Distilling reasoning capabilities from Large Reasoning Models (LRMs) into smaller models is typically constrained by the limitations of rejection sampling. Standard methods treat the teacher as a static filter, discarding complex "corner-case" problems for which the teacher fails to find a valid solution on its own, thereby creating an artificial "Teacher Ceiling" for the student. In this work, we propose Hindsight Entropy-Assisted Learning (HEAL), an RL-free framework designed to bridge this reasoning gap. Drawing on the educational theory of the Zone of Proximal Development (ZPD), HEAL combines three core modules: (1) Guided Entropy-Assisted Repair (GEAR), an active intervention mechanism that detects critical reasoning breakpoints via entropy dynamics and injects targeted hindsight hints to repair broken trajectories; (2) Perplexity-Uncertainty Ratio Estimator (PURE), a rigorous filtering protocol that decouples genuine cognitive breakthroughs from spurious shortcuts; and (3) Progressive Answer-guided Curriculum Evolution (PACE), a three-stage distillation strategy that organizes training from foundational alignment to frontier breakthrough. Extensive experiments on multiple benchmarks demonstrate that HEAL significantly outperforms traditional SFT distillation and other baselines.
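
The abstract says only that GEAR detects reasoning breakpoints "via entropy dynamics" and does not give the detection rule. As a rough illustration of what such a signal could look like, the sketch below computes the Shannon entropy of each next-token distribution and flags the first step where entropy jumps above a short trailing average. The function names, the window size, and the jump threshold are all hypothetical choices made for this example and are not taken from the paper.

```python
import math
from typing import Optional, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_entropy_spike(
    entropies: Sequence[float],
    window: int = 3,    # hypothetical: number of trailing steps used as baseline
    jump: float = 0.5,  # hypothetical: minimum rise over the baseline, in nats
) -> Optional[int]:
    """Return the first index whose entropy exceeds the mean of the
    preceding `window` steps by more than `jump`, or None if no step does.

    This is an assumed heuristic for illustration, not the paper's rule.
    """
    for i in range(window, len(entropies)):
        baseline = sum(entropies[i - window:i]) / window
        if entropies[i] - baseline > jump:
            return i
    return None


# Toy trajectory: three confident steps, one uncertain step, one confident step.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
steps = [confident, confident, confident, uncertain, confident]

entropies = [token_entropy(p) for p in steps]
breakpoint_index = first_entropy_spike(entropies)
print(breakpoint_index)  # 3: the uniform distribution is the flagged step
```

In a setting like the one the abstract describes, the flagged index would mark where a hint could be injected into the teacher's trajectory; how HEAL actually selects that point and phrases the hint is not stated in the abstract.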

Citations: 1, Influential citations: 0, Altmetric: 3, Score: 16.0
