2605.07139v1 May 08, 2026 cs.CL

Structural Rationale Distillation via Reasoning Space Compression

Jiankun Wang (Citations: 257, h-index: 8)

Jialin Yang (Citations: 5, h-index: 1)

Jiajun Wu (Citations: 11, h-index: 2)

Henry Leung (Citations: 150, h-index: 4)

Jiayu Zhou (Citations: 6, h-index: 1)

Steve Drew (Citations: 7, h-index: 2)

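When distilling reasoning from large language models (LLMs) into smaller models, the teacher model's rationales for similar problems often differ substantially in structure and strategy. Like a chef who makes the same dish differently each time, this inconsistency gives the student model noisy supervision and makes learning harder. This paper proposes Distillation through Reasoning Path Compression (D-RPC), which forces the teacher model to follow a limited set of reusable high-level reasoning paths. For each training question, D-RPC retrieves the most relevant path and guides the teacher to follow it, producing rationales that stay consistent across similar problems while still covering diverse problem types. A PAC-Bayes analysis formalizes the trade-off between the size of the path set and its coverage: a smaller set reduces supervision entropy but risks coverage gaps, and the generalization bound points to an optimal intermediate size, which the experiments confirm. Across five math and commonsense reasoning benchmarks with two student models, D-RPC consistently outperforms existing methods such as chain-of-thought distillation, freeform rationale generation, direct distillation, and structured supervision, while using fewer tokens than template-based methods.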

Original Abstract

When distilling reasoning from large language models (LLMs) into smaller ones, teacher rationales for similar problems often vary wildly in structure and strategy. Like a chef who makes the same dish differently each time, this inconsistency burdens the student with noisy supervision that is hard to internalize. We propose Distillation through Reasoning Path Compression (D-RPC), which constrains the teacher to follow a compact, dynamically maintained bank of reusable high-level reasoning paths. For each training question, D-RPC retrieves the most relevant path and conditions the teacher to follow it, producing rationales that are consistent across similar problems yet diverse enough to cover different problem types. A PAC-Bayes analysis formalizes the resulting trade-off between bank size and coverage: smaller banks reduce supervision entropy but risk coverage gaps, and the generalization bound identifies an optimal intermediate size confirmed by our ablations. Across five math and commonsense reasoning benchmarks with two student models, D-RPC consistently outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines, while using fewer tokens than template-heavy alternatives.
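The retrieve-then-condition step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the contents of `PATH_BANK`, the bag-of-words cosine similarity used for retrieval, the prompt wording, and the helper names `retrieve_path` and `build_teacher_prompt`. The bank is also static here, whereas the paper describes a dynamically maintained one.

```python
import math
import re
from collections import Counter

# Hypothetical bank of high-level reasoning paths (illustrative only;
# the paper's actual paths and bank-maintenance procedure are not shown here).
PATH_BANK = {
    "rate_problem": "Identify the quantities and the rate; set up rate x time = amount; "
                    "solve for the unknown; check units.",
    "percent_change": "Find the original value; compute the change; "
                      "divide change by original; convert to percent.",
    "commonsense_elimination": "List the answer options; rule out options that contradict "
                               "everyday knowledge; pick the remaining option.",
}


def tokenize(text):
    """Lowercase bag-of-words counts (a stand-in for a real text encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_path(question, bank):
    """Return the name of the bank entry most similar to the question."""
    q = tokenize(question)
    return max(bank, key=lambda name: cosine(q, tokenize(bank[name])))


def build_teacher_prompt(question, bank):
    """Condition the teacher on the retrieved path (prompt format is assumed)."""
    name = retrieve_path(question, bank)
    prompt = (
        f"Follow this reasoning path:\n{bank[name]}\n\n"
        f"Question: {question}\nRationale:"
    )
    return name, prompt


if __name__ == "__main__":
    question = "A price rises from 80 to 100. What is the percent change from the original value?"
    name, prompt = build_teacher_prompt(question, PATH_BANK)
    print(name)
    print(prompt)
```

In this sketch, similar questions map to the same path and therefore receive the same structural guidance, which is the consistency property the abstract emphasizes. A practical system would likely replace the word-count similarity with learned embeddings.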
