2601.06911v1 Jan 11, 2026 cs.CL

Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models

Haifeng Wang

Shaoning Sun

Mingzhu Cai

Baidu Inc.

H. He

Bingjin Chen

Siqi Bao

Yujiu Yang

Hua Wu

Original Abstract

Language model families exhibit striking disparity in their capacity to benefit from reinforcement learning: under identical training, models like Qwen achieve substantial gains, while others like Llama yield limited improvements. Complementing data-centric approaches, we reveal that this disparity reflects a hidden structural property: **distributional clarity** in probability space. Through a three-stage analysis, from phenomenon to mechanism to interpretation, we uncover that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. We quantify this clarity using the **Silhouette Coefficient** ($S$) and demonstrate that (1) high $S$ correlates strongly with RL performance; (2) low $S$ is associated with severe logic errors and reasoning instability. To confirm this property, we introduce a Silhouette-Aware Reweighting strategy that prioritizes low-$S$ samples during training. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24. Our work establishes distributional clarity as a fundamental, trainable property underlying RL-Friendliness.
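
To make the quantity $S$ concrete, here is a minimal sketch of a two-class silhouette coefficient. The abstract does not specify how responses are scored or how the reweighting is defined, so several pieces are assumptions: each sampled response is taken to have one scalar score (for example a length-normalized log-probability), the standard silhouette definition $s_i = (b_i - a_i)/\max(a_i, b_i)$ is used, and the function names `silhouette_two_class` and `low_s_weights`, as well as the linear weighting form and the `alpha` parameter, are hypothetical illustrations rather than the paper's method.

```python
import numpy as np


def silhouette_two_class(scores, is_correct):
    """Mean silhouette coefficient of scalar scores split into correct/incorrect.

    Assumption: one scalar score per sampled response; the paper's exact
    scoring and distance choices are not given in the abstract.
    """
    x = np.asarray(scores, dtype=float)
    y = np.asarray(is_correct, dtype=bool)
    if y.all() or not y.any():
        raise ValueError("need at least one correct and one incorrect response")

    dist = np.abs(x[:, None] - x[None, :])   # pairwise |x_i - x_j|
    same = y[:, None] == y[None, :]          # True where i and j share a class
    n_same = same.sum(axis=1) - 1            # class size excluding the point itself
    n_other = (~same).sum(axis=1)            # size of the opposite class (always > 0)

    # a_i: mean distance to the rest of the own class (self-distance is 0).
    a = (dist * same).sum(axis=1) / np.maximum(n_same, 1)
    # b_i: mean distance to the opposite class.
    b = (dist * ~same).sum(axis=1) / n_other

    denom = np.maximum(a, b)
    s = (b - a) / np.where(denom > 0, denom, 1.0)
    # Points alone in their class, or with a zero denominator, get s_i = 0.
    s = np.where((n_same > 0) & (denom > 0), s, 0.0)
    return float(s.mean())


def low_s_weights(s_per_prompt, alpha=1.0):
    """Hypothetical reweighting: prompts with lower S get larger weight.

    S lies in [-1, 1], so the extra weight lies in [0, alpha]; weights are
    rescaled to have mean 1.
    """
    s = np.asarray(s_per_prompt, dtype=float)
    w = 1.0 + alpha * (1.0 - s) / 2.0
    return w / w.mean()


# Well-separated scores: correct responses cluster far from incorrect ones.
clear = silhouette_two_class(
    [-1.0, -1.1, -0.9, -5.0, -5.2, -4.8],
    [True, True, True, False, False, False],
)
# Interleaved scores: the two classes overlap in probability space.
mixed = silhouette_two_class(
    [-1.0, -3.0, -5.0, -2.0, -4.0, -6.0],
    [True, True, True, False, False, False],
)
weights = low_s_weights([clear, mixed])
```

In this toy setup the well-separated prompt yields an $S$ close to 1 and the interleaved prompt an $S$ below 0, so the interleaved prompt receives the larger weight. That matches the direction described in the abstract (prioritizing low-$S$ samples), though the actual weighting function may differ.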
