2604.16027v1 Apr 17, 2026 cs.CL

Where does output diversity collapse in post-training?

Constantinos F. Karouzos

Xingwei Tan

Nikolaos Aletras

University of Sheffield

Original Abstract

Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3, Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
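
The decomposition described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the diversity metric (mean pairwise Jaccard distance over word sets), the function names `diversity` and `decompose_loss`, the `is_correct` verifier, and the exact way the loss is split. Here the quality-control component is the diversity the base model loses when its incorrect outputs are filtered out, and the residual component is the further narrowing between the base model's correct outputs and the post-trained model's correct outputs.

```python
from itertools import combinations
from typing import Callable, Dict, List


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def diversity(outputs: List[str]) -> float:
    """Mean pairwise Jaccard distance; 0.0 when fewer than two outputs.

    Assumption: a simple lexical stand-in for the paper's diversity metrics.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def decompose_loss(
    base: List[str],
    post: List[str],
    is_correct: Callable[[str], bool],
) -> Dict[str, float]:
    """Split diversity loss into quality-control and residual parts.

    Assumption: one possible reading of the abstract, not the paper's formula.
    """
    base_correct = [o for o in base if is_correct(o)]
    post_correct = [o for o in post if is_correct(o)]
    d_base = diversity(base)
    d_base_correct = diversity(base_correct)
    d_post_correct = diversity(post_correct)
    return {
        # Total loss, measured against the post-trained model's correct outputs.
        "total": d_base - d_post_correct,
        # Loss explained purely by discarding incorrect base outputs.
        "quality_control": d_base - d_base_correct,
        # Remaining narrowing among correct outputs.
        "residual": d_base_correct - d_post_correct,
    }


if __name__ == "__main__":
    # Toy samples for the prompt "What is 2 + 2?" (hypothetical data).
    base_samples = ["the answer is 4", "maybe 5", "4", "two plus two gives 4"]
    post_samples = ["the answer is 4", "the answer is 4"]
    verifier = lambda o: "4" in o.split()
    print(decompose_loss(base_samples, post_samples, verifier))
```

Under this definition the two components sum to the total by construction, so a task where filtering incorrect base outputs already removes most of the diversity shows a large quality-control share, while a task where correct answers themselves become more uniform shows a large residual share.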
