2601.20838v2 Jan 28, 2026 cs.LG

Reward Models Inherit Value Biases from Pretraining

Brian Christian (Citations: 44, h-index: 3)
J. Thompson (Citations: 38, h-index: 3)
V. Adam (Citations: 3, h-index: 1)
Hannah Rose Kirk (Citations: 164, h-index: 7)
Chris Summerfield (Citations: 186, h-index: 3)
T. Dumbalska (Citations: 472, h-index: 9)
Elle Michelle Yang, University of Oxford (Citations: 41, h-index: 2)

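Reward models (RMs) play a central role in aligning large language models (LLMs) with human values, yet they have received less attention than pretrained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence have not been studied enough. In a comprehensive analysis of 10 leading open-weight RMs using validated psycholinguistic corpora, the study finds that RMs differ substantially along several dimensions of human value depending on their base model. On the "Big Two" psychological axes, Llama RMs show a strong preference for "agency," while Gemma RMs show a strong preference for "communion." The pattern appears even when the preference data and finetuning process are identical, and the authors trace it back to the logits of the respective instruction-tuned and pretrained models. These log-probability differences can themselves be formulated as an implicit RM; the implicit reward scores derived this way show the same agency/communion difference. Experiments that train RMs while varying the source and quantity of preference data show that the effect is not only repeatable but surprisingly durable. Although RMs are designed to represent human preferences, the evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. The work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that when open-source developers choose a base model, values are as much a consideration as performance.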

Original Abstract

Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pretrained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the "Big Two" psychological axes, we show a robust preference of Llama RMs for "agency" and a corresponding robust preference of Gemma RMs for "communion." This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pretrained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that open-source developers' choice of base model is as much a consideration of values as of performance.
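
The abstract's remark that log-probability differences between an instruction-tuned model and its pretrained base can be read as an implicit RM can be illustrated with a minimal sketch. The form used here, a scaled difference of sequence log-probabilities, is an assumption borrowed from the common DPO-style implicit reward; the abstract does not state the authors' exact formula. The function names, the `beta` scale, and the toy numbers are all hypothetical.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Sum per-token log-probabilities into one sequence log-probability."""
    return float(sum(token_logprobs))


def implicit_reward(
    tuned_token_logprobs: Sequence[float],
    base_token_logprobs: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Assumed implicit reward: beta * (log p_tuned(y|x) - log p_base(y|x)).

    Both inputs are the log-probabilities each model assigns to the SAME
    response tokens, so they must have equal length.
    """
    if len(tuned_token_logprobs) != len(base_token_logprobs):
        raise ValueError("both models must score the same token sequence")
    tuned = sequence_logprob(tuned_token_logprobs)
    base = sequence_logprob(base_token_logprobs)
    return beta * (tuned - base)


# Toy per-token log-probs for one three-token response (made-up values).
tuned = [-0.5, -1.0, -0.25]   # hypothetical instruction-tuned model
base = [-1.0, -1.5, -0.75]    # hypothetical pretrained base model

reward = implicit_reward(tuned, base)
print(reward)  # 1.5
```

A positive score means the instruction-tuned model assigns the response more probability than its base does. Under this assumed formulation, comparing such scores across "agency" and "communion" items would be one way to probe the difference the abstract describes.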

2 Citations

0 Influential

4.5 Altmetric

24.5 Score
