2604.18239v1 Apr 20, 2026 cs.LG

Towards Disentangled Preference Optimization Dynamics Beyond Likelihood Displacement

Qibin Zhao (Citations: 18, h-index: 2)

Junmei Yang (Citations: 41, h-index: 4)

Delu Zeng (Citations: 64, h-index: 5)

John W. Paisley (Citations: 61, h-index: 4)

Min Chen (Citations: 6, h-index: 2)

Zhou Wang (Citations: 13, h-index: 2)

Wei Chen (Citations: 75, h-index: 5)

Yubing Wu (Citations: 109, h-index: 3)

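Preference optimization is widely used to align large language models (LLMs) with human preferences. Many margin-based objectives, however, tend to suppress the chosen response along with the rejected one, which is called likelihood displacement, and so far there is no general mechanism that prevents it. This work presents a unified *incentive-score decomposition* of preference optimization and shows that diverse objectives share the same local update direction and differ only in their scalar weighting coefficients. Based on this decomposition, the authors analyze the dynamics of the chosen and rejected likelihoods and state a simple, testable condition called the *disentanglement band (DB)*. The condition describes when training can avoid likelihood displacement and follow the preferred pathway, in which the loser is suppressed while the winner is maintained, possibly after an initial transient. Using the DB, they propose *reward calibration (RC)*, which adaptively rebalances the chosen and rejected updates so that the DB is satisfied and likelihood displacement is mitigated, without redesigning the base objective. Empirical results show that RC steers training toward more disentangled dynamics and tends to improve downstream performance across a range of objectives. The code is available at https://github.com/IceyWuu/DisentangledPreferenceOptimization.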

Original Abstract

Preference optimization is widely used to align large language models (LLMs) with human preferences. However, many margin-based objectives suppress the chosen response along with the rejected one, a phenomenon known as likelihood displacement, and no general mechanism currently prevents this across objectives. We bridge this gap by presenting a unified *incentive-score decomposition* of preference optimization, revealing that diverse objectives share identical local update directions and differ only in their scalar weighting coefficients. Building on this decomposition, by analyzing the dynamics of the chosen/rejected likelihoods, we identify the *disentanglement band* (DB), a simple, testable condition that characterizes when training can avoid likelihood displacement by realizing the preferred pathway: suppressing the loser while maintaining the winner, possibly after an initial transient. Leveraging the DB, we propose a plug-and-play *reward calibration* (RC) that adaptively rebalances chosen versus rejected updates to satisfy the DB and mitigate likelihood displacement, without redesigning the base objective. Empirical results show that RC steers training toward more disentangled dynamics and often improves downstream performance across a range of objectives. Our code is available at https://github.com/IceyWuu/DisentangledPreferenceOptimization.
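The abstract does not give the formulas behind the decomposition, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the usual margin m = [log π(y_w|x) - log π_ref(y_w|x)] - [log π(y_l|x) - log π_ref(y_l|x)] and uses the textbook DPO, IPO, and hinge losses. For each of them the descent direction is a scalar weight w(m) times the same vector, ∇log π(y_w|x) - ∇log π(y_l|x). The function names and default hyperparameters are hypothetical, and `chosen_displaced` is an illustrative check rather than the paper's DB condition.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_weight(margin: float, beta: float = 0.1) -> float:
    """-dL/dm for the DPO loss L = -log(sigmoid(beta * m))."""
    return beta * sigmoid(-beta * margin)


def ipo_weight(margin: float, tau: float = 0.1) -> float:
    """-dL/dm for the IPO loss L = (m - 1/(2*tau))**2."""
    return 2.0 * (1.0 / (2.0 * tau) - margin)


def hinge_weight(margin: float, delta: float = 1.0) -> float:
    """-dL/dm for a hinge loss L = max(0, delta - m)."""
    return 1.0 if margin < delta else 0.0


def chosen_displaced(chosen_before: float, chosen_after: float,
                     margin_before: float, margin_after: float) -> bool:
    """Illustrative check (not the paper's DB): the margin grew,
    yet the chosen log-probability itself went down."""
    return margin_after > margin_before and chosen_after < chosen_before


if __name__ == "__main__":
    # Same direction for every objective; only the scalar weight changes.
    for m in (-1.0, 0.0, 2.0):
        print(m, dpo_weight(m), ipo_weight(m), hinge_weight(m))
```

In this simplified view all three objectives push the chosen log-probability up and the rejected one down along one shared direction, so the margin can grow even while the chosen log-probability falls, which is the situation `chosen_displaced` flags.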
