2602.11902v1 Feb 12, 2026 cs.LG

Mitigating Mismatch within Reference-based Preference Optimization

Suqin Yuan

Xin Yu

Jiyang Zheng

Lei Feng

Dadong Wang

Ivor W. Tsang

Tongliang Liu

Abstract

Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models, but its reliance on a reference policy introduces a critical tension. DPO weights each update relative to a reference, which stabilizes training by regularizing the updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. For these pairs, DPO prematurely attenuates the gradient as soon as the policy margin ($Δ_θ$) merely beats the reference margin ($Δ_{\mathrm{ref}}$), even if the policy is still wrong ($Δ_θ<0$). We name this failure premature satisfaction; it is a concrete form of the training-inference mismatch. Reference-free objectives remove this mismatch by optimizing the absolute margin, but at the cost of discarding the stabilizing signal of the reference. We mitigate this tension with Hybrid-DPO (HyPO), a drop-in modification to DPO that applies the reference conditionally: HyPO behaves exactly like DPO when the reference is optimistic or neutral ($Δ_{\mathrm{ref}}\ge 0$), and it treats the reference as neutral when it is pessimistic ($Δ_{\mathrm{ref}}<0$) by replacing $Δ_θ-Δ_{\mathrm{ref}}$ with $Δ_θ-\max\{0,Δ_{\mathrm{ref}}\}$. This one-line change strictly strengthens per-example learning signals on pessimistic pairs while preserving DPO's objective form and computational cost. By conditionally debiasing the pessimistic reference signal, HyPO mitigates premature satisfaction; empirically, across preference alignment, HyPO improves inference-aligned metrics and achieves higher pairwise win rates. Our results provide evidence that direct preference alignment could be enhanced by conditionally debiasing the reference signal, rather than discarding it.
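
The one-line change described in the abstract can be illustrated with a minimal per-example sketch in Python. It assumes the standard DPO loss $-\log σ(β(Δ_θ-Δ_{\mathrm{ref}}))$, where each margin is the log-probability of the chosen response minus that of the rejected response, and it assumes HyPO simply swaps the offset $Δ_{\mathrm{ref}}$ for $\max\{0,Δ_{\mathrm{ref}}\}$ inside the same loss. The function names, the temperature `beta`, and its default value are illustrative assumptions and are not taken from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_loss(delta_theta: float, delta_ref: float, beta: float = 0.1) -> float:
    """Standard per-example DPO loss on the margin difference."""
    return -log_sigmoid(beta * (delta_theta - delta_ref))


def hypo_loss(delta_theta: float, delta_ref: float, beta: float = 0.1) -> float:
    """Assumed HyPO loss: a pessimistic reference (delta_ref < 0) is clamped to 0."""
    return -log_sigmoid(beta * (delta_theta - max(0.0, delta_ref)))


def grad_weight(delta_theta: float, offset: float, beta: float = 0.1) -> float:
    """Magnitude of d(loss)/d(delta_theta) for -log sigmoid(beta * (delta_theta - offset))."""
    return beta * sigmoid(-beta * (delta_theta - offset))


if __name__ == "__main__":
    # Pessimistic pair: the reference prefers the rejected response (delta_ref < 0),
    # and the policy is still wrong (delta_theta < 0) but already beats the reference.
    d_theta, d_ref = -1.0, -2.0
    print("DPO  gradient weight:", grad_weight(d_theta, d_ref))
    print("HyPO gradient weight:", grad_weight(d_theta, max(0.0, d_ref)))
```

Under these assumptions the two losses coincide whenever $Δ_{\mathrm{ref}}\ge 0$. On a pessimistic pair the HyPO offset is larger than the DPO offset, so the sigmoid argument is smaller and the gradient weight is strictly larger, which matches the abstract's claim that the learning signal is strengthened exactly where DPO would otherwise be prematurely satisfied.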
