2601.22083v2 Jan 29, 2026 cs.LG

Latent Adversarial Regularization for Offline Preference Optimization

Sanmi Koyejo

Citations: 3,864

h-index: 23

Enyi Jiang

Citations: 31

h-index: 3

Yibo Zhang

Citations: 149

h-index: 5

Ying Xu

Citations: 4,367

h-index: 10

Andreas Haupt

Citations: 40

h-index: 4

N. Amato

Citations: 3

h-index: 1

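Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. Preference optimization for language models is especially hard, however, because similarity in token space does not necessarily imply semantic or behavioral similarity. To address this, we use latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing the divergence between the internal representations of the policy model and the reference model. Because latent representations have no explicit probability density, we adopt a GAN-inspired adversarial approach to minimize the divergence in latent space. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across a range of model architectures and tasks show consistent performance gains from latent-space regularization. Comparing the inferential biases induced by GANPO with those arising from token-level regularization, we further find that GANPO provides more robust structural feedback under distribution shift and noise, while keeping downstream performance comparable and adding only minor computational overhead.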

Original Abstract

Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.
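
The abstract gives no formulas, so the following is only a minimal sketch of how an adversarial latent-space penalty could be added to an offline preference loss. It makes several assumptions that are not from the paper: the base objective is taken to be a DPO-style loss, the discriminator is a single linear layer with a logistic output, reference-model latents are labelled "real" and policy latents "fake", and the penalty uses the non-saturating GAN form. All function names and the weight `lam` are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); note -log(sigmoid(z)) == softplus(-z)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss for one preference pair, given sequence log-probabilities.

    Assumption: DPO stands in for the 'existing offline preference
    optimization objectives' mentioned in the abstract.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return softplus(-beta * margin)


def disc_logit(h, w, b):
    """Hypothetical linear discriminator: logit = w . h + b."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def discriminator_loss(h_ref, h_pol, w, b):
    """Binary cross-entropy with reference latents as 'real' and policy
    latents as 'fake' (an assumed labelling)."""
    return softplus(-disc_logit(h_ref, w, b)) + softplus(disc_logit(h_pol, w, b))


def latent_penalty(h_pol, w, b):
    """Non-saturating penalty for the policy: small when the discriminator
    scores policy latents like reference latents."""
    return softplus(-disc_logit(h_pol, w, b))


def regularized_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                     h_pol, w, b, beta=0.1, lam=1.0):
    """Preference loss plus lam times the adversarial latent penalty."""
    base = dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta)
    return base + lam * latent_penalty(h_pol, w, b)
```

In a real training loop the discriminator parameters and the policy would be updated in alternation, with `discriminator_loss` driving the former and `regularized_loss` the latter; the sketch only evaluates the two losses and leaves the optimizer out.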

0 Citations

0 Influential

11.5 Altmetric

57.5 Score
