2602.10430v1 Feb 11, 2026 cs.LG

Breaking the Curse of Repulsion: Optimistic Distributionally Robust Policy Optimization for Off-Policy Generative Recommendation

Y. Huo

Jie Jiang

Changping Wang

Jun Zhang

Xiangxin Zhan

Abstract

Policy-based Reinforcement Learning (RL) has established itself as the dominant paradigm in generative recommendation for optimizing sequential user interactions. However, when applied to offline historical logs, these methods suffer a critical failure: the dominance of low-quality data induces severe model collapse. We first establish the Divergence Theory of Repulsive Optimization, revealing that negative gradient updates inherently trigger exponential intensity explosion during off-policy training. This theory elucidates the inherent dilemma of existing methods, exposing their inability to reconcile variance reduction and noise imitation. To break this curse, we argue that the solution lies in rigorously identifying the latent high-quality distribution entangled within the noisy behavior policy. Accordingly, we reformulate the objective as an Optimistic Distributionally Robust Optimization (DRO) problem. Guided by this formulation, we propose Distributionally Robust Policy Optimization (DRPO). We prove that hard filtering is the exact solution to this DRO objective, enabling DRPO to optimally recover high-quality behaviors while strictly discarding divergence-inducing noise. Extensive experiments demonstrate that DRPO achieves state-of-the-art performance on mixed-quality recommendation benchmarks.
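
The abstract states that hard filtering is the exact solution to the DRO objective but gives no algorithmic details. As a rough illustration only, the sketch below shows one generic form of hard filtering: logged samples are ranked by reward, the top fraction receives weight 1, and everything else receives weight 0, so discarded samples contribute no gradient at all (neither attractive nor repulsive). The top-fraction rule, the `keep_frac` parameter, and the function names are assumptions made for this example and are not taken from the paper.

```python
import math

def hard_filter_weights(rewards, keep_frac):
    """Return 0/1 weights that keep the top `keep_frac` fraction of samples by reward.

    Hypothetical helper: the paper's actual filtering criterion is not
    given in the abstract; a top-fraction cutoff is assumed here.
    """
    n = len(rewards)
    k = max(1, math.ceil(keep_frac * n))  # always keep at least one sample
    # Sample indices sorted from highest to lowest reward.
    order = sorted(range(n), key=lambda i: rewards[i], reverse=True)
    kept = set(order[:k])
    return [1.0 if i in kept else 0.0 for i in range(n)]

def filtered_nll(log_probs, rewards, keep_frac):
    """Negative log-likelihood averaged over the retained samples only.

    Samples with weight 0 are dropped entirely rather than pushed away
    with a negative weight.
    """
    weights = hard_filter_weights(rewards, keep_frac)
    kept = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / kept

# Toy logged batch: per-sample rewards and policy log-probabilities.
rewards = [0.1, 0.9, 0.5, 0.8]
log_probs = [-1.0, -2.0, -3.0, -4.0]
print(hard_filter_weights(rewards, 0.5))
print(filtered_nll(log_probs, rewards, 0.5))
```

In this toy batch, only the two highest-reward samples enter the loss. In a real training loop, the log-probabilities would come from the policy being optimized and the cutoff would follow whatever criterion the paper derives.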
