2606.09124v1 Jun 08, 2026 cs.AI

A Regret Minimization Framework on Preference Learning in Large Language Models

Moontae Lee
Jungwoo Lee
Suhwan Kim
Taehyun Cho
Geon-hyeong Kim
Yu-Mi Kim
Youngsoo Jang

Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization (RePO), which reframes RLHF through regret minimization rather than reward maximization. Human preferences are often shaped by prospective anticipation of outcomes and counterfactual comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. RePO captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that RePO is an effective and human-aligned approach for training large language models.


