2601.09236v2 Jan 14, 2026 cs.LG

Reward Learning through Ranking Mean Squared Error

Chaitanya Kharyal

Calarina Muslimani

Matthew E. Taylor

Original Abstract

Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., "bad," "neutral," "good"). At each training step, we sample a set of trajectories, predict their returns, and rank them using a differentiable sorting operator (soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the teacher's ratings. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic locomotion benchmarks from OpenAI Gym and the DeepMind Control Suite, while requiring significantly less feedback.
