2603.02115v1 Mar 02, 2026 cs.RO

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Jiahui Zhang, Anthony Liang, Yigit Korkmaz, Minyoung Hwang, Abrar Anwar, Sid Kaushik, Luke S. Zettlemoyer, Dieter Fox, Yu Xiang, Andreea Bobu, Stephen Tu, Erdem Biyik, Adit Shah, Alex Huang, Anqi Li, Abhishek Gupta, Jesse Zhang

Abstract

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
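
The dual objective described above can be illustrated with a minimal sketch. The abstract does not give the exact loss forms, so everything here is an assumption made for illustration: the progress term is written as a mean squared error against per-frame progress labels in [0, 1], the preference term as a Bradley-Terry style logistic loss on one scalar score per trajectory, and the names `progress_loss`, `preference_loss`, `dual_objective`, and `pref_weight` are hypothetical rather than taken from the Robometer code.

```python
import math


def progress_loss(pred, target):
    """Assumed frame-level term: mean squared error between predicted
    and labeled per-frame progress on an expert trajectory."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def preference_loss(score_preferred, score_rejected):
    """Assumed trajectory-comparison term: negative log-likelihood, under a
    Bradley-Terry model, that the preferred trajectory outranks the other.

    Equals -log(sigmoid(d)) with d = score_preferred - score_rejected,
    computed as softplus(-d) in a numerically stable form.
    """
    d = score_preferred - score_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def dual_objective(expert_pred, expert_target,
                   score_preferred, score_rejected, pref_weight=1.0):
    """Weighted sum of the two terms; pref_weight is a hypothetical knob."""
    return (progress_loss(expert_pred, expert_target)
            + pref_weight * preference_loss(score_preferred, score_rejected))


if __name__ == "__main__":
    # Expert frames carry progress labels; the failed trajectory of the same
    # task only needs to be ranked below the successful one.
    loss = dual_objective(
        expert_pred=[0.1, 0.4, 0.9],
        expert_target=[0.0, 0.5, 1.0],
        score_preferred=1.5,   # score for the successful trajectory
        score_rejected=-0.5,   # score for the failed trajectory
    )
    print(f"combined loss: {loss:.4f}")
```

In this sketch the preference term needs no per-frame labels for the failed trajectory, only a ranking against another trajectory of the same task, which is one way to read the abstract's claim that comparison supervision extends to suboptimal and failure data where dense progress labels are ambiguous.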
