2605.03821v1 May 05, 2026 cs.RO

RoboAlign-R1: A Multimodal Reward Alignment Technique for Robot Video World Models

RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

Yuqiang Li (Citations: 759, h-index: 12)
Yingli Tian (Citations: 197, h-index: 7)
Fan Xu (Citations: 159, h-index: 5)
Fan Zhang (Citations: 41, h-index: 4)
Peng-Xiang Zhao (Citations: 19, h-index: 2)
Qiu-wan Wang (Citations: 55, h-index: 3)
Yizhou Zhao (Citations: 44, h-index: 3)
Weiyan Wang (Citations: 1,450, h-index: 5)
Xiaomeng Huang (Citations: 146, h-index: 8)
Hao Wu (Citations: 21, h-index: 3)
Yuan Gao (Citations: 129, h-index: 7)
Kun Wang (Citations: 7, h-index: 1)
Xiangxun Wu (Citations: 0, h-index: 0)

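Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which align poorly with the capabilities that matter most for robot decision making: instruction following, manipulation success, and physical plausibility. These models also accumulate errors during long-horizon autoregressive prediction. This paper proposes RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We build RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train RoboAlign-Judge, a multimodal teacher model that provides fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement-learning-based post-training. To reduce drift in long-horizon prediction, we introduce Sliding Window Re-encoding (SWR), a training-free inference strategy that periodically refreshes the generation context. Under our in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following. These improvements are supported by an external VLM-based cross-check and a blinded human study. SWR also improves long-horizon prediction quality, raising SSIM by 2.8% and lowering LPIPS by 9.8% at a cost of only about 1% additional latency. These results show that reward-aligned post-training and stabilized long-horizon decoding are effective at improving task consistency, physical realism, and long-horizon prediction quality in robot video world models.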

Original Abstract

Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign-Judge, to provide fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement-learning-based post-training. To reduce long-horizon rollout drift, we further introduce Sliding Window Re-encoding (SWR), a training-free inference strategy that periodically refreshes the generation context. Under our in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; these ranking improvements are further supported by an external VLM-based cross-check and a blinded human study. Meanwhile, SWR improves long-horizon prediction quality with only about 1% additional latency, yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS. Together, these results show that reward-aligned post-training and stabilized long-horizon decoding improve task consistency, physical realism, and long-horizon prediction quality in robot video world models.
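The abstract describes SWR only at a high level: a training-free inference strategy that periodically refreshes the generation context. The Python sketch below is one possible reading of that sentence, not the authors' implementation. Every name in it (`rollout_with_swr`, `encode`, `decode`, `predict_next`, `window`, `refresh_every`) is a hypothetical placeholder, and the refresh rule (re-encode the last `window` decoded frames every `refresh_every` steps) is an assumption.

```python
from typing import Callable, List, Sequence

Frame = List[float]   # toy stand-in for an image
Latent = List[float]  # toy stand-in for an encoded frame


def rollout_with_swr(
    context_frames: Sequence[Frame],
    encode: Callable[[Frame], Latent],
    decode: Callable[[Latent], Frame],
    predict_next: Callable[[Sequence[Latent]], Latent],
    horizon: int,
    window: int,
    refresh_every: int,
) -> List[Frame]:
    """Autoregressive rollout with a periodic context refresh (assumed SWR reading)."""
    if not context_frames:
        raise ValueError("at least one context frame is required")
    if horizon < 0 or window < 1 or refresh_every < 1:
        raise ValueError("horizon must be >= 0; window and refresh_every must be >= 1")

    frames = list(context_frames)
    # Initial context: latents of the most recent `window` frames.
    latents = [encode(f) for f in frames[-window:]]
    generated: List[Frame] = []

    for step in range(1, horizon + 1):
        next_latent = predict_next(latents)
        frame = decode(next_latent)
        frames.append(frame)
        generated.append(frame)

        # Slide the window over the predicted latents.
        latents = (latents + [next_latent])[-window:]

        if step % refresh_every == 0:
            # Assumed refresh step: drop the predicted latents and rebuild the
            # context by re-encoding the most recent decoded frames.
            latents = [encode(f) for f in frames[-window:]]

    return generated
```

With toy callables such as `encode = lambda f: [2 * x for x in f]`, `decode = lambda z: [x / 2 for x in z]`, and `predict_next = lambda ls: [x + 2 for x in ls[-1]]`, the rollout can be exercised without any model. In this reading the only extra work is one encoder pass over `window` frames per refresh, and the model weights are never touched, which is what "training-free" would require.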

0 Citations

0 Influential

6 Altmetric

30.0 Score
