2603.18532v1 Mar 19, 2026 cs.RO

Extending the Practicality of Reinforcement Learning for Robot Vision-Language-Action Models with Generative 3D Environments

Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds

Zhizhong Su

Citations: 35

h-index: 2

Andrew Choi

Citations: 11

h-index: 2

Xinjie Wang

Citations: 128

h-index: 5

Wei Xu

Citations: 7

h-index: 1

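The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has prompted similar approaches to fine-tuning vision-language-action (VLA) models in robotics. Many recent studies fine-tune VLA models directly in the real world to sidestep the sim-to-real gap. Real-world RL does avoid sim-to-real issues, but it limits the generalization ability of the resulting VLA model, because scaling scene and object diversity in physical environments is very difficult. The paradoxical result is that a broadly pretrained model is turned into a policy overfitted to specific scenes. Training in simulation gives access to diverse scenes, but designing those scenes is also costly. This work shows that 3D world generative models make it possible to fine-tune VLA models with RL without losing generalization ability and with less human labor. Using these models together with a language-driven scene design tool, the authors generate hundreds of diverse interactive scenes, each with unique objects and backgrounds, which enables scalable and highly parallel policy learning. Starting from a pretrained imitation learning model, the proposed method raises the simulation success rate from 9.7% to 79.8% and achieves a 1.25x speedup in task completion time. The quality of the generated digital twins, combined with domain randomization, also raises the real-world success rate from 21.7% to 75% with a 1.13x speedup, demonstrating successful sim-to-real transfer. Finally, an ablation study shows that drawing on the effectively unlimited data available from 3D world generative models directly improves zero-shot generalization.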

Original Abstract

The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the resulting VLA, as scaling scene and object diversity in the physical world is prohibitively difficult. This leads to the paradoxical outcome of transforming a broadly pretrained model into an overfitted, scene-specific policy. Training in simulation can instead provide access to diverse scenes, but designing those scenes is also costly. In this work, we show that VLAs can be RL fine-tuned without sacrificing generality and with reduced labor by leveraging 3D world generative models. Using these models together with a language-driven scene designer, we generate hundreds of diverse interactive scenes containing unique objects and backgrounds, enabling scalable and highly parallel policy learning. Starting from a pretrained imitation baseline, our approach increases simulation success from 9.7% to 79.8% while achieving a 1.25$\times$ speedup in task completion time. We further demonstrate successful sim-to-real transfer enabled by the quality of the generated digital twins together with domain randomization, improving real-world success from 21.7% to 75% and achieving a 1.13$\times$ speedup. Finally, we further highlight the benefits of leveraging the effectively unlimited data from 3D world generative models through an ablation study showing that increasing scene diversity directly improves zero-shot generalization.
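The headline numbers in the abstract can be unpacked with a little arithmetic. The sketch below is not from the paper; it assumes that "speedup" means baseline completion time divided by fine-tuned completion time, and the helper name `summarize` is hypothetical. Under that assumption, a 1.25x speedup corresponds to finishing in 80% of the baseline time.

```python
def summarize(before_pct, after_pct, speedup):
    """Return (percentage-point gain, relative success factor, fraction of baseline time).

    Assumption (not stated in the abstract): speedup = baseline_time / new_time.
    """
    gain_pp = after_pct - before_pct   # absolute gain in percentage points
    ratio = after_pct / before_pct     # how many times higher the success rate is
    time_fraction = 1.0 / speedup      # new completion time as a fraction of baseline
    return gain_pp, ratio, time_fraction


# Success rates and speedups as reported in the abstract.
results = {
    "simulation": summarize(9.7, 79.8, 1.25),
    "real world": summarize(21.7, 75.0, 1.13),
}

for setting, (gain_pp, ratio, time_fraction) in results.items():
    print(
        f"{setting}: +{gain_pp:.1f} points, {ratio:.1f}x success rate, "
        f"{time_fraction:.0%} of baseline completion time"
    )
```

Reporting the gain both in percentage points and as a ratio is a deliberate choice here: the simulation baseline is low (9.7%), so the ratio looks much larger than the real-world one even though both absolute gains are substantial.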

1 Citations

0 Influential

2.5 Altmetric

13.5 Score
