2602.03310v1 Feb 03, 2026 cs.RO

RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization

Songming Liu (Citations: 1,580; h-index: 10)
Bangguo Li (Citations: 649; h-index: 3)
Kai Ma (Citations: 128; h-index: 4)
Lingxuan Wu (Citations: 680; h-index: 4)
Hengkai Tan (Citations: 823; h-index: 7)
Ouyang Xiao (Citations: 100; h-index: 4)
Hang Su (Citations: 155; h-index: 5)
Jun Zhu (Citations: 182; h-index: 5)

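Vision-Language-Action (VLA) models show promise for general-purpose robots, but they currently face data scarcity, architectural inefficiency, and an inability to generalize across hardware platforms. This paper introduces RDT2, a robotic foundation model built on a 7B-parameter VLM that enables zero-shot deployment on new robot embodiments for open-vocabulary tasks. To this end, we used an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI) to collect one of the largest open-source robotic datasets, with over 10,000 hours of demonstrations in diverse families. Our approach uses a new three-stage training recipe that aligns discrete linguistic knowledge with continuous control through Residual Vector Quantization (RVQ), flow matching, and distillation for real-time inference. As a result, RDT2 is one of the first models to generalize zero-shot to unseen objects, scenes, instructions, and robot platforms simultaneously. RDT2 also outperforms state-of-the-art baselines on dexterous, long-horizon, and dynamic downstream tasks such as playing table tennis. See https://rdt-robotics.github.io/rdt2/ for details.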

Original Abstract

Vision-Language-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B parameter VLM designed to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets--over 10,000 hours of demonstrations in diverse families--using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow-matching, and distillation for real-time inference. Consequently, RDT2 becomes one of the first models that simultaneously zero-shot generalizes to unseen objects, scenes, instructions, and even robotic platforms. Besides, it outperforms state-of-the-art baselines in dexterous, long-horizon, and dynamic downstream tasks like playing table tennis. See https://rdt-robotics.github.io/rdt2/ for more information.
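
The abstract names Residual Vector Quantization (RVQ) as the step that turns continuous actions into discrete tokens. As a rough illustration of the general technique only, and not of RDT2's actual tokenizer, the sketch below quantizes a vector stage by stage, with each stage encoding the residual left by the previous one. The function names and the tiny hand-written codebooks are hypothetical; a real system would learn its codebooks from data.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks (assumption: hand-written here, learned in practice).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # stage 1: coarse
    [[0.0, 0.0], [0.5, -0.5]],  # stage 2: refines the stage-1 residual
]

codes = rvq_encode([1.4, 0.6], toy_codebooks)
print(codes)                             # [1, 1]
print(rvq_decode(codes, toy_codebooks))  # [1.5, 0.5]
```

Each added stage refines the approximation, so a short sequence of small integer codes can stand in for a continuous action vector. That is the property that makes this kind of discrete code usable as tokens by a language-model backbone.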

18 Citations

1 Influential

5 Altmetric

45.0 Score
