2604.03552v1 Apr 04, 2026 cs.RO

CRAFT: Bimanual Robot Data Generation Using Video Diffusion Models

CRAFT: Video Diffusion for Bimanual Robot Data Generation

Jason Chen

Citations: 9

h-index: 1

I. Liu

Citations: 60

h-index: 3

Gaurav Sukhatme

Citations: 56

h-index: 4

Daniel Seita

Citations: 14

h-index: 2

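Bimanual robot learning is fundamentally constrained by the cost of real-world data and its limited visual diversity, which undermines policy robustness across viewpoints, object configurations, and robot embodiments. This work presents Canny-guided Robot Data Generation (CRAFT), a scalable framework for generating bimanual robot demonstrations using Video Diffusion Transformers. CRAFT generates temporally coherent manipulation videos together with action labels. By conditioning the video diffusion model on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline covering object pose changes, camera viewpoint changes, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. A pre-trained video diffusion model converts simulated videos, paired with action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real demonstrations, CRAFT generates a large volume of visually diverse, photorealistic training data without replaying demonstrations on the real robot (Sim2Real). On bimanual tasks in simulation and in the real world, CRAFT achieves higher success rates than existing augmentation strategies and straightforward data scaling, indicating that diffusion-based video generation can substantially expand demonstration diversity and improve generalization in bimanual manipulation. The project website is available at: https://craftaug.github.io/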

Original Abstract

Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert simulated videos, along with action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real-world demonstrations, CRAFT generates a large, visually diverse set of photorealistic training data, bypassing the need to replay demonstrations on the real robot (Sim2Real). Across simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and straightforward data scaling, demonstrating that diffusion-based video generation can substantially expand demonstration diversity and improve generalization for dual-arm manipulation tasks. Our project website is available at: https://craftaug.github.io/
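The edge-conditioning step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a plain Sobel gradient threshold as a simplified stand-in for Canny edge detection, and the function names `sobel_edge_map` and `build_conditioning`, the threshold value, and the 14-dimensional action vector (assumed here to be two 7-DoF arms) are all hypothetical. It only shows the general idea of pairing each simulated frame's edge map with the action label from the same simulation step.

```python
# Simplified stand-in for Canny: Sobel gradient magnitude plus one threshold.
# Real Canny also applies Gaussian smoothing, non-maximum suppression, and
# hysteresis thresholding. All names and values here are hypothetical.

def sobel_edge_map(frame, threshold=1.0):
    """Return a binary edge map (nested lists of 0/1) for a 2D grayscale frame."""
    h, w = len(frame), len(frame[0])
    edges = [[0] * w for _ in range(h)]
    # Border pixels are left at 0 because the 3x3 kernel does not fit there.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


def build_conditioning(sim_frames, sim_actions, threshold=1.0):
    """Pair each simulated frame's edge map with the action from the same step."""
    if len(sim_frames) != len(sim_actions):
        raise ValueError("need exactly one action label per simulated frame")
    return [
        {"edges": sobel_edge_map(frame, threshold), "action": action}
        for frame, action in zip(sim_frames, sim_actions)
    ]


if __name__ == "__main__":
    # Toy 5x5 frame with a vertical intensity step between columns 1 and 2.
    frame = [[0, 0, 1, 1, 1] for _ in range(5)]
    # Hypothetical 14-D action (two 7-DoF arms); the abstract does not specify this.
    pairs = build_conditioning([frame], [[0.0] * 14])
    for row in pairs[0]["edges"]:
        print(row)
```

In a pipeline of the kind the abstract describes, the edge maps would serve as the structural conditioning signal for a video diffusion model, while the action labels pass through unchanged from the simulator.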

0 Citations

0 Influential

2 Altmetric

10.0 Score
