2601.05241v1 Jan 08, 2026 cs.CV

RoboVIP: Improving Robot Manipulation Performance through Multi-View Video Generation with Visual Identity Prompting

RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

Boyang Wang

Citations: 54

h-index: 3

Haoran Zhang

Citations: 10

h-index: 2

Shujie Zhang

Citations: 25

h-index: 2

Jinkun Hao

Citations: 20

h-index: 3

Mingda Jia

Citations: 9

h-index: 2

Qi Lv

Citations: 0

h-index: 0

Yucheng Mao

Citations: 135

h-index: 2

Zhaoyang Lyu

Citations: 6

h-index: 1

Jia Zeng

Citations: 538

h-index: 10

Xudong Xu

Citations: 15

h-index: 2

Jiangmiao Pang

Citations: 342

h-index: 10

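The quantity and quality of diverse data are critical for learning effective robot control policies. However, hardware and physical environment constraints still make it difficult to collect large-scale real-world robot manipulation data across diverse environments. Recent work augments manipulation data by using text-prompt-based image diffusion models to alter the backgrounds and tabletop objects in visual observations. These approaches, however, often fail to account for the multi-view, temporally consistent observations that state-of-the-art policy models require. In addition, text prompts alone cannot clearly specify the scene setup. To give the diffusion model explicit visual guidance, we propose visual identity prompting, which supplies exemplar images as conditioning inputs to generate the desired scene setup. To this end, we developed a scalable pipeline that builds a visual identity pool from large-scale robot datasets. Training downstream vision-language-action and visuomotor policy models on manipulation data augmented with the proposed method yielded consistent performance gains in both simulation and real-robot settings.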

Original Abstract

The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.

0 Citations

0 Influential

5 Altmetric

25.0 Score

