2603.09170v1 Mar 10, 2026 cs.RO

ZeroWBC: Natural Humanoid Control Learned from Human Egocentric Video

ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video

Dong Wang

Citations: 648

h-index: 10

Bin Zhao

Citations: 1,457

h-index: 16

Xuelong Li

Citations: 52

h-index: 4

Jiacheng Bao

Citations: 93

h-index: 4

Yucheng Xin

Citations: 18

h-index: 3

Yuyang Tian

Citations: 39

h-index: 3

Hao Yang

Citations: 11

h-index: 2

Haoming Song

Citations: 685

h-index: 9

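Whole-body control that lets humanoid robots interact naturally across diverse environments remains an important challenge. Several recent studies have demonstrated autonomous humanoid interaction control, but they rely on limited motion patterns and require expensive teleoperation data. Such approaches struggle to perform more human-like behaviors such as sitting or kicking. In addition, collecting teleoperation data on real robots is very costly and time-consuming. To address these constraints, we propose ZeroWBC, a new framework that learns a natural humanoid control policy directly from human egocentric video, without large-scale robot teleoperation data. Specifically, our approach first fine-tunes a vision-language model (VLM) to predict future whole-body motion from text instructions and human egocentric visual input. The generated motions are then mapped to the real robot's joints, and humanoid whole-body control is carried out through a robust, general motion tracking policy that we developed. Extensive experiments on the Unitree G1 humanoid robot confirm that our method outperforms existing approaches in motion naturalness and diversity. The result is a pipeline that reduces the burden of collecting teleoperation data for whole-body humanoid control, offering a scalable and efficient paradigm for general humanoid whole-body control.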

Original Abstract

Achieving versatile and naturalistic whole-body control for humanoid robot scene-interaction remains a significant challenge. While some recent works have demonstrated autonomous humanoid interactive control, they are constrained to rigid locomotion patterns and expensive teleoperation data collection, lacking the versatility to execute more human-like natural behaviors such as sitting or kicking. Furthermore, acquiring the necessary real robot teleoperation data is prohibitively expensive and time-consuming. To address these limitations, we introduce ZeroWBC, a novel framework that learns a natural humanoid visuomotor control policy directly from human egocentric videos, eliminating the need for large-scale robot teleoperation data and enabling natural humanoid robot scene-interaction control. Specifically, our approach first fine-tunes a Vision-Language Model (VLM) to predict future whole-body human motions based on text instructions and egocentric visual context, then these generated motions are retargeted to real robot joints and executed via our robust general motion tracking policy for humanoid whole-body control. Extensive experiments on the Unitree G1 humanoid robot demonstrate that our method outperforms baseline approaches in motion naturalness and versatility, successfully establishing a pipeline that eliminates teleoperation data collection overhead for whole-body humanoid control, offering a scalable and efficient paradigm for general humanoid whole-body control.
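
The three stages named in the abstract (motion prediction by a fine-tuned VLM, retargeting to robot joints, and execution by a motion tracking policy) can be pictured as a simple data flow. The sketch below is only an illustration of that flow under stated assumptions: every name in it (`Pipeline`, `predict_motion`, `retarget`, `track`, `demo_pipeline`) is hypothetical, the stages are toy stand-ins, and nothing here reflects the authors' actual models, interfaces, or data formats.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical placeholder types; the paper does not specify these.
Frame = Sequence[float]      # stand-in for egocentric visual context
HumanPose = List[float]      # stand-in for one predicted human whole-body pose
RobotJoints = List[float]    # stand-in for robot joint positions


@dataclass
class Pipeline:
    """Illustrative three-stage flow: predict -> retarget -> track."""
    predict_motion: Callable[[str, Frame], List[HumanPose]]   # stage 1 (VLM stand-in)
    retarget: Callable[[HumanPose], RobotJoints]               # stage 2
    track: Callable[[RobotJoints, RobotJoints], RobotJoints]   # stage 3

    def step(self, instruction: str, frame: Frame,
             robot_state: RobotJoints) -> List[RobotJoints]:
        # Stage 1: predict a short sequence of future human poses.
        human_motion = self.predict_motion(instruction, frame)
        # Stage 2: map each human pose to robot joint targets.
        targets = [self.retarget(pose) for pose in human_motion]
        # Stage 3: follow the targets one by one, carrying the state forward.
        commands = []
        state = robot_state
        for target in targets:
            state = self.track(state, target)
            commands.append(state)
        return commands


def demo_pipeline() -> Pipeline:
    """Build a pipeline from toy stages (assumptions, not real models)."""
    def predict(instruction: str, frame: Frame) -> List[HumanPose]:
        return [[1.0, 2.0], [2.0, 4.0]]          # fixed dummy motion

    def retarget(pose: HumanPose) -> RobotJoints:
        # Toy joint limits of [-3, 3] on every joint.
        return [max(-3.0, min(3.0, q)) for q in pose]

    def track(state: RobotJoints, target: RobotJoints) -> RobotJoints:
        # Toy tracker: move halfway from the current state to the target.
        return [(s + t) / 2.0 for s, t in zip(state, target)]

    return Pipeline(predict, retarget, track)


if __name__ == "__main__":
    cmds = demo_pipeline().step("sit on the chair", [0.0], [0.0, 0.0])
    print(cmds)
```

Keeping the three stages as separate callables mirrors the separation the abstract describes: only the first stage consumes text and vision, while the last stage only ever sees robot-space targets.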

4 Citations

0 Influential

8 Altmetric

44.0 Score
