2604.19734v1 Apr 21, 2026 cs.RO

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Yi Chen (Citations: 349; h-index: 9)

Yuying Ge (Citations: 4,479; h-index: 28)

Yixiao Ge (Citations: 479; h-index: 12)

Boyu Chen (Citations: 119; h-index: 7)

Lu Qiu (Citations: 76; h-index: 4)

Jerry Bai (Citations: 0; h-index: 0)

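Improving humanoid foundation models is held back by the scarcity of robot data. Large-scale egocentric human data offers a scalable alternative, but kinematic mismatches make bridging the gap between human and humanoid embodiments a fundamental challenge. This work proposes UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that builds a unified physical language for human-to-humanoid transfer, based on the premise that different kinematic structures share common visual consequences. UniT uses a tri-branch cross-reconstruction mechanism: actions predict vision, which anchors kinematics to physical outcomes, while vision reconstructs actions, which filters out irrelevant visual factors. In parallel, a fusion branch combines the purified information into a shared discrete latent space of embodiment-agnostic physical intents. UniT is validated in two settings. 1) Policy learning (VLA-UniT): by predicting the unified tokens, the policy makes effective use of diverse human data, achieving state-of-the-art data efficiency and robust out-of-distribution generalization on a humanoid simulation benchmark and in real-world deployments, and notably demonstrating zero-shot task transfer. 2) World modeling (WM-UniT): by conditioning on the unified tokens to align dynamics across embodiments, it enables direct human-to-humanoid action transfer, so that human data can be used to improve action controllability in humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (t-SNE visualizations empirically show human and humanoid features converging into a shared space), UniT offers a scalable way to turn vast human knowledge into general-purpose humanoid capabilities.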

Original Abstract

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
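
The tri-branch cross-reconstruction idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so everything here is an assumption made for illustration, including the random linear maps standing in for learned networks, the nearest-neighbour codebook lookup, the way the fusion branch concatenates the two branch latents, the loss names, and all dimensions. The sketch shows only the data flow: per-embodiment action encoders with different action sizes, a shared vision path, and one shared codebook that turns both embodiments into tokens from the same vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are hypothetical, chosen only to keep the example small.
LATENT_DIM, CODEBOOK_SIZE, VISION_DIM = 8, 32, 16
ACTION_DIMS = {"human": 10, "humanoid": 7}  # mismatched kinematics


def linear(in_dim, out_dim):
    """Random linear map standing in for a learned network (assumption)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))


# Action encoders/decoders are per embodiment; vision modules, the fusion
# map, and the codebook are shared across embodiments.
action_enc = {k: linear(d, LATENT_DIM) for k, d in ACTION_DIMS.items()}
action_dec = {k: linear(LATENT_DIM, d) for k, d in ACTION_DIMS.items()}
vision_enc = linear(VISION_DIM, LATENT_DIM)
vision_dec = linear(LATENT_DIM, VISION_DIM)
fusion = linear(2 * LATENT_DIM, LATENT_DIM)
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))


def quantize(z):
    """Map each latent row to its nearest codebook entry (ids, vectors)."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


def tri_branch(embodiment, action, vision_change):
    """Return fused tokens and the four reconstruction losses for a batch."""
    z_a = action @ action_enc[embodiment]               # action branch
    z_v = vision_change @ vision_enc                    # vision branch
    z_f = np.concatenate([z_a, z_v], axis=1) @ fusion   # fusion branch
    _, q_a = quantize(z_a)
    _, q_v = quantize(z_v)
    tokens, q_f = quantize(z_f)
    losses = {
        # Actions predict vision: ties kinematics to visual outcomes.
        "action_to_vision": mse(q_a @ vision_dec, vision_change),
        # Vision reconstructs actions: keeps only action-relevant visual cues.
        "vision_to_action": mse(q_v @ action_dec[embodiment], action),
        # Fused tokens should explain both modalities.
        "fusion_to_vision": mse(q_f @ vision_dec, vision_change),
        "fusion_to_action": mse(q_f @ action_dec[embodiment], action),
    }
    return tokens, losses


batch = 4
vision_change = rng.normal(size=(batch, VISION_DIM))
for name, dim in ACTION_DIMS.items():
    tokens, losses = tri_branch(name, rng.normal(size=(batch, dim)), vision_change)
    print(name, tokens, losses)
```

In this sketch both embodiments emit token ids from the same range, which is the property a downstream policy or world model would rely on. A real system would train the maps and codebook on the summed losses; no training is performed here.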
