2603.08572v1 Mar 09, 2026 cs.RO

MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation

Jianwei Zhang (Citations: 110, h-index: 6)
Yutong Shen (Citations: 26, h-index: 3)
Hangxu Liu (Citations: 42, h-index: 3)
Penghui Liu (Citations: 8, h-index: 2)
Jiashuo Luo (Citations: 0, h-index: 0)
Yongkang Zhang (Citations: 0, h-index: 0)
Rex Morvley (Citations: 0, h-index: 0)
Chenfanfu Jiang (Citations: 1,369, h-index: 16)
Lei Zhang (Citations: 100, h-index: 5)

Original Abstract

Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, the generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of Specialized Expert Policies (SEP). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
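
The abstract does not say how the router combines expert outputs, so the following is only an illustrative sketch of one common way to do it: a softmax-weighted blend of expert actions. Every name here (`blend_experts`, `walk_expert`, `reach_expert`, the hand-set `scores`) is a hypothetical stand-in, not part of MetaWorld-X, and the fixed scores merely take the place of whatever a VLM-supervised router would produce.

```python
import math
from typing import Callable, Dict, List, Sequence

Action = List[float]
Expert = Callable[[Sequence[float]], Action]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_experts(
    obs: Sequence[float],
    experts: Dict[str, Expert],
    scores: Dict[str, float],
) -> Action:
    """Return the weighted sum of expert actions, with weights softmax(scores)."""
    names = list(experts)
    weights = softmax([scores[n] for n in names])
    actions = [experts[n](obs) for n in names]
    dim = len(actions[0])
    return [sum(w * a[i] for w, a in zip(weights, actions)) for i in range(dim)]


# Toy stand-ins for trained expert policies (assumption: 2-D action space).
def walk_expert(obs: Sequence[float]) -> Action:
    return [1.0, 0.0]


def reach_expert(obs: Sequence[float]) -> Action:
    return [0.0, 1.0]


experts = {"walk": walk_expert, "reach": reach_expert}
# Hand-set scores standing in for a router's output on a locomotion-heavy stage.
scores = {"walk": 2.0, "reach": 0.0}
action = blend_experts([0.0], experts, scores)
```

With these scores the blended action leans toward the walking expert while keeping a small contribution from the reaching expert. A hard (argmax) switch between experts would be an equally plausible reading of "dynamically integrates".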
