2602.15733v1 Feb 17, 2026 cs.RO

MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction

Jian Tang (Citations: 53, h-index: 4)
Qiang Zhang (Citations: 464, h-index: 13)
Jiahao Ma (Citations: 9, h-index: 2)
Peiran Liu (Citations: 43, h-index: 3)
Zeran Su (Citations: 4, h-index: 1)
Zifan Wang (Citations: 7, h-index: 2)
Jingkai Sun (Citations: 361, h-index: 10)
Wei Cui (Citations: 14, h-index: 2)
Jialing Yu (Citations: 335, h-index: 11)
Gang Han (Citations: 121, h-index: 6)
Wen Zhao (Citations: 130, h-index: 7)
Pihai Sun (Citations: 25, h-index: 2)
Kangning Yin (Citations: 95, h-index: 4)
Jiaxu Wang (Citations: 285, h-index: 10)
Jiahang Cao (Citations: 40, h-index: 3)
Lingfeng Zhang (Citations: 127, h-index: 5)
Haotai Cheng (Citations: 132, h-index: 6)
Junwei Liang (Citations: 12, h-index: 2)
Renjing Xu (Citations: 564, h-index: 13)
Yijie Guo (Citations: 120, h-index: 6)
Shuai Shi (Citations: 23, h-index: 3)
Xiaoshuai Hao (Citations: 73, h-index: 4)
Yiding Ji (Citations: 39, h-index: 3)

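Humanoid motion control has advanced remarkably in recent years, with deep reinforcement learning (RL) playing a key role in achieving complex, human-like behaviors. However, the high dimensionality and complex dynamics of humanoid robots make manual motion design impractical, which leads to heavy reliance on expensive motion capture (MoCap) data. Such datasets are not only costly to acquire but also often lack the necessary geometric information about the surrounding environment. As a result, existing motion synthesis frameworks often decouple motion from the scene, causing physical inconsistencies such as contact slippage or mesh penetration during terrain-aware tasks. This work combines 3D scene reconstruction and embodied intelligence so that humanoid robots can, directly from video, ...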

Original Abstract

Humanoid motion control has witnessed significant breakthroughs in recent years, with deep reinforcement learning (RL) emerging as a primary catalyst for achieving complex, human-like behaviors. However, the high dimensionality and intricate dynamics of humanoid robots make manual motion design impractical, leading to a heavy reliance on expensive motion capture (MoCap) data. These datasets are not only costly to acquire but also frequently lack the necessary geometric context of the surrounding physical environment. Consequently, existing motion synthesis frameworks often suffer from a decoupling of motion and scene, resulting in physical inconsistencies such as contact slippage or mesh penetration during terrain-aware tasks. In this work, we present MeshMimic, an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects. We introduce an optimization algorithm based on kinematic consistency to extract high-quality motion data from noisy visual reconstructions, alongside a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid agent. Experimental results demonstrate that MeshMimic achieves robust, highly dynamic performance across diverse and challenging terrains. Our approach proves that a low-cost pipeline utilizing only consumer-grade monocular sensors can facilitate the training of complex physical interactions, offering a scalable path toward the autonomous evolution of humanoid robots in unstructured environments.

2 Citations

0 Influential

6.5 Altmetric

34.5 Score
