2604.01001v1 Apr 01, 2026 cs.CV

EgoSim: Egocentric World Simulator for Embodied Interaction Generation

Jiangmiao Pang

Citations: 678

h-index: 11

Jinkun Hao

Citations: 20

h-index: 3

Mingda Jia

Citations: 9

h-index: 2

Xudong Xu

Citations: 15

h-index: 2

Xihui Liu

Citations: 548

h-index: 8

Ruiyan Wang

Citations: 5

h-index: 2

Ran Yi

Citations: 130

h-index: 4

Lizhuang Ma

Citations: 152

h-index: 6

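This paper introduces EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state so that simulation can continue across interactions. Existing egocentric simulators either lack explicit 3D grounding, which causes structural drift when the viewpoint changes, or treat the scene as static and so cannot update the world state across multi-stage interactions. EgoSim addresses both limitations by modeling the 3D scene as an updatable world state. Embodiment interactions are generated by a Geometry-action-aware Observation Simulation model, and spatial consistency is maintained by an Interaction-aware State Updating module. To ease the data bottleneck caused by the difficulty of acquiring densely aligned scene-interaction training pairs, the authors design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from large-scale, in-the-wild monocular egocentric videos. They also introduce EgoCap, a capture system for low-cost real-world data collection with uncalibrated smartphones. According to the reported experiments, EgoSim significantly outperforms existing methods in visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, and it supports cross-embodiment transfer to robotic manipulation. Code and datasets are to be released; the project page is egosimulator.github.io.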

Original Abstract

We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodiment interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency from an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty in acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from in-the-wild large-scale monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in terms of visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Codes and datasets will be open soon. The project page is at egosimulator.github.io.
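The closed-loop idea in the abstract (generate an observation for an interaction, then write the interaction's effect back into a persistent 3D state) can be illustrated with a toy sketch. This is not the authors' implementation: `WorldState`, `Action`, `simulate_observation`, `update_state`, and `run_closed_loop` are hypothetical names, the "scene" is just a handful of named 3D points, and the "observation" is each point expressed relative to the camera rather than a generated video.

```python
from dataclasses import dataclass, field

Vec3 = tuple[float, float, float]


@dataclass
class WorldState:
    """Toy stand-in for an updatable 3D scene: named points in world coordinates."""
    points: dict[str, Vec3] = field(default_factory=dict)


@dataclass
class Action:
    """Hypothetical interaction: move one named object by a world-space offset."""
    target: str
    offset: Vec3


def _apply(points: dict[str, Vec3], action: Action) -> dict[str, Vec3]:
    """Return a copy of the points with the action's offset applied to its target."""
    moved = dict(points)
    old = moved[action.target]
    moved[action.target] = tuple(o + d for o, d in zip(old, action.offset))
    return moved


def simulate_observation(state: WorldState, camera: Vec3, action: Action) -> dict[str, Vec3]:
    """Placeholder observation model: the post-interaction scene as seen from the camera.

    A real model would generate video; here each point is only shifted into
    camera-relative coordinates (an assumption made for illustration).
    """
    after = _apply(state.points, action)
    return {name: tuple(p - c for p, c in zip(pos, camera)) for name, pos in after.items()}


def update_state(state: WorldState, action: Action) -> WorldState:
    """Placeholder state update: commit the interaction's effect to the world state."""
    return WorldState(_apply(state.points, action))


def run_closed_loop(state: WorldState, steps: list[tuple[Vec3, Action]]):
    """Alternate observation and state update so later steps see earlier changes."""
    observations = []
    for camera, action in steps:
        observations.append(simulate_observation(state, camera, action))
        state = update_state(state, action)  # persist the change for the next step
    return state, observations


initial = WorldState({"cup": (1.0, 0.0, 0.0), "table": (0.0, 0.0, 0.0)})
steps = [
    ((0.0, 0.0, 0.0), Action("cup", (0.0, 1.0, 0.0))),
    ((1.0, 0.0, 0.0), Action("cup", (0.0, 0.0, 1.0))),
]
final_state, observations = run_closed_loop(initial, steps)
```

The point of the sketch is the second step: because the state is updated after the first interaction, the second observation shows the cup where the first interaction left it, even though the camera has moved. A simulator that treated the scene as static would render the cup from its initial position instead.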

0 Citations

0 Influential

5.5 Altmetric

27.5 Score
