2605.29488v1 May 28, 2026 cs.CV

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Ruibing Hou

Citations: 144

h-index: 6

Hong Chang

Citations: 123

h-index: 5

Zhuo Li

Citations: 2

h-index: 1

Shiguang Shan

Citations: 27

h-index: 2

Yiheng Li

Citations: 35

h-index: 2

Yingjie Chen

Citations: 22

h-index: 1

Hao Liu

Citations: 33

h-index: 2

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
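The abstract names a Residual FSQ-based motion tokenizer but gives no details, so the following is only a minimal sketch of the general idea of residual finite scalar quantization, not the authors' implementation. It assumes inputs already bounded to [-1, 1], an odd number of grid levels per stage, and a per-stage rescaling by `levels - 1`; the function names `fsq_quantize` and `residual_fsq` and all parameter values are hypothetical choices for illustration.

```python
import numpy as np


def fsq_quantize(x, levels):
    """Round each coordinate of x (expected in [-1, 1]) onto a uniform grid.

    Returns integer codes in [-(levels-1)/2, (levels-1)/2] and the
    dequantized values in [-1, 1]. `levels` is assumed to be odd.
    """
    half = (levels - 1) / 2.0
    codes = np.round(np.clip(x, -1.0, 1.0) * half)
    return codes.astype(np.int64), codes / half


def residual_fsq(x, levels=5, num_stages=3):
    """Quantize x in several stages, each stage coding the previous residual.

    Hypothetical design choice: after each stage the residual is at most
    1/(levels-1) of the current scale, so the next stage shrinks its scale
    by that factor to cover the residual range again.
    """
    residual = np.asarray(x, dtype=np.float64)
    recon = np.zeros_like(residual)
    all_codes = []
    scale = 1.0
    for _ in range(num_stages):
        codes, q = fsq_quantize(residual / scale, levels)
        recon = recon + q * scale          # accumulate the reconstruction
        residual = residual - q * scale    # what is still unexplained
        all_codes.append(codes)
        scale /= (levels - 1)
    return all_codes, recon


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.uniform(-1.0, 1.0, size=(4, 8))  # stand-in for motion latents
    codes, recon = residual_fsq(latent, levels=5, num_stages=3)
    print("stages:", len(codes))
    print("max abs error:", np.abs(latent - recon).max())
```

Under these assumptions, each added stage shrinks the worst-case reconstruction error by a factor of `levels - 1`, which is the usual motivation for stacking residual stages instead of enlarging a single codebook.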
