2605.29488v1 May 28, 2026 cs.CV

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Ruibing Hou
Citations: 144
h-index: 6
Hong Chang
Citations: 123
h-index: 5
Zhuo Li
Citations: 2
h-index: 1
Shiguang Shan
Citations: 27
h-index: 2
Yiheng Li
Citations: 35
h-index: 2
Yingjie Chen
Citations: 22
h-index: 1
Hao Liu
Citations: 33
h-index: 2

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.

