2603.22282v1 Mar 23, 2026 cs.CV

UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

Ziyi Wang (Citations: 58, h-index: 4)

Xinshun Wang (Citations: 13, h-index: 3)

Shuang Chen (Citations: 26, h-index: 3)

Yang Cong (Citations: 108, h-index: 5)

Mengyuan Liu (Citations: 33, h-index: 4)

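This paper proposes UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted subsets of modalities (e.g., motion-text or static pose-image) and rely mainly on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes these limitations through one core principle: treating motion as a first-class continuous modality on equal footing with RGB. A new Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders build parallel continuous pathways for motion and RGB within a shared LLM backbone. To inject visual-semantic information into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills the richer posterior of a vision-fused encoder into the motion-only encoder. To address the cold-start problem, in which text supervision alone is too sparse to calibrate the newly introduced motion pathway, we propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy. LRA uses dense motion latents as unambiguous conditions to jointly calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong results on cross-modal compositional tasks.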

Original Abstract

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
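
The abstract does not give the DPA objective in closed form. As a rough illustration of what a KL term between two encoder posteriors can look like, the sketch below computes the KL divergence between two diagonal Gaussians. Everything here is an assumption for illustration: the Gaussian parameterization, the function name `gaussian_kl`, the variable names, and the direction of the divergence are not taken from the paper.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given per-dimension means and log-variances.

    Hypothetical helper: the paper's actual DPA loss may differ in form and direction.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Per-dimension closed form:
        # 0.5 * (log(var_p / var_q) + (var_q + (mq - mp)^2) / var_p - 1)
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


# Assumed toy posteriors: a vision-fused "teacher" and a motion-only "student".
mu_fused, logvar_fused = [1.0, 0.0], [0.0, 0.0]
mu_motion, logvar_motion = [0.0, 0.0], [0.0, 0.0]

# Assumed direction: pull the motion-only posterior toward the vision-fused one.
alignment_loss = gaussian_kl(mu_fused, logvar_fused, mu_motion, logvar_motion)
```

Minimizing a term of this kind with respect to the motion-only encoder's parameters would move its posterior toward the vision-fused posterior, which is one way to read "distills" in the abstract. Because only the motion-only encoder is needed afterwards, no image is required at inference.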

3 Citations

0 Influential

2.5 Altmetric

15.5 Score
