2606.13289v1 Jun 11, 2026 cs.CV

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Changlin Li
Zhao Zhong
Junzhe Li
Liefeng Bo
Tao Huang
Miles Yang
Guozhen Zhang
Xuerui Qiu
Yutao Cui
Tian-Shu Song
Xiao Zhang
Yang Li
Jianbin Wu
Limin Wang

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) because they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design addresses two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. For the first challenge, comprehensive ablations yield two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. For the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement to the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, which substantially improves editing consistency and accelerates convergence. Instantiated as a 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.
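
To make the first ablation finding concrete, the sketch below contrasts a frame-level causal attention mask with a full spatiotemporal one. It is an illustration only, not the authors' implementation. The function names, the frame-major token ordering, and the choice to let tokens attend bidirectionally within their own frame are all assumptions that the abstract does not specify.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Return an N x N boolean mask, N = num_frames * tokens_per_frame.

    mask[q][k] is True when query token q may attend to key token k.
    Assumption (hypothetical layout): tokens are flattened frame by frame,
    and attention inside a single frame is unrestricted.
    """
    # Frame index of each token in the flattened sequence.
    frame_id = [f for f in range(num_frames) for _ in range(tokens_per_frame)]
    # A query sees every key that belongs to its own frame or an earlier one.
    return [[q >= k for k in frame_id] for q in frame_id]


def full_spatiotemporal_mask(num_frames, tokens_per_frame):
    """Baseline for comparison: every token attends to every token."""
    n = num_frames * tokens_per_frame
    return [[True] * n for _ in range(n)]


if __name__ == "__main__":
    # Print a small mask: 3 frames, 2 tokens per frame.
    for row in frame_causal_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this layout the mask is block lower-triangular: the first frame is encoded without any view of later frames, which is also what would allow a single image to be treated as a one-frame video.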


