2605.25952v1 May 25, 2026 cs.CV

VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

Yujiu Yang
Yinghao Wu
Zhuoyan Luo
Yiyao Yu
Zhaojian Yu
Xiao-Ping Zhang

Despite the remarkable progress of recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on a high compression ratio for a single visual clue, together with their reliance on heuristic pruning strategies with coarse attention alignment, creates a bottleneck in the information capacity and density of visual tokens. To address this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception that follows an enrich-then-compact principle. Specifically, we first enrich the information capacity by unifying visual representations from different perspectives, and then progressively compact the unified representation with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of the vanilla structure via explicit visual supervision, which facilitates the preservation of crucial information. Experimental results demonstrate the superiority of our approach on complex visual tasks using only a few information-condensed tokens, effectively bridging the gap between performance and efficiency.
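
The enrich-then-compact idea can be illustrated with a small toy sketch. This is not the VEN-VL implementation: the abstract does not specify the encoders, the router design, or the compression schedule, so everything below is an assumption made for illustration. In particular, the function names (`enrich`, `compact`, `enrich_then_compact`), the fixed linear scoring rule with top-k selection (swapped in for the paper's adaptive routers inside specialized visual experts), and the keep schedule `[8, 4]` are all hypothetical.

```python
import random


def enrich(views):
    """Enrich: pool the token lists from several (hypothetical) visual encoders."""
    return [tok for view in views for tok in view]


def router_scores(tokens, weights):
    """Toy linear router: one scalar score per token (dot product with weights)."""
    return [sum(t * w for t, w in zip(tok, weights)) for tok in tokens]


def compact(tokens, weights, keep):
    """Compact: keep the `keep` highest-scoring tokens, preserving their order."""
    scores = router_scores(tokens, weights)
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


def enrich_then_compact(views, stage_weights, keep_schedule):
    """Pool all views, then shrink the token set stage by stage."""
    tokens = enrich(views)
    for weights, keep in zip(stage_weights, keep_schedule):
        tokens = compact(tokens, weights, keep)
    return tokens


# Assumed toy setup: two "perspectives", 8 tokens each, 4-dimensional features.
rng = random.Random(0)
dim = 4
views = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)] for _ in range(2)]
stage_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

pool = enrich(views)                      # 16 tokens after enrichment
compacted = enrich_then_compact(views, stage_weights, keep_schedule=[8, 4])
print(len(pool), "->", len(compacted))    # token count before and after compaction
```

In this sketch the token count first grows (two views are pooled) and then shrinks progressively, which mirrors the capacity-then-density ordering described in the abstract. A learned router and the explicit visual reconstruction supervision mentioned above are deliberately left out.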

