2606.12106v1 Jun 10, 2026 cs.CV

MSUE: Multi-Modal Soccer Understanding Expert

Ji-rong Wen

Litao Li

Yixi Zhou

Yi Yu

Yufeng Hu

Zhuoran Yang

Yixin Chen

This paper presents our solution to the 2026 SoccerNet VQA Challenge. First, we develop a cost-effective data synthesis pipeline driven by a Vision-Language Model (VLM) that systematically restructures raw domain data into diverse VQA samples, covering both concise answers and long-form responses. Second, we propose MSUE, a multi-expert question-answering architecture in which a Large Language Model (LLM) dynamically dispatches each question to a text, image, or video expert. These experts are instantiated, respectively, as a strong text baseline (Gemini3-Flash), a fine-tuned Qwen3-VL, and an external knowledge base, and they work collaboratively to improve VQA performance. MSUE achieves an accuracy of 0.95 on the challenge benchmark, placing third on the leaderboard.
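The dispatch step described above can be illustrated with a minimal sketch. It assumes the dispatcher LLM is any callable that maps a prompt string to a reply string, and that each expert is a callable taking the question and an optional media path. The prompt wording, the `Query` fields, and the helpers `build_routing_prompt`, `route`, and `answer` are hypothetical and do not come from the paper. Stub functions stand in for Gemini3-Flash, the fine-tuned Qwen3-VL, and the knowledge base, so this shows only the routing interface, not MSUE's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical interfaces: an LLM maps a prompt to a reply string, and an
# expert maps (question, optional media path) to an answer string.
LLM = Callable[[str], str]
Expert = Callable[[str, Optional[str]], str]

VALID_EXPERTS = ("text", "image", "video")


@dataclass
class Query:
    question: str
    media_path: Optional[str] = None  # hypothetical path to an image or clip
    media_type: Optional[str] = None  # "image", "video", or None


def build_routing_prompt(query: Query) -> str:
    """Assumed prompt format; the paper's actual dispatcher prompt is not given."""
    return (
        "Choose the expert that should answer this soccer question.\n"
        f"Attached media: {query.media_type or 'none'}\n"
        f"Question: {query.question}\n"
        "Reply with exactly one word: text, image, or video."
    )


def route(query: Query, llm: LLM) -> str:
    """Ask the LLM for an expert label; fall back to 'text' on unparseable replies."""
    label = llm(build_routing_prompt(query)).strip().lower()
    return label if label in VALID_EXPERTS else "text"


def answer(query: Query, llm: LLM, experts: Dict[str, Expert]) -> str:
    """Dispatch the query to the selected expert and return its answer."""
    label = route(query, llm)
    return experts[label](query.question, query.media_path)


# --- Usage with stubs (illustration only; no real models are called) ---
def fake_llm(prompt: str) -> str:
    # Stub dispatcher: picks "video" when a clip is attached, else "text".
    return "Video" if "Attached media: video" in prompt else "text"


experts: Dict[str, Expert] = {
    "text": lambda q, m: f"text expert: {q}",
    "image": lambda q, m: f"image expert ({m}): {q}",
    "video": lambda q, m: f"video expert ({m}): {q}",
}

q = Query("Which team scored first?", "clip_001.mp4", "video")
print(answer(q, fake_llm, experts))
# prints: video expert (clip_001.mp4): Which team scored first?
```

Normalizing the reply and falling back to the text expert is one defensive choice for handling free-form LLM output; the paper does not specify how invalid dispatch decisions are handled.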
