2606.12106v1 Jun 10, 2026 cs.CV

MSUE: Multi-Modal Soccer Understanding Expert

Ji-rong Wen
Litao Li
Yixi Zhou
Yi Yu
Yufeng Hu
Zhuoran Yang
Yixin Chen

This paper presents our solution to the 2026 SoccerNet VQA Challenge. First, we develop a cost-effective data synthesis pipeline driven by a Vision-Language Model (VLM) that systematically restructures raw domain data into diverse VQA samples with both concise answers and long-form responses. Second, we propose MSUE, a multi-expert question answering architecture in which a Large Language Model (LLM) dynamically dispatches questions to text, image, and video experts. These experts are instantiated as a strong text baseline (Gemini3-Flash), a fine-tuned Qwen3-VL, and an external knowledge base, respectively, and work collaboratively to improve VQA performance. MSUE achieves an accuracy of 0.95 on the challenge benchmark, securing third place on the leaderboard.
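The dispatch idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says the router is an LLM, whereas `stub_router` below is a hypothetical rule-based stand-in, and `Question`, `answer`, and the three lambda experts are all names assumed for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Question:
    text: str
    image_path: Optional[str] = None  # hypothetical field, not from the paper
    video_path: Optional[str] = None  # hypothetical field, not from the paper


Expert = Callable[[Question], str]
Router = Callable[[Question], str]


def stub_router(q: Question) -> str:
    """Stand-in for the LLM dispatcher (assumption: route by attached media)."""
    if q.video_path is not None:
        return "video"
    if q.image_path is not None:
        return "image"
    return "text"


def answer(q: Question, experts: Dict[str, Expert],
           router: Router = stub_router) -> str:
    """Ask the router for an expert label, then call that expert."""
    label = router(q)
    # Fall back to the text expert if the router returns an unknown label.
    expert = experts.get(label, experts["text"])
    return expert(q)


# Placeholder experts; real ones would wrap model or knowledge-base calls.
experts: Dict[str, Expert] = {
    "text": lambda q: f"[text expert] {q.text}",
    "image": lambda q: f"[image expert] {q.text}",
    "video": lambda q: f"[video expert] {q.text}",
}
```

Calling `answer(Question("Who won the match?"), experts)` routes to the text expert because no media is attached. Passing a different `router` callable is where an LLM-backed dispatcher would plug in.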

