2605.29300v1 May 28, 2026 cs.CL

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

Yuki Mitsufuji
Shuyang Cui
Daeyong Kwon
Qiyu Wu
S. Kuriya
Junghyun Koo
Sony AI / Sony Research
Zhichao Zhong
Wei-Hsiang Liao
Hiromi Wakaki

Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.
