2603.09714v1 Mar 10, 2026 cs.SD

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

Chih-Kai Yang (Citations: 389; h-index: 12)
Yun-Shao Tsai (Citations: 91; h-index: 2)
Yu-Kai Guo (Citations: 1; h-index: 1)
Ping-Le Tsai (Citations: 52; h-index: 1)
Yen-Ting Piao (Citations: 44; h-index: 2)
Hung-Wei Chen (Citations: 134; h-index: 4)
Ting-Lin Hsiao (Citations: 1; h-index: 1)
Yun-Man Hsu (Citations: 1; h-index: 1)
Ke-Han Lu (Citations: 518; h-index: 13)
Hung-yi Lee (Citations: 124; h-index: 4)

Original Abstract

While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought raises the accuracy gain further, to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
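
As a rough illustration of the permutation-and-vote idea described in the abstract, the Python sketch below shuffles the order of the audio candidates several times, queries a model on each ordering, maps each answer back to the candidate's original index, and takes a majority vote. The function name, the `ask_model` callable, and the parameters `num_permutations` and `seed` are all hypothetical. The abstract does not specify the authors' prompting or aggregation details, so this should be read as a sketch under those assumptions rather than as their implementation.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def permutational_self_consistency(
    candidates: Sequence[str],
    ask_model: Callable[[Sequence[str]], int],
    num_permutations: int = 5,
    seed: int = 0,
) -> int:
    """Return the ORIGINAL index of the candidate chosen most often.

    `ask_model` is a hypothetical stand-in for an LALM call: it receives the
    candidates in some order and returns the position it picks in that order.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_permutations):
        # Draw a random ordering of the candidate indices.
        order = list(range(len(candidates)))
        rng.shuffle(order)
        shuffled = [candidates[i] for i in order]
        # The model answers with a position in the shuffled list;
        # translate it back to the index in the original list.
        picked_position = ask_model(shuffled)
        votes[order[picked_position]] += 1
    # Majority vote over all orderings.
    return votes.most_common(1)[0][0]


def toy_model(shuffled: Sequence[str]) -> int:
    # Assumed toy behaviour: always locates the clip named "dog.wav".
    return list(shuffled).index("dog.wav")


clips = ["cat.wav", "dog.wav", "bird.wav"]
print(permutational_self_consistency(clips, toy_model))  # prints 1
```

Mapping each answer back to the original index before voting is what lets the votes from different orderings be compared. A model with a positional bias would spread its errors across different candidates, while a correct answer would tend to accumulate votes.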

Citations: 1; Influential: 0; Altmetric: 6.5; Score: 33.5
