2603.09853v1 Mar 10, 2026 cs.SD

SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

Sanmi Koyejo

Citations: 3,864

h-index: 23

Laya Iyer

Citations: 31

h-index: 2

Angelina Wang

Citations: 148

h-index: 5

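Advances in large language models (LLMs) have driven substantial progress in audio processing, giving rise to state-of-the-art models now known as Large Audio Language Models (LALMs). However, little work has measured audio understanding beyond automatic speech recognition (ASR). This paper proposes SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), a benchmark suite that evaluates broad audio understanding across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. The four categories were chosen based on understudied needs in accessibility technology and industrial noise monitoring. Model latency is measured in addition to performance. The suite aims to evaluate not only which words are spoken but also how they are spoken, along with the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we also validate the benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. Evaluating five state-of-the-art LALMs reveals critical gaps: performance varies widely across tasks, falling below random chance on some and reaching high accuracy on others. These results point to concrete directions for improving model capabilities.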

Original Abstract

Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess audio beyond just what words are said - rather, how they are said and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate our benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with some tasks performing below random chance and others achieving high accuracy. These results provide direction for targeted improvements in model capabilities.
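
The abstract says the audio samples are synthetically constructed, for example by overlaying two natural audio samples. A minimal sketch of one common way to do such an overlay is shown below. The paper's actual pipeline is not described on this page, so mixing at a target signal-to-noise ratio, the function names `scale_background` and `overlay`, and the `snr_db` parameter are all assumptions made for illustration.

```python
import numpy as np


def scale_background(foreground, background, snr_db):
    """Return the background, length-matched and scaled to a target SNR.

    Hypothetical helper: the SNR-based mixing rule is an assumption,
    not the procedure used by the SCENEBench authors.
    """
    fg = np.asarray(foreground, dtype=np.float64)
    bg = np.asarray(background, dtype=np.float64)
    if fg.size == 0 or bg.size == 0:
        raise ValueError("both signals must be non-empty")

    # Loop the background if it is shorter, then trim to the foreground length.
    if bg.size < fg.size:
        bg = np.tile(bg, int(np.ceil(fg.size / bg.size)))
    bg = bg[: fg.size]

    fg_power = np.mean(fg ** 2)
    bg_power = np.mean(bg ** 2)
    if fg_power == 0.0 or bg_power == 0.0:
        raise ValueError("signals must have non-zero power")

    # Choose gain so that fg_power / (gain**2 * bg_power) == 10**(snr_db / 10).
    gain = np.sqrt(fg_power / (bg_power * 10.0 ** (snr_db / 10.0)))
    return gain * bg


def overlay(foreground, background, snr_db):
    """Overlay two mono signals and rescale only if the sum would clip."""
    fg = np.asarray(foreground, dtype=np.float64)
    mix = fg + scale_background(fg, background, snr_db)
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        mix = mix / peak  # uniform rescaling leaves the SNR unchanged
    return mix


# Usage with synthetic tones standing in for speech and background sound.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
speech_like = 0.5 * np.sin(2 * np.pi * 440.0 * t)
noise_like = 0.1 * np.sin(2 * np.pi * 1000.0 * t[: sample_rate // 2])
mixed = overlay(speech_like, noise_like, snr_db=10.0)
```

Sweeping `snr_db` would be one way to vary how prominent the background sound is relative to the speech, though whether the benchmark controls difficulty this way is not stated here.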

0 Citations

0 Influential

11.5 Altmetric

57.5 Score
