2603.27667v1 Mar 29, 2026 cs.SD

EvA: An Evidence-First Audio Understanding Paradigm for LALMs

Shunian Chen, Xin Xie, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang

Abstract

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extraction than in downstream reasoning, suggesting that the main limitation lies in upstream perception rather than reasoning policy. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA first aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns the aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. We also build EvA-Perception, a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning.
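
The fusion step described in the abstract (aggregate intermediate CED layers, align them to the Whisper timeline, then add the two streams without changing sequence length) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how layers are aggregated or how alignment is done, so the element-wise mean, the linear interpolation, and the function names `aggregate_layers`, `align_to_timeline`, and `fuse` are all assumptions made for illustration. The sketch also assumes both streams already share the same feature dimension, which a real system would likely handle with a learned projection.

```python
from typing import List

Matrix = List[List[float]]  # rows are time frames, columns are feature dims


def aggregate_layers(layers: List[Matrix]) -> Matrix:
    """Average same-shaped layer outputs element-wise.

    Assumption: the mean stands in for whatever aggregation the paper uses.
    """
    n = len(layers)
    frames, dims = len(layers[0]), len(layers[0][0])
    return [[sum(layer[t][d] for layer in layers) / n for d in range(dims)]
            for t in range(frames)]


def align_to_timeline(feats: Matrix, target_len: int) -> Matrix:
    """Linearly interpolate along time so the output has target_len frames.

    Assumption: linear interpolation is one simple way to time-align streams.
    """
    src_len = len(feats)
    if src_len == 1 or target_len == 1:
        return [list(feats[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (src_len - 1) / (target_len - 1)  # fractional source index
        lo = min(int(pos), src_len - 2)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(feats[lo], feats[lo + 1])])
    return out


def fuse(whisper: Matrix, ced_layers: List[Matrix]) -> Matrix:
    """Add time-aligned CED features to Whisper features.

    The output has exactly as many frames as the Whisper input, so the
    sequence length is unchanged (no compression).
    """
    aligned = align_to_timeline(aggregate_layers(ced_layers), len(whisper))
    return [[w + c for w, c in zip(w_row, c_row)]
            for w_row, c_row in zip(whisper, aligned)]


# Toy usage: 5 Whisper frames, two CED layers with 3 frames each, 2 dims.
whisper_feats = [[1.0, 1.0] for _ in range(5)]
ced = [[[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]] * 2
fused = fuse(whisper_feats, ced)
```

Because the CED stream is resampled to the Whisper frame count before the addition, the fused output keeps one vector per Whisper frame, which is the "non-compressive" property the abstract emphasizes.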

