2601.02954v2 Jan 06, 2026 cs.SD

The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

Lai Wei

Shanghai Jiao Tong University

Yuhuan You

Xihong Wu

T. Qu

Original Abstract

Existing large audio-language models (LALMs) perceive the world as "mono": a single stream of audio that ignores the critical spatial dimension ("where") required for universal audio scene analysis (ASA). To bridge this gap, we first introduce a hierarchical framework for ASA. Guided by this framework, we build a system that enables LALMs to understand and reason about the complex acoustic world. Our system endows LALMs with universal spatial understanding through four key innovations: (1) a scalable simulation pipeline that synthesizes high-quality First-Order Ambisonics (FOA) data; (2) a unified model framework that integrates universal spatial encoding with a dense hybrid projection mechanism to bridge the modality gap; (3) a progressive training curriculum that evolves from representation alignment to reinforcement learning-based reasoning; and (4) a comprehensive ASA benchmark designed to rigorously evaluate atomic perception, relational integration, and cognitive reasoning capabilities, on which our model demonstrates comparatively strong spatial understanding. Our work provides a clear pathway for leveraging the powerful reasoning abilities of LALMs towards holistic ASA, advancing from "mono" semantic recognition to spatial intelligence.
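
To make the "where" dimension concrete, the sketch below shows how a First-Order Ambisonics signal can carry direction information that a mono stream cannot. It is an illustration only, not the authors' simulation pipeline: it assumes the common AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization), a single anechoic source, and hypothetical helper names `encode_foa` and `estimate_direction`.

```python
import math


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as 4-channel FOA.

    Assumes the AmbiX convention: ACN order (W, Y, Z, X), SN3D gains,
    azimuth measured counter-clockwise from the front.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional
        math.sin(az) * math.cos(el),  # Y: left-right
        math.sin(el),                 # Z: up-down
        math.cos(az) * math.cos(el),  # X: front-back
    )
    return [[g * s for s in mono] for g in gains]


def estimate_direction(foa):
    """Recover source direction from the time-averaged W*X, W*Y, W*Z products.

    This only works cleanly for one dominant source without reverberation.
    """
    w, y, z, x = foa
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    iz = sum(a * b for a, b in zip(w, z))
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation


# A 440 Hz tone sampled at 16 kHz, placed 60 degrees to the left and 20 degrees up.
mono = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
foa = encode_foa(mono, azimuth_deg=60.0, elevation_deg=20.0)
az, el = estimate_direction(foa)
print(f"estimated azimuth={az:.1f} deg, elevation={el:.1f} deg")
```

The W channel alone is exactly the mono signal, so a model that only sees W has no access to the direction encoded in the other three channels. Real scenes with several sources and room reflections would need a far more capable spatial encoder than this closed-form estimate.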
