2602.18527v1 Feb 20, 2026 cs.CV

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Zhan Liu (citations: 4, h-index: 1)
Changli Tang (citations: 960, h-index: 11)
Yuxin Wang (citations: 357, h-index: 4)
Zhiyuan Zhu (citations: 30, h-index: 3)
Youjun Chen (citations: 33, h-index: 3)
Yiwen Shao (citations: 73, h-index: 5)
Tianzi Wang (citations: 44, h-index: 3)
Lei Ke (citations: 80, h-index: 4)
Zengrui Jin (citations: 639, h-index: 14)
Chao Zhang (citations: 456, h-index: 13)

Abstract

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.
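
For readers unfamiliar with first-order ambisonics (FOA), the sketch below illustrates the classical, non-learned intensity vector that direction-of-arrival estimation from FOA is commonly built on. It is not the paper's Neural IV, which the abstract describes only as a learned representation. The function names, the SN3D-style encoding gains, the single-source plane-wave model, and the noise level are all assumptions made for illustration.

```python
import numpy as np

def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as a plane wave in FOA channels (W, X, Y, Z).

    Assumes SN3D-style gains: W carries the signal itself and X, Y, Z carry
    the signal scaled by the direction cosines of the source (angles in radians).
    """
    w = signal
    x = signal * np.cos(azimuth) * np.cos(elevation)
    y = signal * np.sin(azimuth) * np.cos(elevation)
    z = signal * np.sin(elevation)
    return np.stack([w, x, y, z])

def intensity_vector_doa(foa):
    """Estimate (azimuth, elevation) in radians from the classical intensity vector.

    The per-frequency intensity is Re{conj(W) * [X, Y, Z]}; summing it over
    frequency gives a single vector that, under the encoding convention
    above, points toward the source.
    """
    spec = np.fft.rfft(foa, axis=-1)                     # shape (4, F)
    w, xyz = spec[0], spec[1:]
    intensity = np.real(np.conj(w) * xyz).sum(axis=-1)   # shape (3,)
    ix, iy, iz = intensity
    azimuth = np.arctan2(iy, ix)
    elevation = np.arctan2(iz, np.hypot(ix, iy))
    return azimuth, elevation

# Hypothetical test scene: one white-noise source plus mild sensor noise.
rng = np.random.default_rng(0)
source = rng.standard_normal(16000)
foa = encode_foa(source, np.deg2rad(40.0), np.deg2rad(-15.0))
foa = foa + 0.05 * rng.standard_normal(foa.shape)

az, el = intensity_vector_doa(foa)
print(np.rad2deg(az), np.rad2deg(el))
```

With a single dominant source the summed vector aligns closely with the true direction. With several overlapping sources it points toward an energy-weighted mixture of their directions, which is the kind of adverse scenario the abstract says the learned representation is meant to handle.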
