2604.05007v1 Apr 06, 2026 cs.SD

Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction

Jia Li


Yinfeng Yu

Xinjiang University


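In Audio-Visual Navigation (AVN), an agent must use visual and auditory information to locate a sound source in a 3D environment. Existing methods, however, often generalize poorly to new environments because they overfit to semantic sound features and to specific training environments. To address this, we propose the **Binaural Difference Attention with Action Transition Prediction (BDATP)** framework, which jointly optimizes perception and policy. Specifically, the **Binaural Difference Attention (BDA)** module explicitly models interaural differences to improve spatial orientation, reducing reliance on semantic categories. In parallel, the **Action Transition Prediction (ATP)** task introduces an auxiliary action prediction objective that acts as a regularization term and mitigates environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets show that BDATP integrates seamlessly into a range of existing models and yields consistent, significant gains. Notably, the framework achieves state-of-the-art success rates in most settings, with an absolute improvement of up to 21.6 percentage points on previously unheard sounds in the Replica dataset. These results demonstrate BDATP's strong generalization and its robustness across navigation architectures.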

Original Abstract

In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training environments. To address these challenges, we propose the **Binaural Difference Attention with Action Transition Prediction (BDATP)** framework, which jointly optimizes perception and policy. Specifically, the **Binaural Difference Attention (BDA)** module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the **Action Transition Prediction (ATP)** task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets demonstrate that BDATP can be seamlessly integrated into various mainstream baselines, yielding consistent and significant performance gains. Notably, our framework achieves state-of-the-art Success Rates across most settings, with a remarkable absolute improvement of up to 21.6 percentage points on the Replica dataset for unheard sounds. These results underscore BDATP's superior generalization capability and its robustness across diverse navigation architectures.
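
The abstract does not say how BDA or ATP are implemented, so the following is only a rough sketch of the two ideas it names, not the authors' method. It assumes that the interaural difference is taken as a per-bin level difference (in dB) between left and right magnitude spectrograms, that attention is a softmax over frequency bins, and that the auxiliary objective is a cross-entropy term added to a policy loss with a weight. All function names, the `weight` parameter, and the array shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def interaural_level_difference(left_mag, right_mag, eps=1e-8):
    """Per-bin level difference in dB; positive where the left channel is louder."""
    return 20.0 * np.log10((left_mag + eps) / (right_mag + eps))

def difference_attention(left_mag, right_mag):
    """Assumed attention scheme: weight frequency bins by |ILD| per time frame.

    Inputs have shape (freq, time). Returns the ILD map, the attention
    weights (each time frame sums to 1 over frequency), and the
    attention-weighted ILD per time frame.
    """
    ild = interaural_level_difference(left_mag, right_mag)
    weights = softmax(np.abs(ild), axis=0)
    attended = np.sum(weights * ild, axis=0)
    return ild, weights, attended

def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    log_probs = np.log(softmax(logits))
    return float(-log_probs[target])

def total_loss(policy_loss, action_logits, target_action, weight=0.1):
    """Policy loss plus a weighted auxiliary action-prediction term (assumed form)."""
    return policy_loss + weight * cross_entropy(action_logits, target_action)

if __name__ == "__main__":
    # Toy binaural input: the left channel is uniformly twice as loud.
    left = np.full((4, 3), 2.0)
    right = np.full((4, 3), 1.0)
    ild, weights, attended = difference_attention(left, right)
    print(attended)  # one positive value per time frame: source is to the left
    print(total_loss(1.0, np.zeros(4), target_action=2, weight=0.5))
```

Because the difference features depend only on the relation between the two channels, they carry directional information regardless of what kind of sound is playing, which is the intuition the abstract gives for reduced reliance on semantic categories.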

