2604.23860v1 Apr 26, 2026 cs.CV

Exploring Audio Hallucination in Egocentric Video Understanding

Xinhao Mei

Varun Nagaraja

Yangyang Shi

Changsheng Zhao

Meta

Ernie Chang

Vikas Chandra

Yunyang Xiong

Ashish Seth

Gregory P. Meyer

Gaël Le Lan

Dinesh Manocha

Zhipeng Cai

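Egocentric videos offer a distinctive setting in which sound provides crucial cues for understanding the user's activities and surroundings, especially when visual information is unstable or occluded because of continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show that these models are prone to audio hallucination, often inferring sounds that are not actually heard from cues that are merely visible. We present a systematic, automated evaluation framework for analyzing audio hallucination in egocentric video. We collect 300 egocentric videos and design 1,000 sound-focused questions to evaluate model outputs. To characterize hallucinations, we propose a taxonomy that distinguishes foreground sounds arising from the user's activities from background ambient sounds. In our evaluation, advanced AV-LLMs such as Qwen2.5 Omni show high hallucination rates, reaching only 27.3% and 39.5% accuracy on questions about foreground and background sounds, respectively. This work highlights the need to measure the reliability of multimodal responses and stresses that robust evaluation of hallucination is essential for developing reliable AV-LLMs.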

Original Abstract

Egocentric videos provide a distinctive setting in which sound serves as a crucial cue to understand user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues that are visible but not heard. We present a systematic and automatic evaluation framework for analyzing audio hallucinations in egocentric video through a targeted question-answering (Q/A) protocol. We curate a dataset of 300 egocentric videos and design 1,000 sound-focused questions to probe model outputs. To characterize hallucinations, we propose a grounded taxonomy that distinguishes between foreground action sounds from the user's activities and background ambient sounds. Our evaluation shows that advanced AV-LLMs, such as Qwen2.5 Omni, exhibit high hallucination rates, achieving only 27.3% and 39.5% accuracy on Q/As related to foreground and background sounds, respectively. With this work, we highlight the need to measure the reliability of multimodal responses, emphasizing that robust evaluation of hallucinations is essential to develop reliable AV-LLMs.
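
The abstract reports accuracy separately for foreground and background sound questions. As a rough illustration only, per-category scoring over a Q/A set might look like the sketch below. The record schema (`QAItem`), the exact-match comparison, and the toy records are all assumptions made for this example; the abstract does not describe the paper's actual scoring procedure or data format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """One sound-focused question (hypothetical schema, not the paper's)."""
    video_id: str
    category: str   # "foreground" or "background"
    expected: str   # reference answer, e.g. "yes" / "no"
    predicted: str  # model answer


def accuracy_by_category(items):
    """Return {category: fraction of items whose prediction matches}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.category] += 1
        # Assumption: case-insensitive exact match decides correctness.
        if item.predicted.strip().lower() == item.expected.strip().lower():
            correct[item.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Toy, made-up records purely for illustration (not data from the paper).
toy_items = [
    QAItem("v1", "foreground", "no", "yes"),
    QAItem("v1", "background", "yes", "yes"),
    QAItem("v2", "foreground", "yes", "yes"),
    QAItem("v2", "background", "no", "yes"),
]
scores = accuracy_by_category(toy_items)
```

Reporting the two categories separately, rather than a single pooled number, keeps a gap such as the 27.3% versus 39.5% figures above visible instead of averaging it away.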
