2601.23066v1 Jan 30, 2026 cs.SD

Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection

Xiaoxuan Guo (Citations: 9, h-index: 2)
Jiayi Zhou (Citations: 13, h-index: 3)
Jian Liu (Citations: 9, h-index: 2)
Long Ye (Citations: 118, h-index: 5)
Yuankun Xie (Citations: 5, h-index: 2)
Haonan Cheng (Citations: 282, h-index: 8)
Hengyan Huang (Citations: 7, h-index: 2)
Qin Zhang (Citations: 5, h-index: 2)

Original Abstract

Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine-grained acoustic artifacts being overlooked during the decision-making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework empowers audio LLMs to more effectively capture subtle acoustic inconsistencies without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.
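
The abstract does not say how the spectrograms are computed or how they are passed to the model, so the following is only an illustrative sketch of pairing raw audio with a time-frequency representation. It is not the authors' implementation. The function names (`log_spectrogram`, `build_inputs`), the STFT parameters (`n_fft`, `hop`), and the dictionary layout are all assumptions made for illustration.

```python
import numpy as np


def log_spectrogram(wave, n_fft=512, hop=128, eps=1e-10):
    """Log-magnitude STFT of shape (frames, n_fft // 2 + 1).

    n_fft and hop are illustrative defaults, not values from the paper.
    """
    wave = np.asarray(wave, dtype=np.float64)
    if wave.size < n_fft:
        # Zero-pad very short clips so at least one frame exists.
        wave = np.pad(wave, (0, n_fft - wave.size))
    window = np.hanning(n_fft)
    n_frames = 1 + (wave.size - n_fft) // hop
    # Slice overlapping frames and apply the window to each one.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Convert to decibels; eps avoids log(0) on silent bins.
    return 20.0 * np.log10(magnitude + eps)


def build_inputs(wave, sample_rate):
    """Bundle the raw waveform with its spectrogram (hypothetical layout)."""
    return {
        "waveform": np.asarray(wave, dtype=np.float64),
        "sample_rate": sample_rate,
        "spectrogram": log_spectrogram(wave),
    }


# Synthetic stand-in for a speech clip: one second of a 1 kHz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

inputs = build_inputs(tone, sr)
# Frequency bin with the highest average energy across all frames.
peak_bin = int(inputs["spectrogram"].mean(axis=0).argmax())
peak_hz = peak_bin * sr / 512
```

Keeping the waveform and the spectrogram side by side mirrors the idea in the abstract that acoustic cues should be explicitly accessible rather than left implicit in the raw signal. How a real audio LLM would consume the second input is left open here.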

2 Citations

0 Influential

4 Altmetric

22.0 Score
