2601.21463v1 Jan 29, 2026 cs.SD

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Yuankun Xie (Citations: 5, h-index: 2)

Jun Xue (Citations: 2, h-index: 1)

Yi Chai (Citations: 7, h-index: 2)

Yanzhen Ren (Citations: 38, h-index: 3)

Jinsheng He (Citations: 2, h-index: 1)

Zhiqiang Tang (Citations: 177, h-index: 5)

Zhuolin Yi (Citations: 2, h-index: 1)

Yihuan Huang (Citations: 9, h-index: 2)

Yujie Chen (Citations: 93, h-index: 5)

Original Abstract

Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques that generate seamless acoustic transitions. To address this challenge, we first construct a large-scale bilingual dataset, AiEdit, which leverages large language models to drive precise semantic tampering logic and employs multiple advanced neural speech editing methods for data synthesis, thereby filling the gap of high-quality speech editing datasets. Building upon this foundation, we propose PELM (Prior-Enhanced Audio Large Language Model), the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task. To mitigate the inherent forgery bias and semantic-priority bias observed in existing audio large models, PELM incorporates word-level probability priors to provide explicit acoustic cues, and further designs a centroid-aggregation-based acoustic consistency perception loss to explicitly enforce the modeling of subtle local distribution anomalies. Extensive experimental results demonstrate that PELM significantly outperforms state-of-the-art methods on both the HumanEdit and AiEdit datasets, achieving equal error rates (EER) of 0.57% and 9.28% (localization), respectively.
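The abstract reports results as equal error rates (EER). As a reminder of what that metric measures, the following is a minimal sketch of an EER computation in plain Python. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely edited", and the use of one score per unit (for localization, presumably one per frame or word) are all assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumptions (illustrative, not from the paper):
      - scores: floats, higher means "more likely edited"
      - labels: 1 = edited (positive), 0 = genuine (negative)
    Returns (eer, threshold) at the point where the false acceptance
    rate and false rejection rate are closest.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one edited and one genuine example")

    best = None  # (gap, eer estimate, threshold)
    for t in sorted(set(scores)):
        # Genuine units flagged as edited when predicting "edited" for s >= t.
        far = sum(s >= t for s in neg) / len(neg)
        # Edited units missed at the same threshold.
        frr = sum(s < t for s in pos) / len(pos)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores for eight units; labels mark which ones were actually edited.
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    labels = [1, 1, 0, 1, 0, 1, 0, 0]
    eer, threshold = equal_error_rate(scores, labels)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER means the two error types balance at a lower rate. Because this sketch only evaluates thresholds at observed scores, it averages the two rates at the closest point instead of interpolating; a production evaluation would typically interpolate along the ROC curve.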
