2601.02983v1 Jan 06, 2026 cs.SD

Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning

Yuankun Xie (Citations: 330, h-index: 11)
Xiaoxuan Guo (Citations: 9, h-index: 2)
Jiayi Zhou (Citations: 13, h-index: 3)
Tao Wang (Citations: 489, h-index: 10)
Jian Liu (Citations: 9, h-index: 2)
Ruibo Fu, CASIA (Citations: 1,380, h-index: 16)
Xiaopeng Wang (Citations: 182, h-index: 8)
Haonan Cheng (Citations: 4, h-index: 1)
Long Ye (Citations: 118, h-index: 5)

Abstract

Recent advances in audio large language models (ALLMs) have made high-quality synthetic audio widely accessible, increasing the risk of malicious audio deepfakes across speech, environmental sounds, singing voice, and music. Real-world audio deepfake detection (ADD) therefore requires all-type detectors that generalize across heterogeneous audio and provide interpretable decisions. Given the strong multi-task generalization ability of ALLMs, we first investigate their performance on all-type ADD under both supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). However, SFT using only binary real/fake labels tends to reduce the model to a black-box classifier, sacrificing interpretability. Meanwhile, vanilla RFT under sparse supervision is prone to reward hacking and can produce hallucinated, ungrounded rationales. To address this, we propose an automatic annotation and polishing pipeline that constructs Frequency-Time structured chain-of-thought (CoT) rationales, producing ~340K cold-start demonstrations. Building on this CoT data, we propose Frequency Time-Group Relative Policy Optimization (FT-GRPO), a two-stage training paradigm that cold-starts ALLMs with SFT and then applies GRPO under rule-based frequency-time constraints. Experiments demonstrate that FT-GRPO achieves state-of-the-art performance on all-type ADD while producing interpretable, FT-grounded rationales. The data and code are available online.
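To make the second training stage more concrete, here is a minimal Python sketch of how a rule-based reward could be combined with GRPO's group-relative advantage. The abstract does not give the actual reward rules or output format, so the tag names (`<frequency>`, `<time>`, `<answer>`), the reward weights, and the function names `rule_based_reward` and `group_relative_advantages` are all assumptions made for illustration. Only the advantage normalization (each reward minus the group mean, divided by the group standard deviation) follows the standard GRPO formulation.

```python
import re
import statistics

# Hypothetical output format: the abstract only says rationales are
# "Frequency-Time structured"; these tag names are an assumption.
FT_PATTERN = re.compile(
    r"<frequency>(.+?)</frequency>\s*"
    r"<time>(.+?)</time>\s*"
    r"<answer>(real|fake)</answer>",
    re.DOTALL,
)


def rule_based_reward(completion: str, label: str) -> float:
    """Score one sampled completion with simple rules.

    Illustrative weights (not from the paper): 0.5 for following the
    frequency-time structure, plus 1.0 if the final answer is correct.
    """
    match = FT_PATTERN.search(completion)
    if match is None:
        # No structured rationale: no reward, even if the word
        # "fake" or "real" appears somewhere in the text.
        return 0.0
    format_reward = 0.5
    accuracy_reward = 1.0 if match.group(3) == label else 0.0
    return format_reward + accuracy_reward


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO advantage: normalize rewards within one group
    of completions sampled for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: three completions sampled for one clip whose label is "fake".
group = [
    "<frequency>energy gap above 8 kHz</frequency>"
    "<time>abrupt onset near 1.2 s</time><answer>fake</answer>",
    "<frequency>smooth spectrum</frequency>"
    "<time>natural pauses</time><answer>real</answer>",
    "fake",  # unstructured answer, earns nothing under these rules
]
rewards = [rule_based_reward(c, "fake") for c in group]
advantages = group_relative_advantages(rewards)
```

Under these assumed rules, an answer that is correct but unstructured scores lower than a structured one, which is one way a format constraint can discourage the ungrounded rationales the abstract attributes to vanilla RFT.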

4 Citations

0 Influential

8 Altmetric

44.0 Score
