2601.13948v3 Jan 20, 2026 eess.AS

Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models

N. Kuzmin

Songting Liu

Kong Aik Lee

Chng Eng Siong

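Protecting speaker identity is critical in online voice applications, yet real-time speaker anonymization (SA) is still an under-researched area. Recent work shows that neural audio codecs (NACs) offer strong speaker feature disentanglement and linguistic fidelity. NACs can also be paired with causal language models (LMs) to raise linguistic fidelity and improve prompt control in streaming tasks. Existing NAC-based online LM systems, however, were designed for voice conversion (VC) and lack the techniques needed for privacy protection. Building on these advances, this work proposes Stream-Voice-Anon, which integrates speaker anonymization techniques into a modern causal LM-based NAC architecture specialized for streaming SA. The anonymization approach combines pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies for LM conditioning, all of which exploit the disentanglement properties of quantized content codes to prevent speaker information leakage. The work also compares dynamic and fixed delay configurations to explore the latency-privacy trade-off in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial gains over DarkStream, the previous best-performing streaming method, in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% relative UAR improvement), with comparable latency (180 ms vs. 200 ms) and a similar level of privacy protection against lazy-informed attackers. Against semi-informed attackers, however, a 15% relative degradation is observed.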

Original Abstract

Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codecs (NACs) provide superior speaker feature disentanglement and linguistic fidelity. NACs can also be used with causal language models (LMs) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization and lack the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% relative UAR improvement) compared to the previous state-of-the-art streaming method DarkStream, while maintaining comparable latency (180 ms vs. 200 ms) and comparable privacy protection against lazy-informed attackers, though it shows a 15% relative degradation against semi-informed attackers.
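The abstract reports its gains as relative figures rather than absolute ones. As a minimal sketch of how such relative figures are usually computed, the snippet below uses a hypothetical helper, `relative_change`. The WER and UAR inputs are made-up placeholders chosen only to reproduce the quoted percentages; they are not results from the paper. Only the 180 ms and 200 ms latencies come from the abstract.

```python
def relative_change(baseline: float, system: float) -> float:
    """Signed relative change of `system` with respect to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (system - baseline) / baseline


# Placeholder metric values for illustration only (NOT from the paper).
baseline_wer, system_wer = 10.0, 5.4    # word error rate in percent, lower is better
baseline_uar, system_uar = 50.0, 64.0   # unweighted average recall in percent, higher is better

# A WER *reduction* is the negated relative change, since lower WER is better.
wer_reduction = -relative_change(baseline_wer, system_wer)
uar_gain = relative_change(baseline_uar, system_uar)

# Latencies quoted in the abstract: 180 ms (Stream-Voice-Anon) vs 200 ms (DarkStream).
latency_change = relative_change(200.0, 180.0)

print(f"relative WER reduction: {wer_reduction:.0%}")
print(f"relative UAR gain:      {uar_gain:.0%}")
print(f"relative latency change: {latency_change:.0%}")
```

Under this convention, a 46% relative WER reduction means the new WER is 54% of the baseline WER, whatever the absolute baseline happens to be.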
