2603.10827v1 Mar 11, 2026 cs.SD

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

N. Dehak

Thomas Thebaud

L. Moro-Velázquez

J. Villalba

Yuzhe Wang

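Large language models (LLMs) that take speech as input are trained mainly on objectives that target linguistic content or specific attributes such as emotion or the speaker's gender, so it is unclear how well they represent speaker information. This work proposes a model-agnostic evaluation protocol that produces continuous verification scores for both API-only and open-weight models. Evaluating recent speech-aware LLMs with this protocol shows weak speaker discrimination (EER above 20% on VoxCeleb1). The work also introduces a lightweight augmentation that gives an LLM speaker verification capability by injecting ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. Applied to TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while keeping a natural-language interface.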

Original Abstract

Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
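The scoring idea in the abstract, a log-likelihood ratio from the Yes/No token probabilities followed by an EER over the trial scores, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `llr_score` and `equal_error_rate`, the `eps` clamp, and the simple threshold sweep are all assumptions made here, and the abstract does not specify the prompt or how the probabilities are normalized.

```python
import math


def llr_score(p_yes: float, p_no: float, eps: float = 1e-12) -> float:
    """Hypothetical helper: continuous verification score from the
    probabilities a model assigns to the "Yes" and "No" answer tokens.

    Returns log P(Yes) - log P(No). The eps clamp (an assumption) avoids
    log(0) when a probability underflows.
    """
    return math.log(max(p_yes, eps)) - math.log(max(p_no, eps))


def equal_error_rate(target_scores, nontarget_scores):
    """Hypothetical helper: approximate EER by sweeping every observed
    score as a threshold and picking the point where the false acceptance
    rate (FAR) and false rejection rate (FRR) are closest.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # Non-target (different-speaker) trials wrongly accepted.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # Target (same-speaker) trials wrongly rejected.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    same_speaker = [2.0, 1.5, 0.3, 1.0]
    different_speaker = [-1.0, 0.5, -0.2, -2.0]
    print(llr_score(0.8, 0.2))
    print(equal_error_rate(same_speaker, different_speaker))
```

Because the score is continuous rather than a hard Yes/No decision, the same sweep applies whether the probabilities come from an open-weight model's logits or from confidence values returned by an API.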
