2604.19221v1 Apr 21, 2026 cs.AI

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

Biye Li (Citations: 220, h-index: 4)

Yadong Li (Citations: 222, h-index: 6)

Guoxin Wu (Citations: 38, h-index: 3)

Haiping Hou (Citations: 0, h-index: 0)

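Full-duplex speech interaction is the most natural and intuitive mode of human communication and is driving AI toward more human-like conversational systems. Conventional multi-stage speech processing pipelines suffer from serious limitations: accumulated latency, information loss, and error propagation between modules. To address these, recent work has produced end-to-end audio large language models (LLMs) such as GPT-4o, which focus mainly on unifying speech understanding and generation. Most of these models, however, are inherently half-duplex and require separate, task-specific front-end components such as voice activity detection (VAD) and turn-taking detection (TD). While developing a speech assistant, we found that optimizing the speech front-end matters as much as improving the unified model for seamless, responsive interaction. To close this gap, we propose the first unified audio front-end LLM (UAF) designed for full-duplex speech systems. The model recasts diverse audio front-end tasks, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA), as a single autoregressive sequence prediction problem. It takes streaming fixed-duration audio chunks (e.g., 600 ms) as input, uses a reference audio prompt to anchor the target speaker at the start, and sequentially generates discrete tokens that encode both semantic content and system-level state controls (e.g., interruption signals). Experiments show that the model achieves excellent performance across multiple audio front-end tasks and substantially improves response latency and interruption accuracy in real-world interaction scenarios.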

Original Abstract

Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on end-to-end audio large language models (LLMs) like GPT-4o, which primarily unify speech understanding and generation tasks. However, most of these models are inherently half-duplex and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In our development of a speech assistant, we observed that optimizing the speech front-end is as crucial as advancing the back-end unified model for achieving seamless, responsive interactions. To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. Our model reformulates diverse audio front-end tasks, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA), into a single autoregressive sequence prediction problem. It takes streaming fixed-duration audio chunks (e.g., 600 ms) as input, leverages a reference audio prompt to anchor the target speaker at the beginning, and autoregressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals). Experiments demonstrate that our model achieves leading performance across multiple audio front-end tasks and significantly improves response latency and interruption accuracy in real-world interaction scenarios.
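
The chunked streaming loop described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the 16 kHz sample rate, the control-token names (`<silence>`, `<interrupt>`, `<turn_end>`), the zero-padding of the last chunk, and the energy-threshold `fake_model` stub are all hypothetical. Only the fixed 600 ms chunk length, the reference audio prompt, and the mix of semantic and control tokens come from the abstract.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

SAMPLE_RATE_HZ = 16_000  # assumption: the abstract does not state a sample rate
CHUNK_MS = 600           # example chunk duration given in the abstract
CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000

# Hypothetical control-token vocabulary; the real token names are not given.
SILENCE = "<silence>"
INTERRUPT = "<interrupt>"
TURN_END = "<turn_end>"
CONTROL_TOKENS = {SILENCE, INTERRUPT, TURN_END}

# A model maps (reference prompt, token history, audio chunk) to new tokens.
Model = Callable[[Sequence[float], List[str], List[float]], List[str]]


def chunk_stream(samples: Sequence[float],
                 chunk_samples: int = CHUNK_SAMPLES) -> Iterator[List[float]]:
    """Yield fixed-length chunks, zero-padding the final partial chunk."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        yield chunk


def run_front_end(samples: Sequence[float],
                  reference_prompt: Sequence[float],
                  model: Model) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Feed chunks to the model one at a time and split its output into
    semantic text tokens and (chunk_index, control_token) events."""
    history: List[str] = []
    text: List[str] = []
    events: List[Tuple[int, str]] = []
    for index, chunk in enumerate(chunk_stream(samples)):
        tokens = model(reference_prompt, history, chunk)
        history.extend(tokens)  # earlier tokens condition later predictions
        for token in tokens:
            if token in CONTROL_TOKENS:
                events.append((index, token))
            else:
                text.append(token)
    return text, events


def fake_model(reference_prompt: Sequence[float],
               history: List[str],
               chunk: List[float]) -> List[str]:
    """Stand-in for the LLM: a mean-energy threshold, purely for illustration."""
    energy = sum(x * x for x in chunk) / len(chunk)
    return [SILENCE] if energy < 1e-4 else ["word"]


if __name__ == "__main__":
    # One loud 600 ms chunk followed by 300 ms of silence.
    audio = [0.5] * CHUNK_SAMPLES + [0.0] * (CHUNK_SAMPLES // 2)
    print(run_front_end(audio, reference_prompt=[], model=fake_model))
```

In a real system the stub would be replaced by the autoregressive model, and a control event such as an interruption signal would be handed to the dialogue controller as soon as its chunk is decoded.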
