2601.22889v1 Jan 30, 2026 cs.CL

DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

Jie Tang (Citations: 28,116; h-index: 8)

Yao Wang (Citations: 789; h-index: 10)

Yuxuan Lou (Citations: 193; h-index: 5)

Yang You (Citations: 129; h-index: 5)

Ziming Wu (Citations: 10; h-index: 2)

Yong Liu (Citations: 10; h-index: 1)

Ying Ren (Citations: 7; h-index: 2)

Fuming Lai (Citations: 10; h-index: 2)

Shaobing Lian (Citations: 10; h-index: 2)

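Current speech language models generate answers directly, without explicit reasoning, so they can make errors that cannot be corrected once the audio has been produced. This paper proposes a new paradigm, "Silent Thought, Spoken Answer," in which a speech language model generates internal text reasoning alongside its spoken answer, and the reasoning trace informs speech quality. To realize it, we present DiffuSpeech, the first diffusion-based speech-text language model that supports both understanding and generation. Unlike autoregressive approaches, DiffuSpeech generates reasoning traces and speech tokens jointly through iterative denoising, using modality-specific masking schedules. We also construct the first speech question-answering dataset with paired text reasoning traces, containing 26,000 samples (319 hours in total). In experiments, DiffuSpeech achieves state-of-the-art speech-to-speech question-answering accuracy, outperforming the best existing baseline by up to 9 points. It also attains the best TTS quality among generative models (6.2% WER) while preserving language understanding (66.2% MMLU). Further analysis confirms that both the diffusion architecture and the reasoning traces contribute to these gains.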

Original Abstract

Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer", a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct \dataset{}, the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.
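
To make "iterative denoising with modality-specific masking schedules" concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the abstract does not give the schedules, so the linear schedule for text and the quadratic schedule for speech are assumptions, and `MASK`, `mask_ratio`, `forward_mask`, `denoise`, and the `predict` callback are hypothetical names invented for illustration. In a real model, `predict` would be the network that proposes a token for every position.

```python
import random

MASK = -1  # hypothetical mask token id, assumed for this sketch


def mask_ratio(t, modality):
    """Fraction of tokens masked at diffusion time t in [0, 1].

    Assumed schedules (not taken from the paper):
    linear for text, quadratic for speech.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    if modality == "text":
        return t
    if modality == "speech":
        return t * t
    raise ValueError(f"unknown modality: {modality}")


def forward_mask(tokens, modalities, t, rng):
    """Mask each token independently with its modality's ratio at time t."""
    return [
        MASK if rng.random() < mask_ratio(t, m) else tok
        for tok, m in zip(tokens, modalities)
    ]


def denoise(modalities, predict, steps, rng):
    """Start fully masked and unmask gradually over `steps` iterations.

    `predict(seq)` is a stand-in for the model: it returns one proposed
    token per position, given the partially masked sequence.
    """
    seq = [MASK] * len(modalities)
    for step in range(steps, 0, -1):
        t_now, t_next = step / steps, (step - 1) / steps
        proposals = predict(seq)
        for i, m in enumerate(modalities):
            if seq[i] != MASK:
                continue  # already committed in an earlier step
            now = mask_ratio(t_now, m)
            # Probability that a masked position stays masked this step.
            keep = mask_ratio(t_next, m) / now if now > 0 else 0.0
            if rng.random() >= keep:
                seq[i] = proposals[i]
    return seq
```

With these assumed schedules, text and speech positions are unmasked at different rates at intermediate steps, and every position is committed by the final step because the mask ratio reaches 0 at t = 0. A usage example would pass a list such as `["text", "text", "speech", "speech"]` as `modalities` together with any function that returns one candidate token per position.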

0 Citations

0 Influential

5 Altmetric

25.0 Score
