2602.06602v1 Feb 06, 2026 cs.SD

Scaling Speech Tokenizers with Diffusion Autoencoders

Arthur Hinsvark

Q. He

Yuancheng Wang

Zhenyu Tang

Yun Wang

Yingru Liu

Yinghao Li

Kainan Peng

Junyi Ao

The Chinese University of Hong Kong, Shenzhen

Mingbo Ma

Mike Seltzer

Xubo Liu

Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing the trade-off between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantically rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction, and generation tasks, at an extremely low token rate of $12.5$ Hz and a bit rate of 200 bits per second.
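
The two rates quoted in the abstract imply a per-token bit budget, which the short sketch below works out. The function names are hypothetical helpers introduced for illustration. The abstract does not say how the bits are organized, so the codebook size shown here rests on the assumption of a single, uniformly coded codebook and may not match the paper's actual quantizer.

```python
def bits_per_token(bitrate_bps: float, token_rate_hz: float) -> float:
    """Bits carried by each token, given a bit rate and a token rate."""
    return bitrate_bps / token_rate_hz


def tokens_for_duration(seconds: float, token_rate_hz: float) -> float:
    """Number of tokens needed to cover an utterance of the given length."""
    return seconds * token_rate_hz


# Figures taken from the abstract: 200 bits per second at 12.5 tokens per second.
bpt = bits_per_token(200.0, 12.5)

# Assumption (not stated in the abstract): one codebook, fixed-length codes.
# Under that assumption, the bit budget corresponds to this many codebook entries.
implied_vocab_size = 2 ** int(bpt)

# Sequence length for a 10-second utterance at the stated token rate.
tokens_10s = tokens_for_duration(10.0, 12.5)

print(f"bits per token: {bpt}")
print(f"implied codebook size (single-codebook assumption): {implied_vocab_size}")
print(f"tokens for 10 s of speech: {tokens_10s}")
```

At these rates, a 10-second utterance maps to a sequence of only 125 tokens, which is the practical appeal of a low token rate for a downstream speech language model.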
