2606.06357v1 Jun 04, 2026 cs.SD

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

Dinghao Zhou
Xingchen Song
Di Wu
Peng Cheng
Sixian Lv
Shengfan Shen

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch makes it hard to build a single audio tokenizer that supports both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck replaces KL-based variational training with channel normalization and stochastic perturbation, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.
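The bottleneck idea in the abstract (normalize the latent channels, then perturb them stochastically instead of applying a KL penalty) can be illustrated with a small sketch. The abstract does not specify the normalization axis, the noise distribution, or the noise scale, so everything here is an assumption for illustration: per-frame zero-mean, unit-variance normalization across channels, additive Gaussian noise applied only in training, and the hypothetical names `channel_normalize`, `noisy_bottleneck`, and `noise_std`. It is not the authors' implementation.

```python
import math
import random


def channel_normalize(frame, eps=1e-5):
    """Scale one latent frame to zero mean and unit variance across channels.

    Assumption: normalization is per frame over the channel axis.
    """
    mean = sum(frame) / len(frame)
    var = sum((x - mean) ** 2 for x in frame) / len(frame)
    scale = math.sqrt(var + eps)
    return [(x - mean) / scale for x in frame]


def noisy_bottleneck(latents, noise_std=0.1, training=True, rng=None):
    """Normalize each frame, then add Gaussian noise during training only.

    Assumption: the stochastic perturbation is additive Gaussian noise with a
    fixed standard deviation; the paper may use a different scheme.
    """
    rng = rng or random.Random()
    out = []
    for frame in latents:
        z = channel_normalize(frame)
        if training and noise_std > 0:
            z = [x + rng.gauss(0.0, noise_std) for x in z]
        out.append(z)
    return out


# Two toy latent frames with four channels each and very different scales.
latents = [[0.5, -1.2, 3.0, 0.1], [10.0, 12.0, 8.0, 11.0]]
clean = noisy_bottleneck(latents, training=False)
noisy = noisy_bottleneck(latents, noise_std=0.1, rng=random.Random(0))
```

In this sketch the normalization step fixes the latent scale regardless of the encoder's raw output range, which is one plausible reading of "scale-controlled"; the noise is disabled at inference so the decoder and any autoregressive model see deterministic targets.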
