2606.06357v1 Jun 04, 2026 cs.SD

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

Dinghao Zhou

Citations: 25

h-index: 3

Xingchen Song

Citations: 29

h-index: 3

Di Wu

Citations: 32

h-index: 3

Peng Cheng

Citations: 11

h-index: 2

Sixian Lv

Citations: 5

h-index: 1

Shengfan Shen

Citations: 2

h-index: 1

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets

0 Citations

0 Influential

1.5 Altmetric

7.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!