2603.05887v1 Mar 06, 2026 eess.AS

Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec

Jihwan Lee (Citations: 68, h-index: 5)
Shrikanth S. Narayanan (Citations: 14, h-index: 2)
N. Dehak (Citations: 12,578, h-index: 48)
Thomas Thebaud (Citations: 113, h-index: 5)
L. Moro-Velázquez (Citations: 1,631, h-index: 23)
Junhyeok Lee (Citations: 305, h-index: 8)
Xiluo He (Citations: 29, h-index: 2)
Helin Wang (Citations: 47, h-index: 4)
J. Villalba (Citations: 2,970, h-index: 28)

Abstract

Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in reconstructed speech. In this work, we demonstrate that a self-supervised representation reconstruction (SSRR) loss fundamentally improves codec training and performance. First, SSRR significantly accelerates convergence, enabling competitive results using only a single GPU. Second, it enhances intelligibility by reconstructing distilled self-supervised representations from codec outputs. Third, SSRR enables high intelligibility without additional lookahead in streaming Transformer-based codecs, allowing a zero-lookahead architecture for real-time deployment. As a result, our JHCodec achieves state-of-the-art performance while maintaining minimal latency and reduced training cost. We open-source the full implementation, training pipeline, and demo on GitHub at https://github.com/jhcodec843/jhcodec.
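
The abstract contrasts distillation on the encoder side with a loss computed on the codec output. A minimal sketch of that second idea follows, under loud assumptions: the abstract does not give the loss formulation, so the L1 distance, the function names `toy_ssl_features` and `ssrr_loss`, and the toy feature extractor (a stand-in for a frozen pretrained speech model) are all hypothetical illustrations, not the JHCodec implementation.

```python
import math
from typing import Callable, List, Sequence

Frame = List[float]


def toy_ssl_features(wave: Sequence[float], frame_len: int = 4) -> List[Frame]:
    """Hypothetical stand-in for a frozen self-supervised encoder.

    Splits the waveform into non-overlapping frames and returns two simple
    statistics per frame (mean and energy). A real system would run a
    pretrained speech model here instead.
    """
    feats: List[Frame] = []
    for start in range(0, len(wave) - frame_len + 1, frame_len):
        frame = wave[start:start + frame_len]
        mean = sum(frame) / frame_len
        energy = sum(s * s for s in frame) / frame_len
        feats.append([mean, energy])
    return feats


def ssrr_loss(
    reference: Sequence[float],
    decoded: Sequence[float],
    extractor: Callable[[Sequence[float]], List[Frame]] = toy_ssl_features,
) -> float:
    """Mean L1 distance between features of the reference and decoded audio.

    The key point is that the features are extracted from the codec OUTPUT
    (the decoded waveform), not from the encoder's latent representation.
    The L1 distance is an assumption made for this sketch.
    """
    ref_feats = extractor(reference)
    dec_feats = extractor(decoded)
    if not ref_feats or len(ref_feats) != len(dec_feats):
        raise ValueError("reference and decoded audio must yield matching frames")
    total = 0.0
    count = 0
    for ref_frame, dec_frame in zip(ref_feats, dec_feats):
        for a, b in zip(ref_frame, dec_frame):
            total += abs(a - b)
            count += 1
    return total / count


# Toy usage: a near-perfect reconstruction scores lower than silence.
reference = [math.sin(0.3 * n) for n in range(16)]
decoded_good = [s + 0.01 for s in reference]
decoded_bad = [0.0] * 16
loss_good = ssrr_loss(reference, decoded_good)
loss_bad = ssrr_loss(reference, decoded_bad)
```

In a training loop, a term like this would presumably be added to the usual reconstruction objectives with some weight; the abstract does not say how the terms are combined, so that weighting is left out here.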
