2603.08683v1 Mar 09, 2026 cs.SD

Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

Zachary Novack

Phillip Long

Chris Donahue

Abstract

Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work in practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16 kHz to 48 kHz), and bit depths (8-, 16-, and 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths because the vocabulary grows as $2^{b}$ for bit depth $b$ (65K tokens for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full-resolution audio that improves vocabulary scaling from $O(2^{b})$ to $O(1)$ and enables the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
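The vocabulary arithmetic in the abstract can be illustrated with a minimal sketch. This is not the paper's Trilobyte implementation, whose details the abstract does not give. It only shows the general idea of splitting each sample into bytes so that the token vocabulary stays at 256 regardless of bit depth. The function names, the big-endian byte order, and the offset-binary mapping of signed samples are all assumptions made for illustration.

```python
def sample_vocab_size(bit_depth: int) -> int:
    """Vocabulary size when every possible sample value is its own token."""
    return 2 ** bit_depth


# One token per possible byte value, independent of bit depth.
BYTE_VOCAB_SIZE = 256


def samples_to_byte_tokens(samples, bit_depth):
    """Split signed integer samples into byte tokens in the range 0..255.

    Assumptions (hypothetical, not taken from the paper): samples are signed
    integers, shifted to unsigned offset-binary, then emitted big-endian.
    """
    if bit_depth % 8 != 0:
        raise ValueError("bit_depth must be a multiple of 8")
    n_bytes = bit_depth // 8
    offset = 1 << (bit_depth - 1)  # maps [-2^(b-1), 2^(b-1)) to [0, 2^b)
    tokens = []
    for s in samples:
        if not -offset <= s < offset:
            raise ValueError(f"sample {s} out of range for {bit_depth}-bit audio")
        tokens.extend((s + offset).to_bytes(n_bytes, "big"))
    return tokens


def byte_tokens_to_samples(tokens, bit_depth):
    """Invert samples_to_byte_tokens exactly, so the mapping is lossless."""
    n_bytes = bit_depth // 8
    offset = 1 << (bit_depth - 1)
    return [
        int.from_bytes(bytes(tokens[i:i + n_bytes]), "big") - offset
        for i in range(0, len(tokens), n_bytes)
    ]


if __name__ == "__main__":
    pcm24 = [-8388608, -1, 0, 1, 8388607]
    toks = samples_to_byte_tokens(pcm24, 24)
    print(sample_vocab_size(16), sample_vocab_size(24), BYTE_VOCAB_SIZE)
    print(len(pcm24), "samples ->", len(toks), "tokens")
    print(byte_tokens_to_samples(toks, 24) == pcm24)
```

In this sketch the constant vocabulary comes at the cost of a longer sequence: a $b$-bit sample becomes $b/8$ tokens, so 24-bit audio yields three tokens per sample.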
