2601.06329v1 Jan 09, 2026 cs.CL

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

Jeff Chan-Jan Sju

Liang-Hsuan Tseng

Yi-Cheng Lin

National Taiwan University

Ju-Chieh Chou

Kai-Wei Chang

MIT CSAIL

Hung-yi Lee

Carlos Busso

Yen-Chun Kuo

Abstract

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
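
For readers unfamiliar with the baseline metric the abstract criticizes, the following is a minimal sketch of the standard text perplexity formulation applied to discrete speech tokens. It assumes that "global" means pooling the log-probabilities of all tokens across all utterances before averaging; the abstract does not spell out the exact definition, so treat this as an illustration, not the authors' implementation. The function name `token_perplexity`, the input layout, and the toy numbers are all hypothetical.

```python
import math
from typing import Sequence


def token_perplexity(log_probs: Sequence[Sequence[float]]) -> float:
    """Perplexity pooled over every token of every utterance.

    log_probs[i][t] is the natural-log probability that a model assigned
    to the t-th speech token of utterance i (hypothetical input layout).
    """
    total_log_prob = sum(sum(utt) for utt in log_probs)
    num_tokens = sum(len(utt) for utt in log_probs)
    if num_tokens == 0:
        raise ValueError("need at least one token")
    # Standard formulation: exp of the negative mean token log-likelihood.
    return math.exp(-total_log_prob / num_tokens)


# Toy values, not taken from the paper: two utterances of speech tokens.
example = [[math.log(0.5), math.log(0.25)], [math.log(0.125)]]
print(token_perplexity(example))  # approximately 4
```

Because every token contributes equally to the pooled average, this quantity treats speech tokens exactly like text tokens, which is the practice the abstract argues against.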
