2602.13891v1 Feb 14, 2026 cs.SD

GSRM: Generative Speech Reward Model for Speech RLHF

Maohao Shen (Citations: 64; h-index: 2)

T. Jayashankar (Citations: 106; h-index: 6)

Osama Hanna (Citations: 8; h-index: 1)

Naoyuki Kanda (Citations: 6,783; h-index: 35)

Yancheng Wang (Citations: 97; h-index: 2)

Ruiming Xie (Citations: 25; h-index: 2)

Niko Moritz (Citations: 1,502; h-index: 23)

Anfeng Xu (Citations: 133; h-index: 6)

Greg Wornell (Citations: 227; h-index: 9)

Q. He (Citations: 0; h-index: 0)

Jilong Wu (Citations: 133; h-index: 7)

Yashesh Gaur (Citations: 16,572; h-index: 25)

Katerina Zmolíková (Citations: 27; h-index: 3)

Abstract

Recent advances in speech language models, such as GPT-4o Voice Mode and Gemini Live, have demonstrated promising speech generation capabilities. Nevertheless, the aesthetic naturalness of the synthesized audio still lags behind that of human speech. Enhancing generation quality requires a reliable evaluator of speech naturalness. However, existing naturalness evaluators typically regress raw audio to scalar scores, which offers limited interpretability, and they often fail to generalize to speech across different taxonomies. Inspired by recent advances in generative reward modeling, we propose the Generative Speech Reward Model (GSRM), a reasoning-centric reward model tailored for speech. GSRM is trained to decompose speech naturalness evaluation into an interpretable acoustic feature extraction stage followed by feature-grounded chain-of-thought reasoning, enabling explainable judgments. To achieve this, we curated a large-scale human feedback dataset comprising 31k expert ratings and an out-of-domain benchmark of real-world user-assistant speech interactions. Experiments show that GSRM substantially outperforms existing speech naturalness predictors, achieving a model-human correlation on naturalness score prediction that approaches human inter-rater consistency. We further show that GSRM can improve the naturalness of speech LLM generations by serving as an effective verifier for online RLHF.
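The model-human correlation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which correlation statistic is reported, so the example below assumes Pearson correlation. The `pearson` helper, the 1-to-5 rating scale, and all score values are hypothetical illustrations, not data or code from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical naturalness scores for five utterances (1-5 scale assumed).
human_scores = [4.5, 3.0, 2.0, 4.0, 1.5]
model_scores = [4.2, 3.1, 2.4, 3.8, 1.9]

model_human_corr = pearson(model_scores, human_scores)
print(f"model-human Pearson r = {model_human_corr:.3f}")
```

Under the same assumption, applying `pearson` to the scores of two human raters would give an inter-rater figure to compare the model-human value against.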

0 Citations

0 Influential

17.5 Altmetric

87.5 Score
