2604.20842v1 Apr 22, 2026 cs.CL

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

Weiji Zhuang (Citations: 325, h-index: 7)

Dong Zhang (Citations: 1,702, h-index: 14)

Ruohan Liu (Citations: 5, h-index: 1)

Tao Wang (Citations: 136, h-index: 3)

Shuhuai Ren, Peking University (Citations: 3,944, h-index: 20)

Ran He (Citations: 479, h-index: 5)

Caifeng Shan (Citations: 481, h-index: 5)

Chaoyou Fu (Citations: 783, h-index: 13)

Shukang Yin (Citations: 1,920, h-index: 9)

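Paralinguistic cues are essential to natural human-computer interaction, but evaluating them in Large Audio-Language Models (LALMs) is hampered by coarse feature coverage and the subjectivity of assessment. To address this, we propose SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. The benchmark expands coverage from fewer than 50 features to more than 100 fine-grained features, includes more than 1,000 English-Chinese parallel speech queries, and is organized into three progressively harder tasks: fine-grained control, intra-utterance variation, and context-aware adaptation. For reliable evaluation, we build a pairwise comparison pipeline in which an LALM-based judge compares each candidate response against a fixed baseline. Grounding evaluation in relative preference rather than absolute scores mitigates subjectivity and gives more stable, scalable assessment without costly human annotation. Extensive experiments show substantial limitations in current LALMs: even leading proprietary models struggle with comprehensive static control and dynamic modulation of paralinguistic features, and misinterpretation of paralinguistic cues accounts for 43.3% of errors in situational dialogue. These results underscore the need for more robust paralinguistic modeling in building human-aligned voice assistants.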

Original Abstract

Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech queries, and is organized into three progressively challenging tasks: fine-grained control, intra-utterance variation, and context-aware adaptation. To enable reliable evaluation, we further develop a pairwise comparison pipeline, in which candidate responses are evaluated against a fixed baseline by an LALM-based judge. By framing evaluation as relative preference rather than absolute scoring, this approach mitigates subjectivity and yields more stable and scalable assessments without costly human annotation. Extensive experiments reveal substantial limitations in current LALMs. Even leading proprietary models struggle with comprehensive static control and dynamic modulation of paralinguistic features, while failure to correctly interpret paralinguistic cues accounts for 43.3% of errors in situational dialogue. These findings underscore the need for more robust paralinguistic modeling toward human-aligned voice assistants.
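
The pairwise evaluation described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how verdicts are aggregated, so everything here is an assumption: the function name `win_rate`, the three verdict labels, the choice to count a tie as half a win, and the toy judge that stands in for an LALM-based judge are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: given (candidate, baseline), return the
# verdict from the candidate's point of view. A real judge would wrap a
# call to an LALM; that call is not shown because the source does not
# describe it.
Judge = Callable[[str, str], str]
VERDICTS = {"win", "tie", "loss"}


def win_rate(pairs: Iterable[Tuple[str, str]], judge: Judge) -> float:
    """Aggregate pairwise verdicts into a win rate against the baseline.

    Assumption: a tie counts as half a win. The paper may aggregate
    differently.
    """
    counts: Counter = Counter()
    for candidate, baseline in pairs:
        verdict = judge(candidate, baseline)
        if verdict not in VERDICTS:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        counts[verdict] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return (counts["win"] + 0.5 * counts["tie"]) / total


def toy_judge(candidate: str, baseline: str) -> str:
    # Demonstration-only rule: prefer the longer string. It has nothing
    # to do with paralinguistic quality.
    if len(candidate) > len(baseline):
        return "win"
    if len(candidate) < len(baseline):
        return "loss"
    return "tie"


# One win, one loss, one tie against the same fixed baseline.
pairs = [("cand_a_long", "base"), ("c", "base"), ("same", "base")]
rate = win_rate(pairs, toy_judge)  # (1 + 0.5 * 1) / 3 = 0.5
```

Because every candidate is compared against the same baseline, the resulting rates for different systems are directly comparable, which is the property the abstract attributes to relative rather than absolute scoring.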

