2601.14620v1 Jan 21, 2026 eess.AS

Extending Human Annotation with Audio-Language Models in Speech Emotion Recognition: An Approach Through Scaling Ambiguity

Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models

Ting Dang

Citations: 34

h-index: 1

Wenda Zhang

Citations: 2

h-index: 1

Hongyu Jin

Citations: 3

h-index: 1

Siyi Wang

Citations: 28

h-index: 3

Zhiqiang Wei

Citations: 13

h-index: 2

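Speech emotion recognition models generally use a single categorical label and overlook the inherent ambiguity of human emotion. Ambiguous Emotion Recognition tries to solve this by representing emotions as probability distributions, but progress is limited by unreliable ground-truth distributions inferred from sparse human annotations. This paper explores whether large audio-language models (ALMs) can ease the annotation bottleneck by generating high-quality synthetic annotations. The authors introduce a framework that uses ALMs to generate Synthetic Perceptual Proxies, which supplement human annotations and improve the reliability of the ground-truth distributions. The validity of these proxies is checked by statistically analyzing their agreement with human distributions, and their effect is evaluated by fine-tuning ALMs on the augmented emotion distributions. In addition, to address class imbalance and enable unbiased evaluation, the authors propose DiME-Aug, a distribution-aware multimodal emotion augmentation strategy. Experiments on the IEMOCAP and MSP-Podcast datasets show that synthetic annotations improve the emotion distributions, particularly in low-ambiguity regions where annotator agreement is high. The benefit decreases, however, for highly ambiguous emotions where human annotators disagree strongly. The study presents the first evidence that ALMs can address annotation scarcity in ambiguous emotion recognition, while stressing that more advanced prompting or generation strategies are needed to handle highly ambiguous cases.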

Original Abstract

Speech Emotion Recognition models typically use single categorical labels, overlooking the inherent ambiguity of human emotions. Ambiguous Emotion Recognition addresses this by representing emotions as probability distributions, but progress is limited by unreliable ground-truth distributions inferred from sparse human annotations. This paper explores whether Large Audio-Language Models (ALMs) can mitigate the annotation bottleneck by generating high-quality synthetic annotations. We introduce a framework leveraging ALMs to create Synthetic Perceptual Proxies, augmenting human annotations to improve ground-truth distribution reliability. We validate these proxies through statistical analysis of their alignment with human distributions and evaluate their impact by fine-tuning ALMs with the augmented emotion distributions. Furthermore, to address class imbalance and enable unbiased evaluation, we propose DiME-Aug, a Distribution-aware Multimodal Emotion Augmentation strategy. Experiments on IEMOCAP and MSP-Podcast show that synthetic annotations enhance emotion distribution, especially in low-ambiguity regions where annotation agreement is high. However, benefits diminish for highly ambiguous emotions with greater human disagreement. This work provides the first evidence that ALMs could address annotation scarcity in ambiguous emotion recognition, but highlights the need for more advanced prompting or generation strategies to handle highly ambiguous cases.
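The abstract does not say how synthetic annotations are combined with human ones or which alignment statistic is used, so the following is only a minimal sketch under stated assumptions: votes are pooled with equal weight, the label set `EMOTIONS` is hypothetical, and Jensen-Shannon divergence stands in for whatever alignment measure the paper actually reports. It shows how a sparse set of human votes becomes a probability distribution and how adding synthetic votes changes it.

```python
import math
from collections import Counter

# Hypothetical label set; the datasets in the paper may use different classes.
EMOTIONS = ("angry", "happy", "neutral", "sad")


def vote_distribution(votes, labels=EMOTIONS):
    """Turn a list of categorical votes into a probability distribution."""
    counts = Counter(votes)
    total = sum(counts[label] for label in labels)
    if total == 0:
        raise ValueError("no votes for any known label")
    return [counts[label] / total for label in labels]


def augment(human_votes, synthetic_votes, labels=EMOTIONS):
    """Pool human and synthetic votes with equal weight (an assumption)."""
    return vote_distribution(list(human_votes) + list(synthetic_votes), labels)


def entropy(p):
    """Shannon entropy in bits; higher values mean a more ambiguous label."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Illustrative votes only, not data from the paper.
human = ["happy", "happy", "neutral"]
synthetic = ["happy", "neutral", "happy", "happy", "sad"]

human_dist = vote_distribution(human)
augmented_dist = augment(human, synthetic)

print("human     :", human_dist, "entropy", round(entropy(human_dist), 3))
print("augmented :", augmented_dist, "entropy", round(entropy(augmented_dist), 3))
print("JS divergence:", round(js_divergence(human_dist, augmented_dist), 3))
```

With only three human votes, each vote moves the distribution by a third, which is the unreliability the abstract refers to. Pooling more votes makes the estimate finer-grained, but it is only as good as the agreement between the synthetic and human annotators.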

2 Citations

0 Influential

1.5 Altmetric

9.5 Score
