2601.16225v1 Jan 16, 2026 eess.AS

ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation

Shi Feng (Citations: 67, h-index: 5)
Xiaocui Yang (Citations: 682, h-index: 12)
Daling Wang (Citations: 2,384, h-index: 24)
Yifei Zhang (Citations: 1,664, h-index: 20)
Zhuoyue Gao (Citations: 0, h-index: 0)
Xiaohui Wang (Citations: 69, h-index: 4)
Wen Zhang (Citations: 0, h-index: 0)

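Empathetic speech dialogue depends not only on understanding linguistic content but also on perceiving rich paralinguistic information such as prosody, tone, and emotional intensity. Existing speech-to-speech large language models typically either rely on ASR transcription or use encoders to extract latent representations, which tends to weaken affective information and contextual coherence in multi-turn dialogue. To address this, we propose ES4R, a framework for speech-based empathetic response generation. Our core innovation is to model structured affective context explicitly before speech encoding, rather than leaving it to implicit learning by the encoder or relying on explicit emotion supervision. Specifically, we introduce a dual-level attention mechanism that captures turn-level affective states and dialogue-level affective dynamics. The resulting affective representations are integrated with textual semantics through speech-guided cross-modal attention to generate empathetic responses. For speech output, we use energy-based strategy selection and style fusion to achieve empathetic speech synthesis. ES4R outperforms strong baselines in both automatic and human evaluations and remains robust across different LLM backbones.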

Original Abstract

Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information such as prosody, tone, and emotional intensity for affective understanding. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations, often weakening affective information and contextual coherence in multi-turn dialogues. To address this, we propose ES4R, a framework for speech-based empathetic response generation. Our core innovation lies in explicitly modeling structured affective context before speech encoding, rather than relying on implicit learning by the encoder or explicit emotion supervision. Specifically, we introduce a dual-level attention mechanism to capture turn-level affective states and dialogue-level affective dynamics. The resulting affective representations are then integrated with textual semantics through speech-guided cross-modal attention to generate empathetic responses. For speech output, we employ energy-based strategy selection and style fusion to achieve empathetic speech synthesis. ES4R consistently outperforms strong baselines in both automatic and human evaluations and remains robust across different LLM backbones.
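
The dual-level attention idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the function names, the single shared `query` vector, the causal restriction to earlier turns, and the random "frame features" below are all assumptions made for illustration. A real system would presumably use learned projections and a trained speech front end. The sketch only shows the two-stage shape of the computation: pool frames within each turn, then attend across turns.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Combine equally sized vectors using the given weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def turn_level_pool(frames, query):
    """Turn level (assumed form): attention-pool one turn's frame
    features into a single affective state vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(f, query) / scale for f in frames])
    return weighted_sum(weights, frames)

def dialogue_level_attention(turn_states):
    """Dialogue level (assumed form): each turn attends to itself and
    all earlier turns, giving a history-aware state per turn."""
    scale = math.sqrt(len(turn_states[0]))
    contextual = []
    for i, current in enumerate(turn_states):
        history = turn_states[: i + 1]
        weights = softmax([dot(current, past) / scale for past in history])
        contextual.append(weighted_sum(weights, history))
    return contextual

def dual_level_affective_context(dialogue, query):
    turn_states = [turn_level_pool(frames, query) for frames in dialogue]
    return dialogue_level_attention(turn_states)

# Hypothetical input: three turns with 5, 3, and 7 frames of 4-dim features.
random.seed(0)
DIM = 4
dialogue = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(n_frames)]
    for n_frames in (5, 3, 7)
]
query = [random.gauss(0, 1) for _ in range(DIM)]
context = dual_level_affective_context(dialogue, query)
print(len(context), len(context[0]))  # 3 4
```

In this sketch the first turn has no history, so its contextual state equals its pooled turn-level state; later turns mix in earlier turns according to similarity.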
