2602.05496v1 Feb 05, 2026 cs.MM

XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning

Hanwen Zhang (Citations: 28, h-index: 2)
Yao Liu (Citations: 12, h-index: 2)
Peiyuan Jiang (Citations: 26, h-index: 3)
Junjie Lang (Citations: 18, h-index: 1)
Yihui He (Citations: 81, h-index: 4)
Yajiao Deng (Citations: 0, h-index: 0)
Siyu Du (Citations: 4, h-index: 1)
Qiao Liu (Citations: 12, h-index: 2)
Jun Xie (Citations: 5, h-index: 1)

Abstract

Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. XEmoGPT incorporates two specialized modules, the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
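The abstract describes EmoCue-360 only as a metric that extracts and matches emotional cues using semantic similarity; it does not give the formula. The sketch below is therefore not the paper's metric. It is a minimal illustration of the general idea of similarity-based cue matching, under loudly stated assumptions: cues are already extracted as short phrases, a bag-of-words cosine stands in for a real semantic similarity model, matching is greedy and one-to-one, and the threshold of 0.5 plus the precision/recall/F1 summary are hypothetical choices. The names `bow_cosine` and `match_cues` are invented for this example.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors.

    Assumption: a crude stand-in for a semantic similarity model.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def match_cues(predicted, reference, threshold=0.5):
    """Greedily pair predicted cues with reference cues, one-to-one.

    The threshold and the P/R/F1 summary are hypothetical choices,
    not values taken from the paper.
    """
    # Score every (predicted, reference) pair, best pairs first.
    pairs = sorted(
        (
            (bow_cosine(p, r), i, j)
            for i, p in enumerate(predicted)
            for j, r in enumerate(reference)
        ),
        reverse=True,
    )
    used_p, used_r, matches = set(), set(), []
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs are all below the threshold
        if i in used_p or j in used_r:
            continue  # each cue may be matched at most once
        used_p.add(i)
        used_r.add(j)
        matches.append((predicted[i], reference[j], score))

    precision = len(matches) / len(predicted) if predicted else 0.0
    recall = len(matches) / len(reference) if reference else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return matches, precision, recall, f1


# Made-up cue phrases, for illustration only.
predicted = ["furrowed brows", "raised voice", "clenched fists"]
reference = ["furrowed brows", "voice raised sharply"]
matches, precision, recall, f1 = match_cues(predicted, reference)
```

In this toy run, two of the three predicted cues find a reference counterpart, so precision is lower than recall. A real system would replace `bow_cosine` with an embedding-based similarity, since word overlap cannot tell that "frowning" and "furrowed brows" describe the same cue.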
