2602.18763v1 Feb 21, 2026 cs.CV

TAG: Thinking with Action Unit Grounding for Facial Expression Recognition

Wentao Zhang (Citations: 9, h-index: 2)

H. Lin (Citations: 0, h-index: 0)

Tianyi Bai (Citations: 207, h-index: 5)

Jiajun Zhang (Citations: 3, h-index: 1)

X. Chang (Citations: 0, h-index: 0)

Sheng Lu (Citations: 1, h-index: 1)

Fangming Gu (Citations: 1, h-index: 1)

Zengjie Hu (Citations: 19, h-index: 1)

Abstract

Facial Expression Recognition (FER) is a fine-grained visual understanding task in which reliable predictions require reasoning over localized, meaningful facial cues. Recent vision-language models (VLMs) enable natural-language explanations for FER, but their reasoning is often ungrounded: they produce fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, which leads to poor robustness across datasets. We propose TAG (Thinking with Action Unit Grounding), a vision-language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces, followed by reinforcement learning with an AU-aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF-DB, FERPlus, and AffectNet, TAG consistently outperforms strong open-source and closed-source VLM baselines while also improving visual faithfulness. Ablation and preference studies further show that AU-grounded rewards stabilize reasoning and mitigate hallucination, demonstrating the importance of structured, grounded intermediate representations for trustworthy multimodal reasoning in FER. The code will be available at https://github.com/would1920/FER_TAG.
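The abstract does not specify the form of the AU-aware reward, so the following is only a minimal sketch of one plausible reading: the regions the model cites for each AU are scored by intersection-over-union (IoU) against boxes from an external AU detector, and that score is added to a label-accuracy term. The function names, the box format, the `weight` parameter, and the rule that unsupported AUs score zero are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def au_grounding_reward(predicted: Dict[str, Box],
                        detected: Dict[str, Box]) -> float:
    """Mean IoU over the AUs the model cited.

    Assumption: an AU cited by the model but absent from the detector
    output contributes 0, which penalizes unsupported (hallucinated) AUs.
    """
    if not predicted:
        return 0.0
    total = 0.0
    for au, box in predicted.items():
        if au in detected:
            total += iou(box, detected[au])
    return total / len(predicted)


def total_reward(pred_label: str, gold_label: str,
                 predicted: Dict[str, Box], detected: Dict[str, Box],
                 weight: float = 0.5) -> float:
    """Label accuracy plus a weighted grounding term (weight is hypothetical)."""
    accuracy = 1.0 if pred_label == gold_label else 0.0
    return accuracy + weight * au_grounding_reward(predicted, detected)


# Illustrative values only: AU12 (lip corner puller) matches the detector
# exactly, while AU6 (cheek raiser) has no detector support.
model_regions = {"AU12": (40.0, 60.0, 80.0, 80.0), "AU6": (0.0, 0.0, 10.0, 10.0)}
detector_regions = {"AU12": (40.0, 60.0, 80.0, 80.0)}
reward = total_reward("happy", "happy", model_regions, detector_regions)
```

Under these assumptions, a rationale that cites an AU the detector does not confirm lowers the grounding term even when the final label is correct, which is one way a reward could discourage fluent but unverifiable reasoning.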

0 Citations

0 Influential

22.5 Altmetric

112.5 Score
