2602.08057v1 Feb 08, 2026 cs.CV

Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks

Haixu Liu

Yufei Wang

Tianxiang Xu

Chuancheng Shi

Hongsheng Xing

Original Abstract

To tackle the automatic recognition of "concealed emotions" in videos, this paper proposes a multimodal weak-supervision framework and achieves state-of-the-art results on the iMiGUE tennis-interview dataset. First, YOLO 11x detects and crops human portraits frame by frame, and DINOv2-Base extracts visual features from the cropped regions. Next, using combined Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts that serve as weak supervision for downstream models. Subsequently, OpenPose produces 137-dimensional key-point sequences, which are augmented with inter-frame offset features; the usual graph neural network backbone is replaced with a simpler MLP that efficiently models the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer encodes the image and key-point sequences independently, and their representations are concatenated with BERT-encoded interview transcripts. Each modality is first pre-trained in isolation and then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts accuracy from under 0.6 in prior work to over 0.69, establishing a new public benchmark. The study also validates that an "MLP-ified" key-point backbone can match, or even surpass, GCN-based counterparts in this task.
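
The key-point branch and the final fusion step can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract gives no layer sizes, offset definition, or pooling scheme, so the function names (`add_offset_features`, `mlp_encode`, `fuse`), the hidden widths, the 768-dimensional image and text vectors, and the mean pooling that stands in for the ultra-long-sequence Transformer are all assumptions made for illustration. The offset is taken here as the difference between consecutive frames, with a zero offset for the first frame.

```python
import numpy as np


def add_offset_features(keypoints: np.ndarray) -> np.ndarray:
    """Append inter-frame offsets (frame t minus frame t-1) to each frame.

    keypoints has shape (T, D). The first frame receives a zero offset.
    Returns an array of shape (T, 2 * D).
    """
    offsets = np.diff(keypoints, axis=0, prepend=keypoints[:1])
    return np.concatenate([keypoints, offsets], axis=1)


def mlp_encode(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, applied to every frame independently."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


def fuse(image_vec, keypoint_vec, text_vec):
    """Late fusion by simple concatenation of the per-modality vectors."""
    return np.concatenate([image_vec, keypoint_vec, text_vec])


rng = np.random.default_rng(0)
T, D = 8, 137                      # 8 frames, 137-dimensional key points
keypoints = rng.normal(size=(T, D))

features = add_offset_features(keypoints)          # shape (8, 274)

# Randomly initialised weights; hidden sizes 64 and 32 are arbitrary.
w1, b1 = rng.normal(size=(2 * D, 64)) * 0.01, np.zeros(64)
w2, b2 = rng.normal(size=(64, 32)) * 0.01, np.zeros(32)
frame_embeddings = mlp_encode(features, w1, b1, w2, b2)   # shape (8, 32)

# Mean pooling is a placeholder for the sequence Transformer (assumption).
keypoint_vec = frame_embeddings.mean(axis=0)

# Placeholder vectors standing in for DINOv2 and BERT outputs (assumption).
image_vec = rng.normal(size=768)
text_vec = rng.normal(size=768)

fused = fuse(image_vec, keypoint_vec, text_vec)
print(features.shape, frame_embeddings.shape, fused.shape)
```

In a real pipeline the fused vector would feed a classification head, and the randomly initialised weights would be learned during the per-modality pre-training and joint fine-tuning stages described above.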
