2602.21779v1 Feb 25, 2026 cs.CV

Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models

Zheyuan Gu (Citations: 19, h-index: 3)
Zhaohong Huang (Citations: 21, h-index: 2)
Xinqi Li (Citations: 2, h-index: 1)
Jiaowei Shao (Citations: 0, h-index: 0)
Chi Zhang (Citations: 35, h-index: 4)
Xuelong Li (Citations: 118, h-index: 5)
Qingsong Zhao (Citations: 136, h-index: 4)
Yusong Wang (Citations: 14, h-index: 1)
Cheng Yuan (Citations: 52, h-index: 2)

Abstract

Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.
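
To make the multiple-choice formulation concrete, the following is a minimal sketch of how items from the three levels could be represented and scored with per-level accuracy. Only the three level names come from the abstract; the `MCItem` fields, the `accuracy_by_level` helper, and the toy questions are hypothetical illustrations and do not reflect the actual FAQ data format or evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

# Level names follow the abstract; the identifiers themselves are assumptions.
LEVELS = (
    "facial_perception",
    "temporal_deepfake_grounding",
    "forensic_reasoning",
)


@dataclass(frozen=True)
class MCItem:
    """One multiple-choice question (hypothetical schema)."""
    level: str                 # one of LEVELS
    question: str
    options: Tuple[str, ...]
    answer: int                # index of the correct option


def accuracy_by_level(
    items: Sequence[MCItem], predictions: Sequence[int]
) -> Dict[str, float]:
    """Return the fraction of correctly answered items for each level present."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, pred in zip(items, predictions):
        if item.level not in LEVELS:
            raise ValueError(f"unknown level: {item.level}")
        total[item.level] += 1
        correct[item.level] += int(pred == item.answer)
    # Skip levels with no items to avoid dividing by zero.
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}


# Toy items, invented purely for illustration.
demo_items = [
    MCItem("facial_perception", "Which region shows a blending artifact?",
           ("mouth", "hairline", "none"), 1),
    MCItem("temporal_deepfake_grounding", "In which frames does the flicker occur?",
           ("1-4", "5-8", "9-12"), 2),
    MCItem("temporal_deepfake_grounding", "When does the lip motion desynchronize?",
           ("start", "middle", "end"), 0),
    MCItem("forensic_reasoning", "Is the video authentic?",
           ("real", "fake"), 1),
]
demo_predictions = [1, 0, 0, 1]  # chosen option index per item

demo_scores = accuracy_by_level(demo_items, demo_predictions)
```

Reporting accuracy per level rather than as a single number keeps the progression from static perception to temporal grounding to a final verdict visible, which is the point of the hierarchy described above.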

0 Citations

0 Influential

2.5 Altmetric

12.5 Score
