2604.04634v1 Apr 06, 2026 cs.CV

Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale

Jingyong Su, Zheng Li, Chenyang Jiang (Harbin Institute of Technology (Shenzhen)), Feng Gao, Qiben Shan, Hang Zhao, Shiyang Zhou, Fan Yang, Shaocong Wu, Yunyang Mo

Original Abstract

The rapid advancement of video generation models has enabled the creation of highly realistic synthetic media, raising significant societal concerns regarding the spread of misinformation. However, current detection methods suffer from critical limitations. They rely on preprocessing operations such as fixed-resolution resizing and cropping, which not only discard subtle, high-frequency forgery traces but also cause spatial distortion and significant information loss. Furthermore, existing methods are often trained and evaluated on outdated datasets that fail to capture the sophistication of modern generative models. To address these challenges, we introduce a comprehensive dataset and a novel detection framework. First, we curate a large-scale dataset of over 140K videos from 15 state-of-the-art open-source and commercial generators, along with the Magic Videos benchmark, designed specifically for evaluating ultra-realistic synthetic content. Second, we propose a novel detection framework built on the Qwen2.5-VL Vision Transformer, which operates natively at variable spatial resolutions and temporal durations. This native-scale approach effectively preserves the high-frequency artifacts and spatiotemporal inconsistencies typically lost during conventional preprocessing. Extensive experiments demonstrate that our method achieves superior performance across multiple benchmarks, underscoring the critical importance of native-scale processing and establishing a robust new baseline for AI-generated video detection.
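
The abstract's claim that fixed-resolution resizing discards high-frequency forgery traces can be illustrated with a small synthetic experiment. The sketch below is not the authors' method or data. It assumes a made-up 64x64 frame, uses a faint pixel-level checkerboard as a hypothetical stand-in for a forgery trace, and models resizing as simple 2x2 box-filter downsampling. The helper names `high_freq_energy` and `box_downsample` are illustrative only.

```python
import numpy as np


def high_freq_energy(frame: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of non-DC spectral energy above `cutoff` cycles per pixel."""
    spectrum = np.abs(np.fft.fft2(frame - frame.mean())) ** 2
    fy = np.fft.fftfreq(frame.shape[0])[:, None]
    fx = np.fft.fftfreq(frame.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = spectrum.sum()
    return float(spectrum[radius > cutoff].sum() / total) if total > 0 else 0.0


def box_downsample(frame: np.ndarray, factor: int) -> np.ndarray:
    """Shrink a frame by averaging non-overlapping factor x factor blocks."""
    h, w = frame.shape
    h, w = h - h % factor, w - w % factor
    blocks = frame[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


# Synthetic "native-scale" frame: smooth scene content plus a faint
# checkerboard (hypothetical stand-in for a high-frequency forgery trace).
size = 64
y, x = np.mgrid[0:size, 0:size]
smooth = np.sin(2 * np.pi * x / size) + np.cos(2 * np.pi * y / size)
artifact = 0.1 * (-1.0) ** (x + y)
native = smooth + artifact

# Assumed preprocessing step: downscale by 2 in each dimension.
resized = box_downsample(native, 2)

native_hf = high_freq_energy(native)
resized_hf = high_freq_energy(resized)
print(f"high-frequency energy share, native:  {native_hf:.6f}")
print(f"high-frequency energy share, resized: {resized_hf:.6f}")
```

In this toy setting the checkerboard averages out within every 2x2 block, so the high-frequency share of the resized frame drops to essentially zero while the smooth content survives. Real forgery artifacts and real resampling filters are more complicated, so this only shows the direction of the effect the abstract describes.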
