2604.02694v1 Apr 03, 2026 cs.CV

DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning

Changtao Miao (Citations: 573, h-index: 12)

Fanwei Zeng (Citations: 8, h-index: 2)

Jing Huang (Citations: 8, h-index: 2)

Zhiya Tan (Citations: 22, h-index: 3)

Shutao Gong (Citations: 8, h-index: 2)

Xiaomin Yu (Citations: 64, h-index: 3)

Yang Wang (Citations: 6, h-index: 2)

Weibin Yao (Citations: 44, h-index: 4)

Joey Tianyi Zhou (Citations: 41, h-index: 5)

Jianshu Li (Citations: 48, h-index: 4)

Yingdong Yan (Citations: 8, h-index: 1)

Abstract

The rapid progress of generative AI has enabled increasingly realistic text-centric image forgeries, posing major challenges to document safety. Existing forensic methods mainly rely on visual cues and lack evidence-based reasoning to reveal subtle text manipulations. Detection, localization, and explanation are often treated as isolated tasks, limiting reliability and interpretability. To tackle these challenges, we propose DocShield, the first unified framework formulating text-centric forgery analysis as a visual-logical co-reasoning problem. At its core, a novel Cross-Cues-aware Chain of Thought (CCT) mechanism enables implicit agentic reasoning, iteratively cross-validating visual anomalies with textual semantics to produce consistent, evidence-grounded forensic analysis. We further introduce a Weighted Multi-Task Reward for GRPO-based optimization, aligning reasoning structure, spatial evidence, and authenticity prediction. Complementing the framework, we construct RealText-V1, a multilingual dataset of document-like text images with pixel-level manipulation masks and expert-level textual explanations. Extensive experiments show DocShield significantly outperforms existing methods, improving macro-average F1 by 41.4% over specialized frameworks and 23.4% over GPT-4o on T-IC13, with consistent gains on the challenging T-SROIE benchmark. Our dataset, model, and code will be publicly released.
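
The abstract does not spell out how the Weighted Multi-Task Reward is computed, so the following is only a minimal sketch of one plausible shape for it, not the authors' implementation. It assumes three reward terms matching the three aspects the abstract names (reasoning structure, spatial evidence, authenticity prediction), scores spatial evidence with a bounding-box IoU, and then applies the group-relative normalization commonly used in GRPO. The `Rollout` fields, the weights `(0.2, 0.4, 0.4)`, and the use of boxes rather than pixel masks are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Rollout:
    """One sampled model response for a document image (hypothetical fields)."""
    has_valid_structure: bool  # did the output follow the expected reasoning format?
    pred_box: Box              # predicted tampered region
    pred_is_forged: bool       # predicted authenticity label


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def weighted_reward(
    r: Rollout,
    gt_box: Box,
    gt_is_forged: bool,
    weights: Sequence[float] = (0.2, 0.4, 0.4),  # assumed values, not from the paper
) -> float:
    """Weighted sum of structure, localization, and classification terms."""
    w_fmt, w_loc, w_cls = weights
    fmt = 1.0 if r.has_valid_structure else 0.0
    loc = box_iou(r.pred_box, gt_box)
    cls = 1.0 if r.pred_is_forged == gt_is_forged else 0.0
    return w_fmt * fmt + w_loc * loc + w_cls * cls


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(x - mu) / (sigma + eps) for x in rewards]


if __name__ == "__main__":
    gt_box, gt_is_forged = (0.0, 0.0, 10.0, 10.0), True
    group = [
        Rollout(True, (0.0, 0.0, 10.0, 10.0), True),    # well-formed, exact box, correct label
        Rollout(False, (5.0, 0.0, 15.0, 10.0), False),  # malformed, partial overlap, wrong label
    ]
    rewards = [weighted_reward(r, gt_box, gt_is_forged) for r in group]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.3f} advantage={adv:+.3f}")
```

In this sketch a rollout that is well-formed, well-localized, and correctly classified receives a higher reward than its group's mean and therefore a positive advantage, which is the signal a GRPO-style update would reinforce. How DocShield actually weights or defines each term is not stated in the abstract.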
