2605.01284v1 May 02, 2026 cs.CV

Chain of Evidence: Pixel-Level Visual Evidence Tracing for Iterative Retrieval-Augmented Generation

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

Di Liang

Citations: 56

h-index: 5

Peiyang Liu

Citations: 124

h-index: 7

Xi Wang

Citations: 149

h-index: 6

Ziqiang Cui

Citations: 151

h-index: 7

Wei Ye

Citations: 31

h-index: 2

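Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful approach for answering complex multi-hop questions by sequentially retrieving and reasoning over external documents. However, current systems operate mainly on parsed text, which creates two critical bottlenecks. (1) *Coarse-grained attribution*: users must manually locate evidence within long documents from vague text-level citations. (2) *Visual semantic loss*: converting visually rich documents (e.g., slides, PDFs with charts) to text discards the spatial logic and layout information essential for reasoning. To close this gap, we present **Chain of Evidence (CoE)**, a retriever-agnostic visual attribution framework that uses Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the entire reasoning process within the retrieved candidate set. We evaluate CoE on two different benchmarks: **Wiki-CoE**, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and **SlideVQA**, a dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments show that a fine-tuned Qwen3-VL-8B-Instruct model substantially outperforms text-based models in scenarios that require understanding of visual layout, and provides a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.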

Original Abstract

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) *Coarse-grained attribution*, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) *Visual semantic loss*, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present **Chain of Evidence (CoE)**, a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: **Wiki-CoE**, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and **SlideVQA**, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.
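
To make the idea of pixel-level attribution concrete, the following is a minimal sketch of how a chain of evidence could be represented as one bounding box per reasoning hop, and how a predicted box might be compared with a reference box. Everything here is an assumption for illustration: the names `EvidenceBox`, `clamp_box`, and `iou` are hypothetical, the `(x_min, y_min, x_max, y_max)` pixel convention is assumed, and intersection-over-union is a common way to score box overlap, not a metric the abstract states. It is not the authors' implementation; see their repository for that.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # assumed convention: (x_min, y_min, x_max, y_max) in pixels


@dataclass(frozen=True)
class EvidenceBox:
    """One hop of evidence: a region on the screenshot of a retrieved candidate."""
    doc_id: str  # hypothetical identifier of the retrieved candidate
    box: Box


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a predicted box to the screenshot bounds and order its corners."""
    x0, y0, x1, y1 = box
    x0, x1 = sorted((max(0, min(x0, width)), max(0, min(x1, width))))
    y0, y1 = sorted((max(0, min(y0, height)), max(0, min(y1, height))))
    return (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two pixel boxes; 0.0 when they do not overlap."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A two-hop chain: each hop points at a region of a different candidate screenshot.
chain = [
    EvidenceBox("page_a", (10, 20, 110, 60)),
    EvidenceBox("page_b", (0, 0, 50, 50)),
]
```

Keeping one box per hop, tagged with the candidate it came from, lets the whole reasoning chain be drawn on the retrieved screenshots regardless of which retriever produced them.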

0 Citations

0 Influential

23.5 Altmetric

117.5 Score
