2601.12260v1 Jan 18, 2026 cs.AI

Docs2Synth: A Synthetic Data Trained Retriever Framework for Scanned Visually Rich Documents Understanding

Yihao Ding (Citations: 7, h-index: 2)

Qiang Sun (Citations: 55, h-index: 5)

Puzhen Wu (Citations: 11, h-index: 3)

Sirui Li (Citations: 62, h-index: 5)

Wei Liu (Citations: 54, h-index: 5)

Siwen Luo (Citations: 225, h-index: 7)

Abstract

Visually rich document understanding (VRDU) in regulated domains is particularly challenging, since scanned documents often contain sensitive, evolving, and domain-specific knowledge. This leads to two major challenges: the lack of manual annotations for model adaptation and the difficulty for pretrained models to stay up-to-date with domain-specific facts. While Multimodal Large Language Models (MLLMs) show strong zero-shot abilities, they still suffer from hallucination and limited domain grounding. In contrast, discriminative Vision-Language Pre-trained Models (VLPMs) provide reliable grounding but require costly annotations to cover new domains. We introduce Docs2Synth, a synthetic-supervision framework that enables retrieval-guided inference for private and low-resource domains. Docs2Synth automatically processes raw document collections, generates and verifies diverse QA pairs via an agent-based system, and trains a lightweight visual retriever to extract domain-relevant evidence. During inference, the retriever collaborates with an MLLM through an iterative retrieval-generation loop, reducing hallucination and improving response consistency. We further deliver Docs2Synth as an easy-to-use Python package, enabling plug-and-play deployment across diverse real-world scenarios. Experiments on multiple VRDU benchmarks show that Docs2Synth substantially enhances grounding and domain generalization without requiring human annotations.
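
The iterative retrieval-generation loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the Docs2Synth package API, which the abstract does not describe. All names here (`score`, `retrieve`, `retrieve_generate_loop`, `stub_generate`) are hypothetical. A token-overlap scorer stands in for the trained visual retriever, a stub function stands in for the MLLM, and the stopping rule (stop when the answer no longer changes) is an assumption.

```python
from typing import Callable, List, Tuple


def score(query: str, passage: str) -> float:
    """Toy relevance: fraction of query tokens that also occur in the passage.
    Assumption: stands in for a trained visual retriever's similarity score."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Return the k passages with the highest toy relevance score."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return ranked[:k]


def retrieve_generate_loop(
    question: str,
    passages: List[str],
    generate: Callable[[str, List[str]], str],
    max_rounds: int = 3,
    k: int = 2,
) -> Tuple[str, List[str]]:
    """Alternate retrieval and generation until the answer stabilises."""
    evidence: List[str] = []
    query = question
    answer = ""
    for _ in range(max_rounds):
        # Retrieval step: add newly found passages to the evidence pool.
        for p in retrieve(query, passages, k):
            if p not in evidence:
                evidence.append(p)
        # Generation step: the generator sees only the retrieved evidence.
        new_answer = generate(question, evidence)
        if new_answer == answer:
            break  # assumed stopping rule: answer unchanged between rounds
        answer = new_answer
        # Feed the draft answer back into the next retrieval query.
        query = f"{question} {answer}"
    return answer, evidence


def stub_generate(question: str, evidence: List[str]) -> str:
    """Hypothetical stand-in for an MLLM: echoes the best-matching evidence."""
    return max(evidence, key=lambda p: score(question, p)) if evidence else ""


passages = [
    "Invoice total: 420 USD",
    "Supplier: Acme Pty Ltd",
    "Due date: 2026-02-01",
]
answer, evidence = retrieve_generate_loop(
    "What is the invoice total?", passages, stub_generate
)
print(answer)
```

In a real deployment, `generate` would wrap an MLLM call and `retrieve` would query the trained retriever over document regions. The loop structure is the only part this sketch aims to convey.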
