2603.28130v1 Mar 30, 2026 cs.CV


MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios

Yuliang Liu (Citations: 5, h-index: 2)
Jiarui Zhang (Citations: 232, h-index: 8)
Zhang Li (Citations: 1,635, h-index: 9)
Zhi Lin (Citations: 3, h-index: 1)
Qiang Liu (Citations: 710, h-index: 4)
Ziyang Zhang (Citations: 93, h-index: 3)
Shuo Zhang (Citations: 712, h-index: 4)
Zidun Guo (Citations: 57, h-index: 2)
Jiajun Song (Citations: 136, h-index: 2)
Xiang Bai (Citations: 148, h-index: 6)

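This paper introduces the Multilingual Document Parsing Benchmark (MDPBench), the first benchmark for parsing multilingual digital and photographed documents. Document parsing has advanced remarkably, but most of that progress rests on clean, digital, well-formatted pages in a handful of major languages. MDPBench provides a systematic way to evaluate how parsing models perform on digital and photographed documents across diverse scripts and low-resource languages. It comprises 3,400 document images covering 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through expert model labeling, manual correction, and human verification. To ensure fairness and prevent data leakage, the public and private evaluation sets are maintained separately. A comprehensive evaluation of open-source and closed-source models shows that closed-source models (notably Gemini3-Pro) remain relatively stable, whereas open-source models degrade sharply on non-Latin scripts and real-world photographed documents: on average, performance drops by 17.8% on photographed documents and 14.0% on non-Latin scripts. These results expose performance imbalances across languages and conditions and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source code is available at https://github.com/Yuliang-Liu/MultimodalOCR.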

Original Abstract

We introduce Multilingual Document Parsing Benchmark, the first benchmark for multilingual digital and photographed document parsing. Document parsing has made remarkable strides, yet almost exclusively on clean, digital, well-formatted pages in a handful of dominant languages. No systematic benchmark exists to evaluate how models perform on digital and photographed documents across diverse scripts and low-resource languages. MDPBench comprises 3,400 document images spanning 17 languages, diverse scripts, and varied photographic conditions, with high-quality annotations produced through a rigorous pipeline of expert model labeling, manual correction, and human verification. To ensure fair comparison and prevent data leakage, we maintain separate public and private evaluation splits. Our comprehensive evaluation of both open-source and closed-source models uncovers a striking finding: while closed-source models (notably Gemini3-Pro) prove relatively robust, open-source alternatives suffer dramatic performance collapse, particularly on non-Latin scripts and real-world photographed documents, with an average drop of 17.8% on photographed documents and 14.0% on non-Latin scripts. These results reveal significant performance imbalances across languages and conditions, and point to concrete directions for building more inclusive, deployment-ready parsing systems. Source available at https://github.com/Yuliang-Liu/MultimodalOCR.
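The abstract reports average drops (17.8% on photographed documents, 14.0% on non-Latin scripts) without stating here how they are aggregated. As a rough sketch only, one plausible reading is a mean of per-model score differences between a baseline condition and a shifted condition. The function name `average_drop`, the model names, and all scores below are hypothetical placeholders, not the paper's metric or results; whether the paper uses absolute points or relative percentages is an assumption left open.

```python
from statistics import mean


def average_drop(baseline: dict[str, float], shifted: dict[str, float]) -> float:
    """Mean per-model drop (baseline minus shifted), in absolute score points.

    Assumption: the drop is an absolute difference averaged over models.
    The paper may define it differently (e.g. as a relative change).
    """
    common = baseline.keys() & shifted.keys()
    if not common:
        raise ValueError("no models in common between the two conditions")
    return mean(baseline[m] - shifted[m] for m in common)


# Hypothetical placeholder scores on a 0-100 scale, NOT results from the paper.
digital = {"model_a": 80.0, "model_b": 70.0}
photographed = {"model_a": 65.0, "model_b": 50.0}

print(average_drop(digital, photographed))
```

The same helper could be applied to a Latin versus non-Latin split by passing per-model scores for each script group instead of each capture condition.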

2 Citations

0 Influential

58.022071774821 Altmetric

292.1 Score
