2603.24326v1 Mar 25, 2026 cs.CV

Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

Dianhai Yu (Citations: 4,251; h-index: 27)
Cheng Cui (Citations: 281; h-index: 5)
Changda Zhou (Citations: 160; h-index: 4)
Ting Sun (Citations: 183; h-index: 5)
Suyin Liang (Citations: 60; h-index: 2)
Tingquan Gao (Citations: 343; h-index: 5)
Zelun Zhang (Citations: 158; h-index: 3)
Jiaxuan Liu (Citations: 157; h-index: 3)
Xueqing Wang (Citations: 160; h-index: 4)
Hongen Liu (Citations: 165; h-index: 4)
Manhui Lin (Citations: 156; h-index: 3)
Yue Zhang (Citations: 169; h-index: 4)
Yubo Zhang (Citations: 158; h-index: 3)
Jing Zhang (Citations: 627; h-index: 9)
Jun Zhang (Citations: 36; h-index: 3)
Xing Wei (Citations: 10; h-index: 1)
Yi Liu (Citations: 26; h-index: 3)
Yanjun Ma (Citations: 1,882; h-index: 16)

Abstract

Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial redundancy among the visual regions of document images, such as background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM), which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs so that it avoids processing the entire large image directly. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, is strongly competitive with top-tier VLMs, and delivers fast inference while using substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.
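The token-count argument in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the paper's VRFM or any PaddleOCR API. It assumes a ViT-style encoder that splits the image into square patches, and the patch size (14), the page size, and the region boxes are all hypothetical values chosen for illustration. Under those assumptions, doubling the image side length quadruples the number of vision tokens, while encoding only a few cropped regions needs noticeably fewer tokens than encoding the full page.

```python
import math


def vision_tokens(width, height, patch=14):
    """Patch tokens for one image, assuming square patches of side `patch` (an assumption)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def region_tokens(regions, patch=14):
    """Total tokens when only the given (x0, y0, x1, y1) crops are encoded."""
    return sum(
        vision_tokens(x1 - x0, y1 - y0, patch)
        for x0, y0, x1, y1 in regions
    )


# Hypothetical page size in pixels.
page_w, page_h = 1680, 2240

# Cost of encoding the whole page, and the same page at twice the side length.
full = vision_tokens(page_w, page_h)
doubled = vision_tokens(2 * page_w, 2 * page_h)

# Hypothetical "valid" regions (e.g. a title, a text block, a table);
# a real coarse stage would predict these boxes instead of hard-coding them.
regions = [
    (140, 140, 1540, 420),
    (140, 560, 1540, 1400),
    (140, 1540, 840, 2100),
]
focused = region_tokens(regions)

print(f"full page tokens:      {full}")
print(f"2x resolution tokens:  {doubled} ({doubled / full:.0f}x)")
print(f"region-only tokens:    {focused} ({focused / full:.0%} of full page)")
```

The saving in this toy example comes entirely from skipping margins and gaps between the boxes; how much a real document saves depends on its layout and on how the regions are chosen.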

0 Citations

0 Influential

33.5 Altmetric

167.5 Score
