2603.13349v1 Mar 07, 2026 cs.CV

MURE: Hierarchical Multi-Resolution Encoding via Vision-Language Models for Visual Document Retrieval

Fuli Feng (Citations: 1,230; h-index: 20)
Wenjie Wang (Citations: 1; h-index: 1)
Fengbin Zhu, National University of Singapore (Citations: 1,289; h-index: 14)
Zijing Cai (Citations: 28; h-index: 3)
Yuzhe Wang (Citations: 20; h-index: 3)
Pengyang Shao (Citations: 309; h-index: 7)
Richang Hong (Citations: 58; h-index: 2)
Tat-Seng Chua (Citations: 2,164; h-index: 21)

Abstract

Visual Document Retrieval (VDR) requires representations that capture both fine-grained visual details and global document structure to ensure retrieval efficacy while maintaining computational efficiency. Existing VDR models struggle to balance effectiveness and efficiency when processing high-resolution documents: they often either lose fine-grained information or generate an excessive number of visual tokens, resulting in significant indexing overhead and high retrieval latency. In this work, we rethink the visual encoding mechanism and propose a new X-VisEmb paradigm that progresses from multi-resolution sampling and encoding, through cross-granularity feature fusion, to adaptive representation distillation. A preliminary study validates its feasibility and effectiveness in capturing complementary visual cues at varying scales. Building on these insights, we develop MURE, a novel framework that employs vision-language models (VLMs) as a hierarchical multi-resolution encoder, integrates resolution-level Matryoshka representation learning (RMRL) for effective feature fusion, and applies a semantic-aware hierarchical clustering mechanism for visual token compression. Experiments on two widely used VDR benchmarks show that MURE consistently surpasses strong baselines. Furthermore, it significantly outperforms ColPali while using only 50% of its visual token budget.
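
The abstract does not spell out how visual tokens are clustered or how queries are scored, so the following is only a rough illustration of the general idea of fitting a page's token embeddings into a smaller budget. It assumes ColPali-style late-interaction (MaxSim) scoring and swaps in a plain greedy centroid-merging scheme as a stand-in; it is not MURE's semantic-aware hierarchical clustering. The names `maxsim_score` and `compress_tokens` and the toy 2-D embeddings are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: each query token takes its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def centroid(cluster):
    """Unit-normalized mean of the vectors in a cluster."""
    dim = len(cluster[0])
    mean = [sum(v[i] for v in cluster) / len(cluster) for i in range(dim)]
    return normalize(mean)


def compress_tokens(tokens, budget):
    """Illustrative stand-in for token compression: repeatedly merge the two
    clusters whose centroids are most similar until `budget` clusters remain,
    then return one centroid per cluster."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    clusters = [[normalize(t)] for t in tokens]
    while len(clusters) > budget:
        cents = [centroid(c) for c in clusters]
        best = None  # (similarity, i, j) of the closest pair found so far
        for i in range(len(cents)):
            for j in range(i + 1, len(cents)):
                s = dot(cents[i], cents[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


# Toy usage: four 2-D "visual tokens" compressed to half the budget.
page_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
compressed = compress_tokens(page_tokens, budget=len(page_tokens) // 2)
query = [[1.0, 0.0]]
score_full = maxsim_score(query, [normalize(t) for t in page_tokens])
score_half = maxsim_score(query, compressed)
```

In this toy setup the halved index keeps one centroid per group of near-duplicate tokens, so the MaxSim score changes only slightly; how well that holds on real document pages depends on the embeddings and the clustering actually used.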
