2604.14967v2 Apr 16, 2026 cs.CV

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Kai Yang (Citations: 18, h-index: 3)

Tiancheng Gu (Citations: 221, h-index: 7)

Zelong Sun (Citations: 41, h-index: 5)

Ziyong Feng (Citations: 418, h-index: 10)

Jun Wang (Citations: 21, h-index: 3)

Zhiwu Lu (Citations: 77, h-index: 4)

Shuo Tan (Citations: 0, h-index: 0)

Yong Zhao (Citations: 5, h-index: 2)

Abstract

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
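
The abstract states that training builds on GRPO and needs no separate value network. A minimal sketch of that idea follows: several rollouts for the same query are scored, and each rollout's advantage is its reward standardized against the group. The reward component names, the equal weights, and the weighted-sum combination are assumptions made for illustration only; the abstract does not specify how the dense rewards are defined or combined, and this is not the authors' implementation.

```python
from statistics import mean, pstdev

# Hypothetical reward components and weights; the abstract does not specify them.
ASSUMED_WEIGHTS = {"retrieval": 0.25, "selection": 0.25, "crop": 0.25, "answer": 0.25}


def combine_rewards(components, weights=ASSUMED_WEIGHTS):
    """Collapse per-action rewards into one scalar via an assumed weighted sum."""
    return sum(weights[name] * components[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same query.

    The group mean acts as the baseline, which is why no learned value
    network is required. Population standard deviation is a choice made here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three illustrative rollouts for one query (made-up scores, not paper data).
group = [
    {"retrieval": 1.0, "selection": 1.0, "crop": 0.5, "answer": 1.0},
    {"retrieval": 1.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 0.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
]
scalar_rewards = [combine_rewards(c) for c in group]
advantages = group_relative_advantages(scalar_rewards)
```

Rollouts that score above the group average receive positive advantages and those below receive negative ones; if every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.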

