2603.27494v1 Mar 29, 2026 cs.CV

Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

Dianmo Sheng (Citations: 155, h-index: 5)
Tao Gong (Citations: 338, h-index: 8)
Nenghai Yu (Citations: 482, h-index: 10)
Xu Zhao (Citations: 118, h-index: 2)
Zhentao Tan (Citations: 1,208, h-index: 16)
Tianxiang Chen (Citations: 319, h-index: 7)
Yao Liu (Citations: 177, h-index: 2)
Yue Wu (Citations: 333, h-index: 7)
Qi Chu (Citations: 1,494, h-index: 21)

Original Abstract

To enhance the perception and reasoning capabilities of multimodal large language models (MLLMs) in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously use an image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation: the model relies strongly on the global input and only weakly on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the "Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning about fine-grained details in MLLMs. Code is available at: https://github.com/XuanPu-Z/LFPC.
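
The abstract does not say how the granularity of the global image is adjusted or what form the grounding loss takes. The sketch below is only one plausible reading, not the authors' implementation: it assumes average pooling to coarsen the global view (so fine detail survives only in a full-resolution crop) and uses 1 - IoU as a stand-in grounding loss against an annotated box. All function names here are hypothetical; the actual method is in the linked repository.

```python
from typing import List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def coarsen(image: Image, factor: int) -> Image:
    """Average-pool by `factor` (assumed way to lower global granularity)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(0, h - h % factor, factor):
        row = []
        for x in range(0, w - w % factor, factor):
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def crop(image: Image, box: Box) -> Image:
    """Cut the region of interest out of the full-resolution image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_loss(pred: Box, target: Box) -> float:
    """Assumed stand-in loss: 0 for a perfect crop, 1 for no overlap."""
    return 1.0 - iou(pred, target)


# The model would see a coarse global view plus a sharp crop it selected.
full = [[0, 0, 4, 4], [0, 0, 4, 4], [8, 8, 12, 12], [8, 8, 12, 12]]
global_view = coarsen(full, 2)
detail_view = crop(full, (2, 0, 4, 2))
```

In this reading, the gap between `global_view` and `detail_view` is what makes the crop informative: the answer-relevant detail is available only if the model crops the right region, and the box-level loss then tightens where it crops.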
