2603.28618v1 Mar 30, 2026 cs.AI

Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

Chen Qian (Citations: 240, h-index: 7)

Lijun Li (Citations: 3, h-index: 1)

Ziqi Miao (Citations: 20, h-index: 3)

Haonan Jia (Citations: 14, h-index: 2)

Yuanhao Xiong (Citations: 38, h-index: 3)

Wenting Yan (Citations: 0, h-index: 0)

Jing Shao (Citations: 33, h-index: 2)

Original Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.
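
The role-specific reward split described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the Observer's utility reward is computed from the Solver's success, so the code below assumes it is the mean outcome reward over several Solver samples conditioned on the same caption. The names `outcome_reward`, `role_specific_rewards`, `toy_solver`, and `num_solver_samples` are hypothetical, and the exact-match check is a stand-in for whatever answer verifier the paper uses. This is a sketch under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


def outcome_reward(predicted: str, reference: str) -> float:
    """Verifiable outcome reward: 1.0 on a normalized exact match, else 0.0.

    Assumption: exact match stands in for the paper's answer verifier.
    """
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


@dataclass
class RoleRewards:
    observer_reward: float       # utility reward credited to the caption
    solver_rewards: List[float]  # one outcome reward per Solver sample


def role_specific_rewards(
    caption: str,
    question: str,
    reference: str,
    solver: Callable[[str, str, int], str],
    num_solver_samples: int = 4,
) -> RoleRewards:
    """Score one Observer caption and the Solver answers conditioned on it.

    Assumption: the Observer's utility is the mean Solver outcome reward.
    """
    solver_rewards = [
        outcome_reward(solver(caption, question, i), reference)
        for i in range(num_solver_samples)
    ]
    return RoleRewards(observer_reward=mean(solver_rewards),
                       solver_rewards=solver_rewards)


def toy_solver(caption: str, question: str, sample_index: int) -> str:
    """Hypothetical Solver: answers correctly only if the caption holds the evidence."""
    return "6" if "radius 3" in caption else "unknown"


if __name__ == "__main__":
    question = "What is the diameter of the circle?"
    informative = role_specific_rewards("A circle with radius 3.", question, "6", toy_solver)
    vague = role_specific_rewards("A round shape.", question, "6", toy_solver)
    print(informative.observer_reward, vague.observer_reward)
```

In this toy setup, a caption that carries the needed evidence earns a high Observer reward because the Solver succeeds when conditioned on it, while a vague caption earns none. That is the credit-assignment separation the abstract attributes to PRCO: perception is rewarded for usefulness to reasoning, and reasoning is rewarded for correctness.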

1 Citation

0 Influential

3.5 Altmetric

18.5 Score
