2602.12395v1 Feb 12, 2026 cs.CV

What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

Xirui Li

Citations: 319

h-index: 5

Tianyi Zhou

Citations: 946

h-index: 11

Ming Li

Citations: 1,402

h-index: 15

Original Abstract

Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge the gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability test via model merging. Instead, RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.
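Two of the steps named in the abstract, update characterization via parameter comparison and the transferability test via model merging, can be illustrated with a toy sketch. This is not the authors' implementation: the `layers.<i>.` parameter naming, the plain-list "weights", and the helper names `update_norm_per_layer` and `merge_layers` are all assumptions made for illustration, and the abstract does not say which distance measure or merging rule the paper actually uses.

```python
import math
import re

# Hypothetical naming scheme: transformer block i owns parameters
# whose names contain "layers.<i>." (an assumption, not from the paper).
LAYER_RE = re.compile(r"layers\.(\d+)\.")


def layer_index(name):
    """Return the block index encoded in a parameter name, or None."""
    match = LAYER_RE.search(name)
    return int(match.group(1)) if match else None


def update_norm_per_layer(sft, rl):
    """L2 norm of (rl - sft) aggregated per block: a simple parameter comparison."""
    squared = {}
    for name, w_sft in sft.items():
        idx = layer_index(name)
        if idx is None:
            continue  # skip parameters outside the block stack (e.g. embeddings)
        diff = sum((a - b) ** 2 for a, b in zip(rl[name], w_sft))
        squared[idx] = squared.get(idx, 0.0) + diff
    return {idx: math.sqrt(v) for idx, v in sorted(squared.items())}


def merge_layers(sft, rl, start, end):
    """Build a hybrid model: blocks in [start, end) come from rl, the rest from sft."""
    merged = {}
    for name, w_sft in sft.items():
        idx = layer_index(name)
        take_rl = idx is not None and start <= idx < end
        merged[name] = list(rl[name] if take_rl else w_sft)
    return merged


# Toy checkpoints: the RL model differs from the SFT model only in later blocks.
sft = {
    "embed.weight": [0.0, 0.0],
    "layers.0.weight": [1.0, 1.0],
    "layers.1.weight": [1.0, 1.0],
    "layers.2.weight": [1.0, 1.0],
}
rl = {
    "embed.weight": [0.0, 0.0],
    "layers.0.weight": [1.0, 1.0],
    "layers.1.weight": [1.0, 4.0],
    "layers.2.weight": [4.0, 5.0],
}

norms = update_norm_per_layer(sft, rl)
merged = merge_layers(sft, rl, start=1, end=3)
print(norms)
print(merged["layers.2.weight"])
```

In this toy setup the per-block norms show where the RL checkpoint moved away from the SFT checkpoint, and the merged dictionary is a hybrid that keeps the SFT early block while taking the later blocks from RL. Evaluating such a hybrid on a benchmark would be the transferability test, and that step is omitted here.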

3 Citations

0 Influential

7.5 Altmetric

40.5 Score
