2604.20755v1 Apr 22, 2026 cs.AI

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

Cao Liu

Citations: 19

h-index: 3

Ke Zeng

Citations: 106

h-index: 6

Xuxin Cheng

Citations: 81

h-index: 3

Yubo Jiang

Citations: 3

h-index: 1

Xin Yang

Citations: 2

h-index: 1

Abudukelimu Wuerkaixi

Citations: 162

h-index: 8

Feng-ying Xie

Citations: 3,588

h-index: 31

Haopeng Zhang

University of Hawaii at Manoa

Citations: 1,041

h-index: 9

Yitong An

Citations: 0

h-index: 0

Zhiguo Jiang

Citations: 219

h-index: 8

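We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). MLLMs trained only on final outcomes tend to treat visual reasoning as a black box and to rely on superficial pattern matching rather than rigorous multi-step inference. Reinforcement learning with verifiable rewards can enforce transparent reasoning trajectories, but extending it to the visual domain is difficult because abstract logic must be grounded in continuous pixel space. We address this by using the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 uses a specialized critic VLM to evaluate, step by step, the explicit visual chain-of-thought generated by a policy VLM and to return dense feedback. To optimize the system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a new reinforcement learning algorithm that integrates process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive experiments show that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B achieves state-of-the-art accuracy among open-source models on complex table benchmarks, outperforms models up to 18x its size, and improves over its SFT baseline.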

Original Abstract

We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm integrating process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive evaluations demonstrate that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By fundamentally shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B establishes state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18x its size and improving over its SFT baseline
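
The abstract names process rewards as one ingredient of PGPO but does not give the formula. As a rough illustration only, the sketch below shows one common way step-level critic scores could be blended with an outcome reward and turned into group-normalized advantages. The function names, the `process_weight` parameter, and the group normalization itself are assumptions made for this example; they are not taken from the paper and do not cover its decoupled policy constraints or length-aware dynamic sampling.

```python
from statistics import mean, pstdev


def blended_reward(step_scores, answer_correct, process_weight=0.5):
    """Blend the mean step-level critic score with a binary outcome reward.

    step_scores: hypothetical per-step critic scores in [0, 1].
    process_weight: assumed mixing coefficient, not a value from the paper.
    """
    process = mean(step_scores) if step_scores else 0.0
    outcome = 1.0 if answer_correct else 0.0
    return process_weight * process + (1.0 - process_weight) * outcome


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts for one table question.
rollouts = [
    ([1.0, 1.0, 1.0], True),   # every step grounded, answer correct
    ([0.0, 0.0], True),        # correct answer reached by shortcut guessing
    ([1.0, 0.5], False),       # mostly grounded steps, wrong answer
]
rewards = [blended_reward(scores, correct) for scores, correct in rollouts]
advantages = group_advantages(rewards)
```

Under these assumptions, the rollout that guesses the right answer with ungrounded steps earns a lower reward than the fully grounded one, which is the kind of penalty on shortcut guessing the abstract describes.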
