2603.18662v1 Mar 19, 2026 cs.AI

Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

Long Ma

Haokun Zhao

Wanshi Xu

Haidong Yuan

Songjun Cao

Yanghua Xiao

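Geometric reasoning inherently requires "thinking with constructions": actively manipulating visual aids to close the gap between a problem's conditions and its solution. Existing multimodal large language models (MLLMs), however, mostly reason passively over static diagrams and lack the strategic knowledge of when and how to construct an effective visual aid. To address this, we present a framework for visual-text interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark of 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study yields two key insights: (1) interleaved visual-textual aids outperform single-modality aids, because a single modality cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, correlating strongly with lower reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO uses adaptive reward shaping to regulate the timing and quality of visual aids, relying on counterfactual sampling to distinguish necessary constructions from redundant ones. Experiments show that our approach lets MLLMs make selective use of auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.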

Original Abstract

Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.
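
The abstract does not spell out how perplexity is measured or how the counterfactual comparison feeds into the reward, so the following is only a minimal sketch under stated assumptions, not the authors' A2PO formulation. It computes perplexity in the standard way (the exponential of the mean negative token log-likelihood) and illustrates one plausible shaping rule: a rollout that uses an auxiliary construction earns a bonus when a counterfactual rollout without the construction fails, and a small penalty when the counterfactual also succeeds. The names `Rollout` and `shaped_reward` and the `bonus` and `penalty` values are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the mean negative log-likelihood."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


@dataclass
class Rollout:
    """Hypothetical summary of one sampled reasoning trace."""
    used_construction: bool  # did the trace draw an auxiliary construction?
    correct: bool            # did the final answer match the reference?


def shaped_reward(actual: Rollout, counterfactual: Rollout,
                  bonus: float = 0.5, penalty: float = 0.2) -> float:
    """Illustrative reward for `actual`, given a counterfactual rollout
    sampled on the same problem WITHOUT the construction.

    Assumption (not from the paper): necessity is judged only by whether
    the counterfactual rollout still reaches the correct answer.
    """
    base = 1.0 if actual.correct else 0.0
    if not actual.used_construction or not actual.correct:
        return base  # nothing to shape: no construction, or a wrong answer
    if counterfactual.correct:
        return base - penalty  # construction was redundant
    return base + bonus        # construction was necessary


if __name__ == "__main__":
    # Four tokens, each with probability 0.5, give a perplexity of about 2.
    print(perplexity([math.log(0.5)] * 4))
    # Construction used and needed: the counterfactual fails without it.
    print(shaped_reward(Rollout(True, True), Rollout(False, False)))
```

A real training setup would sample several rollouts per problem and would likely score construction quality as well as timing; this sketch keeps only the necessary-versus-redundant distinction that the abstract describes.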
