2603.08369v1 Mar 09, 2026 cs.AI

M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering

Zhen Xu

Citations: 44

h-index: 3

Baoxun Wang

Citations: 19

h-index: 3

Peijin Xie

Citations: 7

h-index: 2

Bingquan Liu

Citations: 7

h-index: 2

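Multimodal large language models have recently made notable progress in visual mathematical reasoning. Their performance, however, is often limited by an important and frequently overlooked problem: inaccurate visual perception. A systematic analysis shows that most errors stem from incorrect or incomplete extraction of visual evidence rather than from weak reasoning ability. Models also tend to be overconfident in their initial perceptions, so common strategies such as prompt engineering, multi-round self-reflection, or posterior guidance do not reliably correct these errors. To address this limitation, the authors propose M$^3$-ACE, a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Rather than directly aggregating final answers, the approach decouples perception from reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents contribute complementary observations, which lets the system expose inconsistencies and recover missing visual information. To support stable multi-turn collaboration, the authors also introduce two lightweight tools: a Summary Tool that organizes the evidence collected from different agents into consistent, complementary, and conflicting components, and a Refine Tool that filters unreliable samples and guides iterative correction. Extensive experiments show that M$^3$-ACE substantially improves visual mathematical reasoning performance across multiple benchmarks. The method sets a new state-of-the-art result of 89.1 on the MathVision benchmark and shows consistent gains on other related datasets, including MathVista and MathVerse. These results underline the importance of perception-centric multi-agent collaboration for advancing multimodal reasoning systems.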

Original Abstract

Multimodal large language models have recently shown promising progress in visual mathematical reasoning. However, their performance is often limited by a critical yet underexplored bottleneck: inaccurate visual perception. Through systematic analysis, we find that most failures originate from incorrect or incomplete visual evidence extraction rather than deficiencies in reasoning capability. Moreover, models tend to remain overly confident in their initial perceptions, making standard strategies such as prompt engineering, multi-round self-reflection, or posterior guidance insufficient to reliably correct errors. To address this limitation, we propose M$^3$-ACE, a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Instead of directly aggregating final answers, our approach decouples perception and reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents collaboratively contribute complementary observations, enabling the system to expose inconsistencies and recover missing perceptual information. To support stable multi-turn collaboration, we further introduce two lightweight tools: a Summary Tool that organizes evidence from different agents into consistent, complementary, and conflicting components, and a Refine Tool that filters unreliable samples and guides iterative correction. Extensive experiments demonstrate that M$^3$-ACE substantially improves visual mathematical reasoning performance across multiple benchmarks. Our method establishes a new state-of-the-art result of 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets, including MathVista and MathVerse. These results highlight the importance of perception-centric multi-agent collaboration for advancing multimodal reasoning systems.
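
The abstract describes the Summary Tool only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes each agent's visual evidence list can be reduced to a mapping from a fact label to an observed value; the names `EvidenceSummary`, `summarize_evidence`, and `needs_refinement` are hypothetical, and exact string comparison stands in for whatever matching the real tool performs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EvidenceSummary:
    """Hypothetical container for the three evidence groups named in the abstract."""
    consistent: dict[str, str] = field(default_factory=dict)      # every agent reports the same value
    complementary: dict[str, str] = field(default_factory=dict)   # only some agents report it, no disagreement
    conflicting: dict[str, dict[str, str]] = field(default_factory=dict)  # agents disagree


def summarize_evidence(agent_evidence: dict[str, dict[str, str]]) -> EvidenceSummary:
    """Partition per-agent evidence into consistent, complementary, and conflicting parts.

    Assumption: evidence is keyed by a shared fact label and compared by string equality.
    """
    summary = EvidenceSummary()
    all_keys = {key for evidence in agent_evidence.values() for key in evidence}
    for key in sorted(all_keys):
        # Collect what each agent that mentions this fact says about it.
        reports = {agent: ev[key] for agent, ev in agent_evidence.items() if key in ev}
        values = set(reports.values())
        if len(values) > 1:
            summary.conflicting[key] = reports
        elif len(reports) == len(agent_evidence):
            summary.consistent[key] = values.pop()
        else:
            summary.complementary[key] = values.pop()
    return summary


def needs_refinement(summary: EvidenceSummary) -> bool:
    """Hypothetical stand-in for a refine step: flag samples with unresolved conflicts."""
    return bool(summary.conflicting)


# Illustrative (made-up) observations from two agents looking at the same diagram.
agents = {
    "agent_a": {"angle ABC": "90", "AB": "5"},
    "agent_b": {"angle ABC": "90", "AB": "6", "BC": "12"},
}
summary = summarize_evidence(agents)
print(summary.consistent)       # facts all agents agree on
print(summary.complementary)    # facts only some agents noticed
print(summary.conflicting)      # facts that would trigger another round
```

In this toy run, the disagreement over `AB` is what a further round of perception would need to resolve, while `BC` illustrates evidence that a single agent would have missed on its own.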
