2601.16218v1 Jan 03, 2026 cs.CL

M3Kang: Evaluating Multilingual Multimodal Mathematical Reasoning in Vision-Language Models

N. Hadida

Citations: 2

h-index: 1

Aleix Torres-Camps

Citations: 3

h-index: 1

Victor Conchello Vendrell

Citations: 2

h-index: 1

Alex Batlle Casellas

Citations: 2

h-index: 1

Arnau Padrés Masdemont

Citations: 22

h-index: 1

Jordi Ros-Giralt

Citations: 4

h-index: 1

Original Abstract

Although state-of-the-art vision-language models (VLMs) have demonstrated strong reasoning capabilities, their performance in multilingual mathematical reasoning remains underexplored, particularly when compared to human performance. To bridge this gap, we introduce M3Kang, the first massively multilingual, multimodal mathematical reasoning dataset for VLMs. It is derived from the Kangaroo Math Competition, the world's largest mathematics contest, which annually engages over six million participants under the age of 18 across more than 90 countries. M3Kang includes 1,747 unique multiple-choice problems organized by grade-level difficulty, some of which include diagrams essential for solving them, with translations into 108 culturally diverse languages. Using this dataset, we conduct extensive benchmarking on both closed- and open-source SOTA models. We observe that, despite recent advances, models still struggle with basic math and diagram-based reasoning, and that performance scales with language presence and model size but not with grade level. We also find that multilingual techniques can be effectively extended to the multimodal setting, resulting in significant improvements over baseline approaches. Our analysis also incorporates performance data from over 68,000 students, enabling direct comparison with human performance. We are open-sourcing M3Kang, including the English-only subset M2Kang, along with the framework and codebase used to construct the dataset.
