2603.24961v1 Mar 26, 2026 cs.AI

Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

Wei Wang (Citations: 0, h-index: 0)
Zhiling Yan (Citations: 2,406, h-index: 12)
Dingjie Song (Citations: 98, h-index: 5)
Tianlong Xu (Citations: 472, h-index: 8)
Xing Fan (Citations: 80, h-index: 3)
Haoyang Li (Citations: 0, h-index: 0)
Lichao Sun (Citations: 536, h-index: 5)
Qingsong Wen (Citations: 547, h-index: 8)
Hang Li (Citations: 42, h-index: 2)

Original Abstract

Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.
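
As a rough illustration of the Error Cause Classification (ECC) task described above, the following minimal sketch scores predicted error types against expert labels. It is not the authors' evaluation framework: the `Sample` fields, the function names, and the `type_1` to `type_7` labels are hypothetical placeholders, since the abstract states that seven error types exist but does not name them.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical placeholder labels: the abstract says there are seven error
# types but does not name them, so these identifiers are illustrative only.
ERROR_TYPES = [f"type_{i}" for i in range(1, 8)]


@dataclass
class Sample:
    """One benchmark item (field names are assumptions, not the real schema)."""
    sample_id: str
    image_path: str  # path to the handwritten scratchwork image
    gold_label: str  # expert-annotated error type


def ecc_accuracy(samples, predictions):
    """Fraction of samples whose predicted error type matches the gold label.

    `predictions` maps sample_id to a predicted label; a missing prediction
    counts as incorrect.
    """
    if not samples:
        raise ValueError("no samples to score")
    correct = sum(
        1 for s in samples if predictions.get(s.sample_id) == s.gold_label
    )
    return correct / len(samples)


def per_type_recall(samples, predictions):
    """Recall for each gold error type that appears in `samples`."""
    totals, hits = Counter(), Counter()
    for s in samples:
        totals[s.gold_label] += 1
        if predictions.get(s.sample_id) == s.gold_label:
            hits[s.gold_label] += 1
    return {label: hits[label] / totals[label] for label in totals}
```

Reporting per-type recall alongside overall accuracy is one way to see whether a model fails on particular error categories rather than uniformly. The Error Cause Explanation (ECE) task produces free-text output and would need a different scoring approach, which this sketch does not attempt.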
