2601.23048v2 Jan 30, 2026 cs.AI

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

Yixia Li

Southern University of Science and Technology

Citations: 141

h-index: 6

Guanhua Chen

Citations: 49

h-index: 1

Bowen Cao

Citations: 1

h-index: 1

Dongdong Zhang

Citations: 12

h-index: 1

Yaokang Wu

Citations: 3

h-index: 1

Wai Lam

Citations: 104

h-index: 2

Furu Wei

Citations: 9

h-index: 2

Junpeng Liu

Citations: 61

h-index: 1

Chufan Shi

Citations: 3

h-index: 1

Hongyuan Lu

Citations: 260

h-index: 9

Shijue Huang

Citations: 31

h-index: 3

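Recent large language models (LLMs) solve many benchmark math problems with near-expert accuracy, but this progress has not fully carried over into reliable performance in real-world applications. This study analyzes that gap through contextual mathematical reasoning, in which the mathematical core must be formulated from a descriptive scenario. We introduce ContextMATH, a benchmark that recasts AIME and MATH-500 problems into two contextual settings. The first, Scenario Grounding (SG), embeds an abstract problem in a realistic narrative without increasing reasoning complexity. The second, Complexity Scaling (CS), turns explicit conditions into sub-problems, reflecting how constraints tend to appear in practice. An evaluation of 61 proprietary and open-source models shows sharp performance drops: on average, open-source models lose 13 and 34 points on SG and CS respectively, and proprietary models lose 13 and 20 points. Error analysis shows that errors stem mainly from incorrect problem formulation, and that formulation accuracy falls as the difficulty of the original problem rises. Correct formulation is a prerequisite for success, and how far it suffices improves with model scale, indicating that larger models improve in both understanding and reasoning. Even so, formulation and reasoning remain two complementary bottlenecks that limit contextual mathematical problem solving. Finally, fine-tuning on scenario data improves performance, whereas training on formulation alone is ineffective. The performance gap is only partly closed, however, which marks contextual mathematical reasoning as a central unsolved challenge for LLMs.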

Original Abstract

Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows that errors are dominated by incorrect problem formulation, with formulation accuracy declining as original problem difficulty increases. Correct formulation emerges as a prerequisite for success, and its sufficiency improves with model scale, indicating that larger models advance in both understanding and reasoning. Nevertheless, formulation and reasoning remain two complementary bottlenecks that limit contextual mathematical problem solving. Finally, we find that fine-tuning with scenario data improves performance, whereas formulation-only training is ineffective. However, performance gaps are only partially alleviated, highlighting contextual mathematical reasoning as a central unsolved challenge for LLMs.
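The reported declines are differences in accuracy, measured in percentage points, between the original problems and their SG or CS variants, averaged over models. A minimal sketch of that arithmetic follows, assuming per-model accuracies are available as percentages. The function names, the model names, and every number in `results` are hypothetical placeholders for illustration, not values from the paper.

```python
def point_drop(original_acc: float, contextual_acc: float) -> float:
    """Drop in percentage points between two accuracies given in percent."""
    return round(original_acc - contextual_acc, 2)


def average_drop(results: dict[str, dict[str, float]], setting: str) -> float:
    """Mean point drop across models for one contextual setting ("SG" or "CS")."""
    drops = [point_drop(r["original"], r[setting]) for r in results.values()]
    return round(sum(drops) / len(drops), 2)


# Hypothetical accuracies (percent); NOT taken from the paper.
results = {
    "model_a": {"original": 80.0, "SG": 70.0, "CS": 55.0},
    "model_b": {"original": 60.0, "SG": 40.0, "CS": 25.0},
}

sg_avg = average_drop(results, "SG")  # mean of 10 and 20
cs_avg = average_drop(results, "CS")  # mean of 25 and 35
print(f"average drop: SG={sg_avg} points, CS={cs_avg} points")
```

Working in points rather than relative percentages keeps the drops comparable across models with different baseline accuracies, which appears to be how the abstract states its averages.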

0 Citations

0 Influential

4.5 Altmetric

22.5 Score
