2604.22597v1 Apr 24, 2026 cs.AI

Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

E. Yosef

Oron Anschel

Shunit Haviv Hakimi

Asaf Gendler

Adam Botach

Nimrod Berman

I. Kviatkovsky

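Recent advances in large language models (LLMs) have brought substantial improvements across many tasks, notably mathematical reasoning, which is used to assess logical reasoning and problem-solving ability. On mathematical reasoning benchmarks, a model is scored by checking its final answer against the ground-truth answer. This check is commonly done by symbolic mathematics comparison, which does not cover the full range of mathematical representations and solution formats. This work presents a robust and flexible alternative to rule-based symbolic comparison: an LLM-based evaluation framework for model-generated answers that allows accurate evaluation across diverse mathematical representations and answer formats. The authors present failure cases of symbolic evaluation in the popular Lighteval and SimpleRL frameworks and compare them with the proposed method, showing clear improvements over the existing methods. The framework enables more reliable evaluation and benchmarking, which matters for the accurate performance monitoring needed to advance mathematical problem-solving and intelligent systems.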

Original Abstract

Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground truth answer. A common approach for this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. In this work, we offer a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based evaluation framework for evaluating model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them to our approach, demonstrating clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring, which is important for advancing mathematical problem-solving and intelligent systems.
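
To make the contrast in the abstract concrete, the following is a minimal Python sketch of the two styles of answer checking. It is an illustration under stated assumptions, not the authors' implementation and not code from Lighteval or SimpleRL: the names `rule_based_equal`, `build_judge_prompt`, and `llm_judge_equal`, as well as the prompt wording and the `EQUIVALENT`/`DIFFERENT` verdict format, are hypothetical. The `llm` argument stands for any callable that maps a prompt string to a reply string.

```python
from fractions import Fraction
from typing import Callable


def rule_based_equal(candidate: str, reference: str) -> bool:
    """Naive rule-based check (hypothetical): string match, then numeric match."""
    a, b = candidate.strip(), reference.strip()
    if a == b:
        return True
    try:
        # Fraction parses plain numbers such as "1/2", "0.5", or "3",
        # but raises ValueError on LaTeX, units, or surrounding prose.
        return Fraction(a) == Fraction(b)
    except (ValueError, ZeroDivisionError):
        return False


# Hypothetical prompt wording; the paper's actual prompt is not given here.
JUDGE_TEMPLATE = (
    "You are grading a math answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with exactly one word: EQUIVALENT or DIFFERENT."
)


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the judge template with the question and both answers."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )


def llm_judge_equal(
    question: str,
    reference: str,
    candidate: str,
    llm: Callable[[str], str],
) -> bool:
    """Ask an LLM (any prompt -> reply callable) whether the answers match."""
    reply = llm(build_judge_prompt(question, reference, candidate))
    return reply.strip().upper().startswith("EQUIVALENT")
```

In this sketch, `rule_based_equal("0.5", "1/2")` succeeds because both strings parse as numbers, while `rule_based_equal("\\frac{1}{2}", "1/2")` and `rule_based_equal("x = 0.5", "1/2")` fail even though the answers are mathematically the same. An LLM judge receives the question and both answers as text, so it is not tied to one parseable format; its reliability depends on the model and prompt used.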
