2602.00564v1 Jan 31, 2026 cs.AI

Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs

Wei Wang (Citations: 0, h-index: 0)
Xiaoxiao Xu (Citations: 16, h-index: 2)
Weiqi Zhai (Citations: 1, h-index: 1)
Ya-Qi Mo (Citations: 378, h-index: 2)
Ze Xu (Citations: 31, h-index: 2)
Xiang-Xiang Zheng (Citations: 1, h-index: 1)
Boyu Yang (Citations: 13, h-index: 1)
Wenbo Li (Citations: 66, h-index: 4)
Rui Luo (Citations: 17, h-index: 2)
Yucheng Wang (Citations: 2, h-index: 1)
Zhengze Li (Citations: 0, h-index: 0)
Meng Wang (Citations: 7, h-index: 2)
Yuetian Du (Citations: 5, h-index: 2)
Guojie Lin (Citations: 1, h-index: 1)
Y. Wang (Citations: 19, h-index: 2)
Xuan Ren (Citations: 9, h-index: 2)
Hu Wei (Citations: 26, h-index: 3)

Original Abstract

Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about whether these benchmarks can still diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness.
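
The gap between answer-only and process-level scores can be illustrated with a toy sketch. The code below is not HCRS, whose rules the abstract does not specify; it is a hypothetical stand-in which assumes that each solution is recorded as a non-empty list of per-step correctness flags plus a final-answer flag, and that a trace earns credit only for the steps preceding its first error. The names `Trace`, `answer_score`, `chain_score`, and `mean_chain_score`, as well as the sample data, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """Hypothetical record of one model solution (not the paper's data format)."""
    steps_ok: List[bool]  # assumed: one flag per annotated skeleton step, non-empty
    answer_ok: bool       # whether the final answer matches the reference


def answer_score(traces: List[Trace]) -> float:
    """Answer-only metric on a 0-10 scale: share of correct final answers."""
    return 10.0 * sum(t.answer_ok for t in traces) / len(traces)


def chain_score(trace: Trace) -> float:
    """Toy step-level score on a 0-10 scale: share of steps before the first error."""
    valid = 0
    for ok in trace.steps_ok:
        if not ok:
            break  # everything after a broken step earns no credit
        valid += 1
    return 10.0 * valid / len(trace.steps_ok)


def mean_chain_score(traces: List[Trace]) -> float:
    """Average of the toy step-level score over all traces."""
    return sum(chain_score(t) for t in traces) / len(traces)


# Made-up traces for illustration only; not results from the paper.
traces = [
    Trace([True, True, True, True], True),
    Trace([True, False, True, True], True),    # correct answer, broken chain
    Trace([True, True, False, False], False),
    Trace([False, False, False, False], False),
]

print(answer_score(traces), mean_chain_score(traces))
```

In this made-up sample the second trace reaches the correct answer through a flawed chain, so it counts fully toward the answer-only score but only partially toward the step-level one. This is the kind of discrepancy the abstract attributes to answer-only metrics.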
