2605.01203v1 May 02, 2026 cs.AI

GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

Yang Zhao (Citations: 70; h-index: 4)
Bibo Cai (Citations: 94; h-index: 6)
Kai Xiong, Research Center for Social Computing and Information Retrieval (Citations: 406; h-index: 9)
Zhouhao Sun (Citations: 87; h-index: 5)
Bing Qin (Citations: 411; h-index: 12)
Ting Liu (Citations: 1,454; h-index: 15)
Li Du (Citations: 40; h-index: 4)
Xuan Zhang (Citations: 95; h-index: 2)
Xiaofeng Ding (Citations: 460; h-index: 5)
Xinran Dai (Citations: 7; h-index: 1)
Fei Zhang (Citations: 26; h-index: 3)
Wei Tang (Citations: 128; h-index: 3)
Zhiyu Kan (Citations: 0; h-index: 0)

Original Abstract

Process reward models (PRMs) have shown remarkable potential for test-time scaling. Because large language models (LLMs) regularly generate flawed intermediate reasoning steps across a broad spectrum of reasoning and decision-making tasks, PRMs must be able to detect process-level errors in real-world scenarios. However, existing benchmarks focus primarily on mathematical reasoning and therefore fail to comprehensively evaluate the error-detection ability of PRMs across diverse reasoning scenarios. To address this gap, we introduce GR-Ben, a process-level benchmark specifically designed to assess PRM performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is markedly weaker than in mathematical reasoning. (2) In general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs perform worse at detecting computational errors. We hope GR-Ben will foster future research on PRMs for general domains, thereby enhancing the reasoning capabilities of LLMs.

