2604.21611v1 Apr 23, 2026 cs.CL

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

Methods for improving the reasoning performance of large language models (LLMs) have focused on three axes: chain depth, sample breadth, and learned step scorers (PRMs). This work proposes Verbal Process Supervision (VPS), a new approach built on the granularity of external verbal supervision. VPS is training-free: it uses structured natural-language critique from a stronger supervisor model to run an iterative generate-critique-refine loop for up to R rounds. Experiments on GPQA Diamond, AIME 2025, and LiveCodeBench V6, covering both closed and open models, yield three main results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% accuracy at R=4, surpassing the current best result of 94.1% without gradient updates. Second, on AIME 2025, VPS substantially lifts weak actor models, from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), which suggests that critique granularity is the key driver of the gains. Performance varies with the capability gap between the supervisor and the actor (Pearson r=0.90) and degrades when errors are hard to express in language. These results present critique granularity as a new axis for improving reasoning performance.

Original Abstract

Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
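
The generate-critique-refine loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify an API, so the `Actor` and `Supervisor` callables, the `VPSResult` type, the early stop when the supervisor returns no critique, and the convention that one round equals one critique plus one refinement are all assumptions. The toy actor and supervisor stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interfaces (assumptions, not from the paper):
#   Actor(problem, previous_answer, critique) -> new answer
#   Supervisor(problem, answer) -> critique text, or None if it accepts the answer
Actor = Callable[[str, Optional[str], Optional[str]], str]
Supervisor = Callable[[str, str], Optional[str]]


@dataclass
class VPSResult:
    answer: str
    rounds_used: int  # number of critique-and-refine rounds actually spent


def verbal_process_supervision(
    problem: str, actor: Actor, supervisor: Supervisor, max_rounds: int
) -> VPSResult:
    """Run generate-critique-refine for at most max_rounds rounds (the budget R)."""
    answer = actor(problem, None, None)  # initial generation, no critique yet
    for round_idx in range(1, max_rounds + 1):
        critique = supervisor(problem, answer)
        if critique is None:  # assumed stopping rule: supervisor has no objection
            return VPSResult(answer, round_idx - 1)
        answer = actor(problem, answer, critique)  # refine using the critique
    return VPSResult(answer, max_rounds)


# Toy stand-ins for model calls, used only to exercise the loop.
def toy_actor(problem: str, previous: Optional[str], critique: Optional[str]) -> str:
    return "5" if critique is None else "4"


def toy_supervisor(problem: str, answer: str) -> Optional[str]:
    return None if answer == "4" else "Step 1: the sum is wrong; recompute 2 + 2."


if __name__ == "__main__":
    result = verbal_process_supervision("2 + 2 = ?", toy_actor, toy_supervisor, max_rounds=4)
    print(result.answer, result.rounds_used)
```

In this sketch no weights are updated at any point, which matches the training-free framing; the only thing passed between the two models is text.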
