2601.22154v1 Jan 29, 2026 cs.AI

Exploring Reasoning Reward Model for Agents

Kaixuan Fan (Citations: 73, h-index: 3)
Kaituo Feng (Citations: 626, h-index: 9)
Manyuan Zhang (Citations: 101, h-index: 6)
Tianshuo Peng (Citations: 469, h-index: 7)
Zhixun Li (Citations: 22, h-index: 3)
Yilei Jiang (Citations: 147, h-index: 7)
Peng Pei (Citations: 152, h-index: 8)
Xunliang Cai (Citations: 118, h-index: 7)
Xiangyu Yue (Citations: 257, h-index: 9)
Shuang Chen (Citations: 24, h-index: 2)

Original Abstract

Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
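
To make the three feedback components concrete, here is a minimal Python sketch of how structured feedback of this kind might be represented and consumed. Everything in it is an assumption for illustration: the abstract does not give a schema or a combination rule, so the names `AgentRRMFeedback`, `blended_reward`, and `refinement_prompt`, the [0, 1] score range, the weighted sum, and the `weight` parameter are all hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRRMFeedback:
    """Hypothetical container for the three outputs named in the abstract."""
    reasoning_trace: str  # (1) explicit reasoning trace
    critique: str         # (2) focused critique highlighting reasoning flaws
    score: float          # (3) overall process score; assumed to lie in [0, 1]


def blended_reward(outcome_reward: float,
                   feedback: AgentRRMFeedback,
                   weight: float = 0.5) -> float:
    """Assumed rule: sparse outcome reward plus a weighted process score."""
    if not 0.0 <= feedback.score <= 1.0:
        raise ValueError("score is assumed to lie in [0, 1]")
    return outcome_reward + weight * feedback.score


def refinement_prompt(task: str,
                      first_attempt: str,
                      feedback: AgentRRMFeedback) -> str:
    """Assumed text-augmented refinement: pass the critique back as context."""
    return (
        f"Task: {task}\n"
        f"Previous attempt: {first_attempt}\n"
        f"Critique: {feedback.critique}\n"
        "Revise the attempt to address the critique."
    )
```

Under these assumptions, the scalar path loosely corresponds to a reward-augmented use of the score and the prompt path to a text-augmented use of the critique; how the paper actually combines them in Reagent-U is not stated in the abstract.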

3 Citations

0 Influential

4.5 Altmetric

25.5 Score
