2603.18859v1 Mar 19, 2026 cs.AI

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Dahai Yu (Citations: 3,240; h-index: 2)
Michael K. Ng (Citations: 0; h-index: 0)
Zhanke Zhou (Citations: 931; h-index: 14)
Jiangchao Yao (Citations: 1,165; h-index: 16)
Xiao Feng (Citations: 78; h-index: 5)
Bo Han (Citations: 203; h-index: 8)
Jiaqi Fan (Citations: 3; h-index: 1)
Kaiyang Li (Citations: 16; h-index: 3)

Abstract

Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
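
The abstract does not spell out how the state graph is built or which propagation rule is used, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that states are hashable identifiers, that identical states across trajectories are merged into one node, and that a state's reward decays by a hypothetical discount factor `gamma` per hop to the nearest successful terminal state. The names `build_state_graph` and `propagate_rewards` are illustrative and are not taken from the RewardFlow repository.

```python
from collections import defaultdict, deque


def build_state_graph(trajectories):
    """Merge trajectories into one directed graph; identical states share a node.

    Each trajectory is a (states, success) pair. This format is an assumption.
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for states, _ in trajectories:
        for src, dst in zip(states, states[1:]):
            successors[src].add(dst)
            predecessors[dst].add(src)
    return successors, predecessors


def propagate_rewards(trajectories, gamma=0.9):
    """Give each state gamma ** (hops to the nearest successful terminal state).

    States that cannot reach any successful terminal state receive 0.0.
    This discounted backward search is an assumed stand-in for the paper's
    topology-aware propagation, which the abstract does not specify.
    """
    _, predecessors = build_state_graph(trajectories)
    all_states = {s for states, _ in trajectories for s in states}
    goals = {states[-1] for states, success in trajectories if success and states}

    # Breadth-first search backwards from the successful terminal states.
    distance = {goal: 0 for goal in goals}
    queue = deque(goals)
    while queue:
        node = queue.popleft()
        for prev in predecessors[node]:
            if prev not in distance:
                distance[prev] = distance[node] + 1
                queue.append(prev)

    return {s: gamma ** distance[s] if s in distance else 0.0 for s in all_states}


# Two toy trajectories that share the states "start" and "a".
toy_trajectories = [
    (["start", "a", "goal"], True),
    (["start", "b", "a", "dead_end"], False),
]
toy_rewards = propagate_rewards(toy_trajectories, gamma=0.5)
```

In this toy run, state `b` appears only in the failed trajectory but still receives a nonzero reward because it leads to `a`, which lies on a successful path, whereas `dead_end` receives zero. A terminal reward alone would not separate the two states, which is the kind of state-level signal the abstract describes.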

