2604.08178v1 Apr 09, 2026 cs.AI

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

Yulan Hu

Zheng Pan

Xin Li

Lan-Zhe Guo

Wenjing Yang

Jiaxuan Wang

Abstract

In classical Reinforcement Learning from Human Feedback (RLHF), Reward Models (RMs) serve as the fundamental signal provider for model alignment. As Large Language Models evolve into agentic systems capable of autonomous tool invocation and complex reasoning, the paradigm of reward modeling faces unprecedented challenges, most notably the lack of benchmarks specifically designed to assess RM capabilities in tool-integrated environments. To address this gap, we present Plan-RewardBench, a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred agent trajectories from distractor trajectories in complex tool-using scenarios. Plan-RewardBench covers four representative task families: (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery. The benchmark comprises validated positive trajectories and confusable hard negatives constructed via multi-model natural rollouts, rule-based perturbations, and minimal-edit LLM perturbations. We benchmark representative RMs (generative, discriminative, and LLM-as-Judge) under a unified pairwise protocol, reporting accuracy trends across trajectory lengths and task categories. We also provide diagnostic analyses of prevalent failure modes. Our results reveal that all three evaluator families face substantial challenges, with performance degrading sharply on long-horizon trajectories, which underscores the need for specialized training in agentic, trajectory-level reward modeling. Ultimately, Plan-RewardBench aims to serve as both a practical evaluation suite and a reusable blueprint for constructing agentic planning preference data.
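
The pairwise protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `PreferencePair` fields, the length-bucket thresholds, and the `toy_score` function are all hypothetical and chosen only for illustration. The sketch assumes an evaluator that assigns a scalar score to a trajectory and counts a pair as correct when the preferred trajectory scores strictly higher than the distractor.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class PreferencePair:
    """One benchmark item (hypothetical schema, not the paper's format)."""
    task_family: str      # e.g. "safety_refusal", "error_recovery"
    chosen: List[str]     # steps of the preferred trajectory
    rejected: List[str]   # steps of the distractor trajectory


def length_bucket(pair: PreferencePair, edges: Tuple[int, int] = (5, 15)) -> str:
    """Bucket a pair by its longer trajectory; the edges are assumed values."""
    n = max(len(pair.chosen), len(pair.rejected))
    if n <= edges[0]:
        return "short"
    if n <= edges[1]:
        return "medium"
    return "long"


def pairwise_accuracy(
    pairs: List[PreferencePair],
    score: Callable[[List[str]], float],
) -> Dict[Tuple[str, str], float]:
    """Accuracy per (task family, length bucket); ties count as errors."""
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for pair in pairs:
        key = (pair.task_family, length_bucket(pair))
        totals[key] += 1
        if score(pair.chosen) > score(pair.rejected):
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


def toy_score(trajectory: List[str]) -> float:
    """Stand-in for a reward model: penalize steps that mention an error."""
    return -float(sum("error" in step for step in trajectory))
```

In practice `toy_score` would be replaced by a call to a discriminative RM, a generative RM, or an LLM-as-Judge wrapper. Grouping results by length bucket is one way to surface the long-horizon degradation the abstract reports.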
