2603.09203v1 Mar 10, 2026 cs.AI

Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

Jitao Sang (Citations: 7, h-index: 1)

Jiangming Shu (Citations: 108, h-index: 4)

Yuxiang Zhang (Citations: 115, h-index: 4)

Yeeun Ma (Citations: 32, h-index: 1)

Xueyuan Lin (Citations: 28, h-index: 1)

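Retrieval-augmented agents can retrieve external information, but their reliability in multi-step reasoning remains limited. Noisy retrieval can derail multi-hop question answering, and outcome-only reinforcement learning provides signals that are too coarse to optimize intermediate steps. This paper proposes EvalAct (Evaluate-as-Action), which converts implicit retrieval-quality assessment into an explicit action and couples it with a Search-to-Evaluate protocol so that a structured evaluation score immediately follows each retrieval, producing process signals aligned with the interaction trajectory. To exploit these signals, the paper introduces Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to the evaluation scores. PCAR gives greater weight to reliable segments and updates uncertain segments conservatively. In experiments on seven open-domain question answering benchmarks, EvalAct achieved the highest average accuracy, with especially large gains on multi-hop tasks. Ablation experiments further confirm that the explicit evaluation loop drives the main improvements and that PCAR consistently provides additional benefit.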

Original Abstract

Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose EvalAct (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that EvalAct achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
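
The abstract does not give the PCAR formula, so the following is only a minimal sketch of the general idea, under stated assumptions. It computes a GRPO-style group-relative advantage from outcome rewards, then rescales that advantage for each search-evaluate segment according to its self-evaluation score. The 0 to 10 score range, the linear weighting rule, the `strength` parameter, and both function names are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize outcome rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rescale_segment_advantages(advantage, eval_scores, max_score=10.0, strength=0.5):
    """Rescale one rollout's advantage per segment using its evaluation scores.

    ASSUMPTION: scores lie in [0, max_score] and the weight is linear in the
    normalized score, ranging over [1 - strength, 1 + strength]. This rule is
    illustrative only; the paper's actual rescaling may differ.
    """
    scaled = []
    for score in eval_scores:
        # Normalize and clamp the self-evaluation score to [0, 1].
        confidence = min(max(score / max_score, 0.0), 1.0)
        # High-scoring segments get weight > 1, low-scoring ones weight < 1.
        weight = 1.0 + strength * (2.0 * confidence - 1.0)
        scaled.append(advantage * weight)
    return scaled


if __name__ == "__main__":
    # Four rollouts for the same question, with binary outcome rewards.
    advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
    # The first rollout had three search-evaluate segments with these scores.
    print(rescale_segment_advantages(advantages[0], [10, 5, 0]))
```

Under this assumed rule, a segment with a mid-range score keeps the plain group-relative advantage, while higher and lower scores scale the update up or down by at most `strength`.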
