2604.03675v1 Apr 04, 2026 cs.AI

PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training

Yan Gao

Citations: 38

h-index: 4

Yiqun Chen

Citations: 156

h-index: 6

Erhan Zhang

Citations: 74

h-index: 4

Jiaxin Mao

Citations: 130

h-index: 5

Xiaochi Wei

Citations: 6

h-index: 2

Wei Yang

Citations: 27

h-index: 3

Zechun Niu

Citations: 18

h-index: 3

Yi Wu

Citations: 19

h-index: 3

Yao Hu

Citations: 13

h-index: 2

Abstract

In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.
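The prefix idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract says only that step-level rewards come from "performance differences across prefixes", so the specific choice below (reward for turn k = score of prefix k minus score of prefix k-1) is an assumption, as is the use of token-level F1 as the score. The names `token_f1`, `prefixes`, `step_rewards`, and `toy_answerer` are hypothetical; in PRAISE the prefix answers are elicited from the same shared model that acts as the search policy, which the toy answerer here merely stands in for.

```python
from collections import Counter
from typing import Callable, List, Sequence


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between two strings (assumed scoring metric)."""
    pred = prediction.lower().split()
    ref = gold.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def prefixes(turns: Sequence[str]) -> List[List[str]]:
    """All prefix states of a trajectory: after 0, 1, ..., len(turns) turns."""
    return [list(turns[:k]) for k in range(len(turns) + 1)]


def step_rewards(
    turns: Sequence[str],
    answer_from_prefix: Callable[[List[str]], str],
    gold: str,
) -> List[float]:
    """Assumed step reward: score gain of each prefix over the previous one."""
    scores = [token_f1(answer_from_prefix(p), gold) for p in prefixes(turns)]
    return [scores[k] - scores[k - 1] for k in range(1, len(scores))]


def toy_answerer(prefix: List[str]) -> str:
    """Hypothetical stand-in for the shared model answering from a prefix."""
    text = " ".join(prefix).lower()
    return "paris" if "paris" in text else "unknown"


turns = [
    "search: capital of France",
    "doc: Paris is the capital of France",
    "search: population of the city",
]
rewards = step_rewards(turns, toy_answerer, gold="Paris")
print(rewards)  # [0.0, 1.0, 0.0]
```

In this toy run only the second turn, which brings in the document containing the answer, receives a nonzero reward. That is the kind of per-turn credit a single final-answer reward cannot provide. Each prefix could also be paired with its elicited answer and reused as an extra training trajectory, which is the rollout-reuse half of the method.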

0 Citations

0 Influential

3 Altmetric

15.0 Score
