2602.03647v1 Feb 03, 2026 cs.AI

Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration

Minda Hu (Citations: 308; h-index: 8)
Irwin King (Citations: 254; h-index: 7)
Zenan Xu (Citations: 11; h-index: 2)
Licheng Zong (Citations: 454; h-index: 8)
Yankai Chen (Citations: 628; h-index: 14)
Bowei He (Citations: 148; h-index: 8)
Chen Ma (Citations: 13; h-index: 2)
Xue Liu (Citations: 43; h-index: 3)
Pluto Zhou (Citations: 12; h-index: 2)
Hongru Wang, The Chinese University of Hong Kong, University of Edinburgh (Citations: 2,219; h-index: 24)

Abstract

Search-integrated reasoning enables language agents to transcend static parametric knowledge by actively querying external sources. However, training these agents via reinforcement learning is hindered by the multi-scale credit assignment problem: existing methods typically rely on sparse, trajectory-level rewards that fail to distinguish between high-quality reasoning and fortuitous guesses, leading to redundant or misleading search behaviors. To address this, we propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention, with both components jointly optimized during training. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Meta-Refiner, which selectively diagnoses and repairs flawed steps via a 'cut-and-regenerate' mechanism. To provide fine-grained supervision, we introduce a hybrid reward design that couples outcome correctness with a dense process reward quantifying the information density of retrieved evidence. Theoretically, we formalize the Actor-Refiner interaction as a smoothed mixture policy, proving that selective correction yields strict performance gains over strong baselines. Extensive experiments across various general and multi-hop QA datasets demonstrate that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales, achieving superior reasoning accuracy with minimal overhead.
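The 'cut-and-regenerate' mechanism and the hybrid reward described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the names `Step`, `cut_and_regenerate`, `hybrid_reward`, `find_flawed_step`, `continue_from`, `density`, and `weight` are all hypothetical, and the additive combination of outcome and process reward is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    thought: str
    evidence: str = ""  # retrieved text; empty if the step issued no search


Trajectory = List[Step]


def cut_and_regenerate(
    trajectory: Trajectory,
    find_flawed_step: Callable[[Trajectory], Optional[int]],  # hypothetical refiner diagnosis
    continue_from: Callable[[Trajectory], Trajectory],        # hypothetical regeneration call
) -> Trajectory:
    """Keep the steps before the first flawed one and regenerate the rest."""
    cut = find_flawed_step(trajectory)
    if cut is None:
        # Nothing diagnosed as flawed: the initial trajectory is kept unchanged.
        return trajectory
    prefix = trajectory[:cut]
    return prefix + continue_from(prefix)


def hybrid_reward(
    trajectory: Trajectory,
    answer_correct: bool,
    density: Callable[[str], float],  # hypothetical information-density scorer
    weight: float = 0.5,              # assumed mixing weight, not from the paper
) -> float:
    """Outcome reward plus a weighted mean density over retrieved evidence."""
    outcome = 1.0 if answer_correct else 0.0
    searched = [s.evidence for s in trajectory if s.evidence]
    process = sum(density(e) for e in searched) / len(searched) if searched else 0.0
    return outcome + weight * process
```

In this sketch the refiner intervenes only when a flawed step is found, which mirrors the selective correction the abstract describes; how the paper actually locates the cut point and scores evidence is not specified in this text.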

Citations: 2 (1 influential); Altmetric: 12; Score: 64.0
