2602.00845v1 Jan 31, 2026 cs.AI


Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward

Senkang Hu (Citations: 499, h-index: 12)

Yu Guo (Citations: 31, h-index: 4)

S. Kwong (Citations: 124, h-index: 6)

Yong Dai (Citations: 31, h-index: 3)

Yihang Tao (Citations: 141, h-index: 8)

Yuguang Fang (Citations: 806, h-index: 17)

Zhengru Fang (Citations: 473, h-index: 11)

Yuzhi Zhao (Citations: 95, h-index: 6)


Original Abstract

Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, yet optimizing the retrieval process remains challenging due to the lack of dense, principled reward signals. In this paper, we introduce InfoReasoner, a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model's belief states, establishing guarantees including non-negativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model's output distributions using semantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress, enabling efficient training via Group Relative Policy Optimization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to 5.4% average accuracy improvement. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval.
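The abstract does not give the estimator's formula, so the following is only a minimal sketch of the general idea it describes: sample answers before and after a retrieval step, group them into meaning clusters using bidirectional entailment, and treat the drop in entropy over those clusters as the information gain. All names here (`cluster_by_entailment`, `semantic_entropy`, `information_gain`, `toy_entails`) are hypothetical, and `toy_entails` is a string-equality stand-in for a real entailment model. This plain sample-based difference can be negative, so it should not be read as reproducing the guarantees the paper states for its own definition.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical type: returns True if the first text entails the second.
Entails = Callable[[str, str], bool]


def cluster_by_entailment(answers: Sequence[str], entails: Entails) -> List[List[str]]:
    """Greedily group answers that entail each other in both directions."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]  # compare against the first member only
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            # No existing cluster matched, so this answer starts a new one.
            clusters.append([answer])
    return clusters


def semantic_entropy(answers: Sequence[str], entails: Entails) -> float:
    """Shannon entropy (in nats) of the empirical distribution over clusters."""
    if not answers:
        return 0.0
    total = len(answers)
    clusters = cluster_by_entailment(answers, entails)
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)


def information_gain(before: Sequence[str], after: Sequence[str], entails: Entails) -> float:
    """Entropy over answer meanings before retrieval minus entropy after it."""
    return semantic_entropy(before, entails) - semantic_entropy(after, entails)


def toy_entails(a: str, b: str) -> bool:
    """Assumption: case-insensitive equality stands in for an NLI model."""
    return a.strip().lower() == b.strip().lower()


if __name__ == "__main__":
    # Answers sampled without retrieved evidence disagree; with evidence they agree.
    before = ["Paris", "paris", "Lyon", "Marseille"]
    after = ["Paris", "paris", "Paris ", "PARIS"]
    print(information_gain(before, after, toy_entails))
```

In a training loop of the kind the abstract outlines, a value like this would serve as a per-step reward for the retrieval action. How the paper actually samples outputs, weights clusters, and combines the reward with GRPO is not specified in this listing.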

4 Citations

0 Influential

8.5 Altmetric

46.5 Score
