2602.06554v1 Feb 06, 2026 cs.AI

SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees

Tianyi Hu (Citations: 10, h-index: 2)

Qingxu Fu (Citations: 48, h-index: 2)

Yanxi Chen (Citations: 244, h-index: 8)

Zhaoyang Liu (Citations: 125, h-index: 4)

Bolin Ding (Citations: 162, h-index: 7)

Original Abstract

Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single-turn and multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the global optimum under undiscounted conditions, but that combining PPO with GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously be critic-free and provide convergence guarantees in multi-turn scenarios. To address this, we propose SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free approach with convergence guarantees for multi-turn interactions. SeeUPO models multi-turn interaction as sequentially executed multi-agent bandit problems. Through turn-by-turn sequential policy updates in reverse execution order, it ensures monotonic improvement and convergence to the globally optimal solution via backward induction. Experiments on AppWorld and BFCL v4 demonstrate SeeUPO's substantial improvements over existing backbone algorithms: relative gains of 43.3%-54.6% on Qwen3-14B and 24.1%-41.9% on Qwen2.5-14B (averaged across benchmarks), along with superior training stability.
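
The role of the reverse update order can be illustrated with a toy example. The sketch below is not the SeeUPO algorithm: it assumes a deterministic two-turn problem with two actions per turn, a tabular policy, and exact greedy updates in place of gradient steps. The reward table and all names (`REWARD`, `update_turn`, `sweep`) are hypothetical. It only shows that one sweep over the turns in reverse execution order reaches the best return, while one sweep in execution order can stop short of it.

```python
from itertools import product

ACTIONS = (0, 1)
NUM_TURNS = 2

# Hypothetical terminal reward for each (turn-0 action, turn-1 action) pair.
REWARD = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 3.0}


def rollout_return(policy, prefix):
    """Follow `policy` from the history `prefix` to the end; return the reward."""
    history = tuple(prefix)
    while len(history) < NUM_TURNS:
        history += (policy[history],)
    return REWARD[history]


def update_turn(policy, turn):
    """Greedily update every decision at `turn`, holding all other turns fixed."""
    new_policy = dict(policy)
    for history in product(ACTIONS, repeat=turn):
        new_policy[history] = max(
            ACTIONS, key=lambda a: rollout_return(policy, history + (a,))
        )
    return new_policy


def sweep(policy, order):
    """Update the turns one after another in the given order (a single pass)."""
    for turn in order:
        policy = update_turn(policy, turn)
    return policy


# Tabular policy: maps the history of earlier actions to the next action.
initial = {(): 0, (0,): 0, (1,): 0}

reverse = sweep(initial, order=[1, 0])  # last turn first (backward induction)
forward = sweep(initial, order=[0, 1])  # execution order, single pass

print(rollout_return(reverse, ()))  # 3.0, the best entry in REWARD
print(rollout_return(forward, ()))  # 1.0, turn 0 was chosen against a stale turn 1
```

In this toy setting the forward sweep would also reach the optimum if repeated; the point is only that updating the last turn first lets each earlier turn be evaluated against an already-improved continuation.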
