2602.21534v1 Feb 25, 2026 cs.AI

ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Wei Wang

Citations: 3,601

h-index: 8

Xiaoxuan Wang

Citations: 305

h-index: 7

Haixin Wang

Citations: 146

h-index: 5

Ruoyan Li

Citations: 12

h-index: 2

Kaiqiao Han

Citations: 305

h-index: 6

Chenyi Tong

Citations: 7

h-index: 1

Yanqiao Zhu

Citations: 30

h-index: 3

Jason Cong

Citations: 209

h-index: 8

Yizhou Sun

Citations: 259

h-index: 8

Han Zhang

Citations: 150

h-index: 6

Yidan Shi

Citations: 39

h-index: 3

Renliang Sun

Citations: 166

h-index: 4

Alexander Taylor

Citations: 8

h-index: 1

Haoran Deng

Citations: 25

h-index: 3

Abstract

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we first propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean and standardized testbed. Then, we decompose policy gradient into four core design dimensions and assess the performance and stability of each dimension. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

7 Citations

0 Influential

4 Altmetric

27.0 Score
