2601.04767v1 Jan 08, 2026 cs.AI

AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search

Dingwei Chen (Citations: 35, h-index: 3)
Chengming Li (Citations: 56, h-index: 3)
Bo Zhou (Citations: 66, h-index: 6)
Zefang Zong (Citations: 681, h-index: 11)
Yang Li (Citations: 3,005, h-index: 8)
Qi Yi (Citations: 44, h-index: 4)
Bo Qian (Citations: 1, h-index: 1)
Peng Chen (Citations: 45, h-index: 3)
Jie Jiang (Citations: 19, h-index: 2)

Abstract

LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning with external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm for further refining these capabilities. In this paper, we present AT$^2$PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT$^2$PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization (ATPO), a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline of up to 1.84 percentage points on average, with ablation studies validating the effectiveness of each component. Our code is available at https://github.com/zzfoutofspace/ATPO.
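
The abstract names the components but does not give their formulas, so the following is only a minimal illustrative sketch of how a turn-level tree could support both ideas. It is not the authors' implementation. Everything here is an assumption made for illustration: the `TurnNode` structure, using a single per-turn entropy score to pick the branching point, backing up leaf rewards as a plain mean over children, and defining a turn's advantage as its value minus its parent's value.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TurnNode:
    """One agent turn in a rollout tree (hypothetical structure)."""
    name: str
    entropy: float = 0.0              # assumed: an uncertainty score for the turn
    reward: Optional[float] = None    # sparse outcome reward, set on leaves only
    children: List["TurnNode"] = field(default_factory=list)
    value: float = 0.0                # filled in by backup()


def pick_expansion_node(nodes: List[TurnNode]) -> TurnNode:
    """Entropy-guided choice (assumed rule): branch from the most uncertain turn."""
    return max(nodes, key=lambda n: n.entropy)


def backup(node: TurnNode) -> float:
    """Propagate leaf rewards upward; an inner node takes the mean of its children."""
    if not node.children:
        node.value = node.reward if node.reward is not None else 0.0
    else:
        node.value = sum(backup(c) for c in node.children) / len(node.children)
    return node.value


def turn_advantages(node: TurnNode,
                    out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Turn-level signal (assumed rule): child value minus parent value."""
    if out is None:
        out = {}
    for child in node.children:
        out[child.name] = child.value - node.value
        turn_advantages(child, out)
    return out


# Toy tree: the root branches into turns "a" and "b"; turn "a" branches again.
root = TurnNode("root", children=[
    TurnNode("a", entropy=1.5, children=[
        TurnNode("a1", reward=1.0),
        TurnNode("a2", reward=0.0),
    ]),
    TurnNode("b", entropy=0.25, reward=0.0),
])
backup(root)
advantages = turn_advantages(root)
```

In this toy tree only the leaves carry an outcome reward, yet every turn ends up with its own signal: turn "a" gets a positive advantage because one of its continuations succeeds, while turn "b" gets a negative one. A turn-level objective would then weight each turn's update by such a per-turn value rather than by one trajectory-wide reward.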
