2605.02178v1 May 04, 2026 cs.AI

T²PO: Stable Multi-Turn Agentic Reinforcement Learning through Uncertainty-Guided Exploration Control

T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Haixin Wang (citations: 146; h-index: 5)
Chenwei Zhang (citations: 179; h-index: 7)
Yizhou Sun (citations: 16; h-index: 2)
Hejie Cui (citations: 13; h-index: 2)
Xin Liu (citations: 6; h-index: 2)
Shijie Geng (citations: 66; h-index: 3)
Nasser Zalmout (citations: 1,077; h-index: 16)
Shuowei Jin (citations: 177; h-index: 7)
Xin-Yu Zhang (citations: 56; h-index: 2)
Zhenyu Shi (citations: 2; h-index: 1)

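Recent advances in multi-turn reinforcement learning (RL) have substantially improved the performance of reasoning LLMs on complex interactive tasks. Despite progress in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains widespread and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings: the policy keeps generating low-information actions that neither reduce uncertainty nor advance the task. To address this problem, we propose Token- and Turn-level Policy Optimization (T²PO), an uncertainty-aware framework that explicitly controls exploration at a fine-grained level. At the token level, T²PO monitors changes in uncertainty and triggers a "thinking" intervention when the change falls below a threshold. At the turn level, T²PO identifies interactions that show negligible exploration progress and dynamically resamples those turns to avoid wasted rollouts. We evaluate T²PO in diverse environments, including WebShop, ALFWorld, and Search QA, and show substantial gains in training stability and performance through improved exploration efficiency. Code: https://github.com/WillDreamer/T2PO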
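To make the token-level idea concrete, the following is a minimal sketch of a stall detector, not the authors' implementation. The abstract does not say how uncertainty is measured, so Shannon entropy of the next-token distribution is an assumption here, and the names `token_entropy`, `first_stall_index`, `threshold`, and `window` are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution.

    Assumption: entropy stands in for the paper's uncertainty measure.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_stall_index(
    entropies: Sequence[float], threshold: float, window: int = 1
) -> int:
    """Return the first token index t at which the change |H_t - H_{t-1}|
    has stayed below `threshold` for `window` consecutive steps.

    Returns -1 if the uncertainty never stalls. A caller could use the
    returned index as the point to trigger a thinking intervention.
    """
    run = 0
    for t in range(1, len(entropies)):
        if abs(entropies[t] - entropies[t - 1]) < threshold:
            run += 1
            if run >= window:
                return t
        else:
            run = 0  # the change was large again, so restart the count
    return -1


# Hypothetical per-token entropies for one generated action.
entropies = [2.0, 1.0, 0.95, 0.94]
stall_at = first_stall_index(entropies, threshold=0.1)
```

In this example the entropy drops sharply at first and then flattens, so the detector fires at index 2. Raising `window` makes the trigger more conservative by requiring several consecutive low-change steps.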

Original Abstract

Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T$^2$PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T$^2$PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T$^2$PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T$^2$PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and performance improvements with better exploration efficiency. Code is available at: https://github.com/WillDreamer/T2PO.
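The turn-level mechanism described above can be sketched as a bounded resampling loop. This is an illustration under stated assumptions rather than the method from the paper: the abstract does not define how exploration progress is scored, so `progress_of` is a hypothetical caller-supplied function, and `sample_with_resampling`, `min_progress`, and `max_resamples` are names invented for this example.

```python
from typing import Callable, Tuple, TypeVar

Turn = TypeVar("Turn")


def sample_with_resampling(
    sample_turn: Callable[[], Turn],
    progress_of: Callable[[Turn], float],
    min_progress: float,
    max_resamples: int = 2,
) -> Tuple[Turn, int]:
    """Draw one turn and redraw it while its progress score is negligible.

    Assumption: `progress_of` returns a scalar exploration-progress score,
    for example the drop in uncertainty produced by the turn.
    Returns the last sampled turn and the number of resamples used.
    """
    turn = sample_turn()
    resamples = 0
    while progress_of(turn) < min_progress and resamples < max_resamples:
        turn = sample_turn()  # discard the low-progress turn and try again
        resamples += 1
    return turn, resamples


# Hypothetical candidates: two uninformative actions, then a useful one.
candidates = iter(["look", "look", "open drawer"])
scores = {"look": 0.0, "open drawer": 0.4}

chosen, used = sample_with_resampling(
    sample_turn=lambda: next(candidates),
    progress_of=lambda turn: scores[turn],
    min_progress=0.1,
)
```

The cap on resamples keeps the rollout cost bounded; once it is reached, this sketch keeps the last sampled turn even if its score is still below `min_progress`.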

1 Citation

0 Influential

46.187930798632 Altmetric

231.9 Score
