2606.11119v1 Jun 09, 2026 cs.LG

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Saiyong Yang

Citations: 104

h-index: 5

Xingzhong Xu

Citations: 270

h-index: 2

Weijie Liu

Citations: 94

h-index: 4

Yun Qu

Tsinghua University

Citations: 308

h-index: 11

Yixiu Mao

Citations: 236

h-index: 9

Heming Zou

Citations: 79

h-index: 6

Xiangyang Ji

Citations: 232

h-index: 10

Yuhang Jiang

Citations: 269

h-index: 10

Lizhou Cai

Citations: 1

h-index: 1

Qi Wang

Citations: 19

h-index: 2

Runsi Peng

Citations: 16

h-index: 1

Kaixuan Yang

Citations: 0

h-index: 0

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
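To make the allocation idea concrete, the sketch below splits a fixed rollout budget across anchors (prompt roots or turn-level prefixes) according to how likely each one is to yield mixed terminal rewards. It is an illustration under stated assumptions, not the paper's method: the abstract does not give TRACE's scoring or allocation rule, so the Bernoulli variance p(1 - p) is used here as a stand-in score, and the function names, anchor labels, and largest-remainder rounding are hypothetical. The predicted success probabilities are taken as given inputs; the shared predictor is not modeled.

```python
from typing import Dict


def mixed_outcome_score(p: float) -> float:
    """Assumed proxy for reward contrast: Bernoulli variance p * (1 - p).

    It peaks at p = 0.5 and is zero when success is certain or impossible.
    """
    return p * (1.0 - p)


def allocate_budget(success_probs: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` rollouts across anchors in proportion to their score.

    `success_probs` maps a hypothetical anchor id to a predicted conditional
    success probability. Rounding uses the largest-remainder method so the
    integer allocations sum exactly to `budget`.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    anchors = list(success_probs)
    if not anchors:
        return {}

    scores = [mixed_outcome_score(success_probs[a]) for a in anchors]
    total = sum(scores)
    if total <= 0.0:
        # No anchor is expected to give contrast: fall back to an even split.
        weights = [1.0 / len(anchors)] * len(anchors)
    else:
        weights = [s / total for s in scores]

    ideal = [w * budget for w in weights]
    alloc = [int(x) for x in ideal]  # floor, since every value is >= 0
    leftover = max(0, budget - sum(alloc))

    # Hand the remaining rollouts to the largest fractional remainders.
    order = sorted(range(len(anchors)), key=lambda i: ideal[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return dict(zip(anchors, alloc))


if __name__ == "__main__":
    predicted = {"root": 0.5, "prefix_easy": 0.95, "prefix_hard": 0.0}
    print(allocate_budget(predicted, budget=16))
```

Under this assumed score, an anchor with predicted success near 0.5 receives most of the budget, while one predicted to always fail or always succeed receives little or none, which mirrors the abstract's point that low-variance anchors contribute little reward contrast.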

