2603.21383v1 Mar 22, 2026 cs.AI

PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

Baihe Huang

Citations: 77

h-index: 5

Jian Zhang

Citations: 62

h-index: 3

Junkeun Yi

Citations: 91

h-index: 3

Damon Mosk-Aoyama

Citations: 683

h-index: 8

Ritu Gala

Citations: 91

h-index: 3

Charles Wang

Citations: 0

h-index: 0

Sugam Devare

Citations: 57

h-index: 3

Khushi Bhardwaj

Citations: 71

h-index: 3

Abhibha Gupta

Citations: 47

h-index: 3

Oleksii Kuchaiev

Citations: 48

h-index: 3

Jiantao Jiao

Citations: 739

h-index: 14

V. Srinivasan

Citations: 167

h-index: 5

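Post-training for long-horizon agentic tasks must balance compute efficiency against generalization. Supervised fine-tuning (SFT) is compute efficient but often degrades out-of-domain (OOD) performance. End-to-end reinforcement learning (E2E RL), by contrast, preserves OOD performance but incurs high compute cost because it requires many turns of on-policy rollout. This paper introduces PivotRL, a new framework that combines the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms. First, it runs local on-policy rollouts and identifies "pivots": informative intermediate turns whose outcomes show high variance. Second, it rewards functionally equivalent actions instead of requiring strict string matching with the SFT data. We show theoretically that these mechanisms induce strong learning signals with high natural gradient norm while maximally preserving the policy's probability ordering on actions unrelated to the training tasks. Compared with standard SFT on the same data, PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains and +10.04% higher OOD accuracy on non-agentic tasks. On agentic coding tasks in particular, PivotRL reaches accuracy competitive with E2E RL using 4x fewer rollout turns. PivotRL has been adopted in NVIDIA's Nemotron-3-Super-120B-A12B, where it serves as a core technique for large-scale agentic post-training.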

Original Abstract

Post-training for long-horizon agentic tasks has a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities, but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots: informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it utilizes rewards for functional-equivalent actions rather than demanding strict string matching with the SFT data demonstration. We theoretically show that these mechanisms incentivize strong learning signals with high natural gradient norm, while maximally preserving policy probability ordering on actions unrelated to training tasks. In comparison to standard SFT on identical data, we demonstrate that PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains, and +10.04% higher OOD accuracy in non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves competitive accuracy with E2E RL with 4x fewer rollout turns. PivotRL is adopted by NVIDIA's Nemotron-3-Super-120B-A12B, acting as the workhorse in production-scale agentic post-training.
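The two mechanisms described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `functional_reward` and `select_pivots`, the JSON tool-call format, the variance threshold, and the use of a per-turn match reward as the "outcome" are all assumptions made for illustration. The sketch treats a turn as a pivot when actions sampled at that turn receive rewards with high variance, and it scores a sampled action by functional equivalence with the demonstration rather than by exact string match.

```python
import json
from statistics import pvariance
from typing import Callable, List, Sequence


def functional_reward(sampled: str, demo: str) -> float:
    """Hypothetical reward: 1.0 if two actions are functionally equivalent.

    Assumption: actions are JSON tool calls, so key order and whitespace
    do not matter. Non-JSON actions fall back to a whitespace-trimmed
    string comparison.
    """
    try:
        return float(json.loads(sampled) == json.loads(demo))
    except json.JSONDecodeError:
        return float(sampled.strip() == demo.strip())


def select_pivots(
    demo_actions: Sequence[str],
    sample_actions: Callable[[int], List[str]],
    min_variance: float = 0.05,  # illustrative threshold, not from the paper
) -> List[int]:
    """Return indices of turns whose sampled actions have high reward variance."""
    pivots = []
    for turn, demo in enumerate(demo_actions):
        # Local rollout: sample several actions at this turn only.
        rewards = [functional_reward(a, demo) for a in sample_actions(turn)]
        if pvariance(rewards) > min_variance:
            pivots.append(turn)
    return pivots


# Toy demonstration with fixed "samples" standing in for a policy.
demo = ['{"tool": "ls", "path": "."}', '{"tool": "cat", "path": "a.txt"}']
fake_samples = {
    0: ['{"path": ".", "tool": "ls"}'] * 4,  # always equivalent: no signal
    1: ['{"tool": "cat", "path": "a.txt"}',
        '{"tool": "cat", "path": "b.txt"}'] * 2,  # mixed outcomes
}
print(select_pivots(demo, lambda t: fake_samples[t]))  # prints [1]
```

In this toy case turn 0 is skipped because every sample already matches the demonstration, so its rewards carry no variance, while turn 1 is kept because half of the samples succeed and half fail. A real system would replace `fake_samples` with on-policy sampling from the model and would likely use a task-specific equivalence check.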

0 Citations

0 Influential

7 Altmetric

35.0 Score
