2602.11767v1 Feb 12, 2026 cs.AI

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

Aladin Djuhera (citations: 87, h-index: 5)
S. Kadhe (citations: 1,667, h-index: 23)
Holger Boche (citations: 128, h-index: 6)
Farhan Ahmed (citations: 92, h-index: 6)
Heiko Ludwig (citations: 164, h-index: 4)

Abstract

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
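The per-turn selection described in the abstract can be illustrated with a minimal best-of-N sketch. This is not the authors' implementation: the names `best_of_n_rollout`, `propose`, `step`, `score`, `line_step`, `line_score`, and `GOAL` are hypothetical. The toy environment is a deterministic 1-D walk, and the sketch assumes the environment can be simulated one step ahead in order to score candidate actions. The paper's actual scoring signal and environments (Sokoban, FrozenLake, WebShop) may work differently.

```python
import random
from typing import Callable, List, Tuple

State = int
Action = int

GOAL = 5  # hypothetical target position for the toy 1-D task


def line_step(state: State, action: Action) -> State:
    """Toy deterministic transition: move left (-1) or right (+1)."""
    return state + action


def line_score(state: State) -> float:
    """Toy task-specific feedback: states closer to GOAL score higher."""
    return -abs(GOAL - state)


def best_of_n_rollout(
    state: State,
    propose: Callable[[State, random.Random], Action],
    step: Callable[[State, Action], State],
    score: Callable[[State], float],
    n: int,
    max_turns: int,
    rng: random.Random,
) -> List[Tuple[State, Action]]:
    """Build one trajectory, keeping the best of n sampled actions per turn."""
    trajectory: List[Tuple[State, Action]] = []
    for _ in range(max_turns):
        # Sample n candidate actions from the (stand-in) policy.
        candidates = [propose(state, rng) for _ in range(n)]
        # Score each candidate by the state it leads to (assumes a simulable env).
        best = max(candidates, key=lambda a: score(step(state, a)))
        trajectory.append((state, best))
        state = step(state, best)
    return trajectory


def random_policy(state: State, rng: random.Random) -> Action:
    """Stand-in for an LLM policy: picks a direction uniformly at random."""
    return rng.choice([-1, 1])


if __name__ == "__main__":
    rng = random.Random(0)
    naive = best_of_n_rollout(0, random_policy, line_step, line_score, 1, 5, rng)
    searched = best_of_n_rollout(0, random_policy, line_step, line_score, 8, 5, rng)
    print("naive (n=1):   ", naive)
    print("searched (n=8):", searched)
```

With `n=1` the function reduces to naive sampling. Larger `n` spends more rollout compute to collect higher-scoring trajectories, which would then be passed to an unchanged optimizer such as PPO or GRPO. Beam search and shallow lookahead, the other two variants named in the abstract, would replace the single `max` selection with a set of retained partial trajectories or a multi-step evaluation.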
