2602.11767v2 Feb 12, 2026 cs.AI

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

Aladin Djuhera (Citations: 87, h-index: 5)

S. Kadhe (Citations: 1,667, h-index: 23)

Holger Boche (Citations: 128, h-index: 6)

Farhan Ahmed (Citations: 92, h-index: 6)

Heiko Ludwig (Citations: 164, h-index: 4)

Abstract

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
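
The per-turn search described in the abstract can be illustrated with a minimal best-of-N sketch. The abstract gives no implementation details, so everything here is an assumption for illustration: the `Corridor` toy task, the random `sample_action` stand-in for the LLM policy, and the distance-based `score` stand-in for task-specific feedback are all hypothetical and are not the authors' code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Corridor:
    """Hypothetical toy task: walk from `pos` to `goal` on a line."""
    pos: int = 0
    goal: int = 5

    def step(self, action: int) -> "Corridor":
        # action is -1 (left) or +1 (right); states are immutable,
        # so trying a candidate action never disturbs the real rollout.
        return Corridor(self.pos + action, self.goal)

    def done(self) -> bool:
        return self.pos == self.goal


def sample_action(state: Corridor, rng: random.Random) -> int:
    """Stand-in for sampling one action from the LLM policy (assumption)."""
    return rng.choice([-1, 1])


def score(state: Corridor) -> float:
    """Stand-in for task-specific feedback: closer to the goal is better."""
    return -abs(state.goal - state.pos)


def best_of_n_rollout(start: Corridor, n: int, max_turns: int,
                      rng: random.Random):
    """Build one trajectory, keeping the best of `n` sampled actions per turn."""
    state, trajectory = start, []
    for _ in range(max_turns):
        if state.done():
            break
        # Sample n candidate actions for this turn and score each one
        # by the state it would lead to.
        candidates = [sample_action(state, rng) for _ in range(n)]
        action = max(candidates, key=lambda a: score(state.step(a)))
        trajectory.append((state, action))
        state = state.step(action)
    return trajectory, state


rng = random.Random(0)
naive_traj, naive_final = best_of_n_rollout(Corridor(), 1, 10, rng)
searched_traj, searched_final = best_of_n_rollout(Corridor(), 64, 10, rng)
```

With `n=1` the function reduces to naive trajectory sampling, while a larger `n` spends extra rollout compute to select higher-scoring actions at each turn. In either case the resulting trajectory would be handed to an unchanged optimizer such as PPO or GRPO, which is the sense in which the abstract calls TSR optimizer-agnostic. Beam search and shallow lookahead would replace the single `max` selection with a wider or deeper search; they are not shown here.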

2 Citations

1 Influential

11.5 Altmetric

61.5 Score
