2603.24989v1 Mar 26, 2026 cs.RO

Learning Rollout from Sampling: An R1-Style Tokenized Traffic Simulation Model

Ziyan Wang (KCL), citations: 351, h-index: 9

Peng Chen, citations: 224, h-index: 9

Ding Li, citations: 90, h-index: 6

Chi-Ho Li, citations: 7, h-index: 1

Qichao Zhang, citations: 292, h-index: 8

Zhongpu Xia, citations: 413, h-index: 9

Guizhen Yu, citations: 425, h-index: 11

Abstract

Learning diverse and high-fidelity traffic simulations from human driving demonstrations is crucial for autonomous driving evaluation. The recent next-token prediction (NTP) paradigm, widely adopted in large language models (LLMs), has been applied to traffic simulation and achieves iterative improvements via supervised fine-tuning (SFT). However, such methods limit active exploration of potentially valuable motion tokens, particularly in suboptimal regions. Entropy patterns provide a promising perspective for enabling exploration driven by motion token uncertainty. Motivated by this insight, we propose a novel tokenized traffic simulation policy, R1Sim, which represents an initial attempt to explore reinforcement learning based on motion token entropy patterns, and we systematically analyze the impact of different motion tokens on simulation outcomes. Specifically, we introduce an entropy-guided adaptive sampling mechanism that focuses on previously overlooked motion tokens with high uncertainty yet high potential. We further optimize motion behaviors using Group Relative Policy Optimization (GRPO), guided by a safety-aware reward design. Overall, these components enable a balanced exploration-exploitation trade-off through diverse high-uncertainty sampling and group-wise comparative estimation, resulting in realistic, safe, and diverse multi-agent behaviors. Extensive experiments on the Waymo Sim Agent benchmark demonstrate that R1Sim achieves competitive performance compared to state-of-the-art methods.
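The two mechanisms named in the abstract, entropy-guided sampling over motion tokens and GRPO's group-wise comparison, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the sampling rule or the reward, so the normalized-entropy threshold, the higher sampling temperature, and all function names below are assumptions chosen for illustration. Only the group-relative advantage (each reward standardized within its group of rollouts) follows the commonly published GRPO formulation.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(probs):
    """Shannon entropy (in nats) of a motion-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_guided_sample(logits, rng, threshold=0.5, hot_temperature=1.5):
    """Hypothetical rule (an assumption, not taken from the paper):
    take the greedy token when normalized entropy is at or below
    `threshold`, otherwise sample at a higher temperature to explore.
    Requires at least two candidate tokens."""
    probs = softmax(logits)
    normalized = token_entropy(probs) / math.log(len(probs))
    if normalized <= threshold:
        greedy = max(range(len(probs)), key=probs.__getitem__)
        return greedy, normalized
    hot = softmax(logits, hot_temperature)
    token = rng.choices(range(len(hot)), weights=hot, k=1)[0]
    return token, normalized


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each rollout's reward
    against the mean and standard deviation of its own group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch, a confident distribution such as `[10.0, 0.0, 0.0]` yields the greedy token, while a flat distribution triggers stochastic sampling. Rollouts whose reward (for example, a hypothetical safety score) exceeds the group mean receive positive advantages, and no learned value function is needed.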
