2602.10458v1 Feb 11, 2026 cs.AI

Found-RL: Foundation Model-Enhanced Reinforcement Learning for Autonomous Driving

Zihao Sheng (Citations: 488, h-index: 12)
Zilin Huang (Citations: 481, h-index: 12)
Samuel Labi (Citations: 95, h-index: 5)
Yansong Qu (Citations: 116, h-index: 6)
Jiancong Chen (Citations: 10, h-index: 2)
Yuhao Luo (Citations: 9, h-index: 2)
Tianyi Wang (Citations: 13, h-index: 2)
Sikai Chen (Citations: 595, h-index: 14)
Yiheng Feng (Citations: 161, h-index: 7)

Abstract

Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation models, particularly Vision-Language Models (VLMs), can mitigate these issues by offering rich, context-aware knowledge, yet their high inference latency hinders deployment in high-frequency RL training loops. To bridge this gap, we present Found-RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is an asynchronous batch inference framework that decouples heavy VLM reasoning from the simulation loop, resolving latency bottlenecks to support real-time learning. We introduce diverse supervision mechanisms, Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG), to distill expert-like VLM action suggestions into the RL policy. Additionally, we adopt high-throughput CLIP for dense reward shaping. We address CLIP's dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin-based bonus from context-specific action-anchor scoring. Found-RL provides an end-to-end pipeline for fine-tuned VLM integration and shows that a lightweight RL model can approach the performance of billion-parameter VLMs while sustaining real-time inference (approx. 500 FPS). Code, data, and models will be publicly available at https://github.com/ys-qu/found-rl.
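To make the idea of decoupling slow VLM reasoning from the simulation loop concrete, here is a minimal Python sketch. It is not the Found-RL implementation: `AsyncBatchAdvisor`, `slow_vlm_batch`, and `run_demo` are hypothetical names, the "VLM" is a stub that sleeps, and the batching policy (drain whatever is queued, up to a cap) is an assumption chosen for brevity.

```python
import queue
import threading
import time


def slow_vlm_batch(observations):
    """Hypothetical stand-in for a VLM: one slow call answers a whole batch."""
    time.sleep(0.05)  # simulated inference latency
    return [f"suggestion-for-{obs}" for obs in observations]


class AsyncBatchAdvisor:
    """Runs batched inference on a worker thread so callers never block."""

    def __init__(self, infer_fn, max_batch=8):
        self._infer_fn = infer_fn
        self._max_batch = max_batch
        self._requests = queue.Queue()
        self._results = {}
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, step_id, observation):
        """Enqueue a request and return immediately."""
        self._requests.put((step_id, observation))

    def poll(self, step_id):
        """Return (and remove) the suggestion if ready, otherwise None."""
        with self._lock:
            return self._results.pop(step_id, None)

    def close(self):
        """Send a stop sentinel and wait until queued requests are answered."""
        self._requests.put(None)
        self._worker.join()

    def _run(self):
        while True:
            item = self._requests.get()  # block until work (or sentinel) arrives
            if item is None:
                return
            batch, stop = [item], False
            # Greedily gather whatever else is already queued, up to the cap.
            while len(batch) < self._max_batch:
                try:
                    nxt = self._requests.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            ids, obs = zip(*batch)
            outputs = self._infer_fn(list(obs))
            with self._lock:
                self._results.update(zip(ids, outputs))
            if stop:
                return


def run_demo(num_steps=20):
    advisor = AsyncBatchAdvisor(slow_vlm_batch)
    for step in range(num_steps):
        advisor.submit(step, f"obs{step}")  # simulation loop is not stalled
        # an environment step and policy update would go here
    advisor.close()
    return {s: advisor.poll(s) for s in range(num_steps)}
```

In a training loop one would call `poll` on later steps and use a suggestion only once it has arrived, so the simulator's step rate is set by the environment rather than by the slow model.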
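The abstract describes the CLIP-based reward only at a high level, so the following Python sketch is one possible reading, not the authors' formula. It assumes prompts are built from a discretized speed bin and a navigation command, that each candidate action ("anchor") gets a cosine-similarity score against the image embedding, and that the bonus is the margin between the taken action's score and the best competing score, squashed into [-1, 1] with `tanh`. The anchor set, bin thresholds, prompt wording, `scale` parameter, and all function names are hypothetical; real embeddings would come from a CLIP image and text encoder, which is not shown here.

```python
import math

ACTION_ANCHORS = ("accelerate", "keep speed", "brake")  # hypothetical anchor set


def speed_bin(speed_mps):
    """Discretize speed into a coarse label (thresholds are assumptions)."""
    if speed_mps < 1.0:
        return "stopped"
    if speed_mps < 8.0:
        return "slow"
    return "fast"


def build_prompts(speed_mps, command):
    """One context-conditioned text prompt per action anchor."""
    context = f"the car is {speed_bin(speed_mps)} and should {command}"
    return {a: f"{context}; the right action is to {a}" for a in ACTION_ANCHORS}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_bonus(image_emb, prompt_embs, taken_action, scale=0.1):
    """Margin of the taken action over the best other anchor, in [-1, 1]."""
    scores = {a: cosine(image_emb, e) for a, e in prompt_embs.items()}
    best_other = max(s for a, s in scores.items() if a != taken_action)
    margin = scores[taken_action] - best_other
    return math.tanh(margin / scale)  # normalization choice is an assumption


# Toy 2-D embeddings standing in for encoder outputs.
toy_image = [1.0, 0.0]
toy_prompts = {
    "accelerate": [0.0, 1.0],
    "keep speed": [1.0, 1.0],
    "brake": [1.0, 0.0],
}
bonus_brake = margin_bonus(toy_image, toy_prompts, "brake")
bonus_accel = margin_bonus(toy_image, toy_prompts, "accelerate")
```

With these toy vectors the image aligns best with the "brake" prompt, so braking earns a positive bonus and accelerating a negative one. Conditioning the prompts on speed and command is what lets a single static frame be scored differently in different driving contexts.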

