2601.04714v1 Jan 08, 2026 cs.AI

ThinkDrive: Chain-of-Thought Guided Progressive Reinforcement Learning Fine-Tuning for Autonomous Driving

Chang Zhao (Citations: 58, h-index: 3)

Zheming Yang (Citations: 44, h-index: 5)

Yunqing Hu (Citations: 20, h-index: 3)

Qi Guo (Citations: 73, h-index: 3)

Zijian Wang (Citations: 34, h-index: 1)

Pengcheng Li (Citations: 23, h-index: 3)

Wen Ji (Citations: 197, h-index: 6)

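Rapid progress in large language model (LLM) technology is steadily expanding its use in autonomous driving. Existing methods, however, suffer from unstructured reasoning, poor generalization, and misalignment with human driving intent. Chain-of-Thought (CoT) reasoning improves the transparency of decision-making, but conventional supervised fine-tuning (SFT) does not fully exploit its potential, and reinforcement learning (RL) approaches face instability and insufficient reasoning depth. We therefore propose ThinkDrive, a CoT-guided progressive RL fine-tuning framework for autonomous driving that combines explicit reasoning with difficulty-aware adaptive policy optimization. The method uses a two-stage training strategy: first, SFT is performed with CoT explanations; second, progressive RL is applied through a difficulty-aware adaptive policy optimizer that dynamically adjusts learning intensity according to sample complexity. Evaluated on a public dataset, ThinkDrive outperforms strong RL baselines by 1.45%, 1.95%, and 1.01% on the exam, easy-exam, and accuracy metrics, respectively. In addition, a 2-billion-parameter (2B) model trained with this method outperforms the much larger GPT-4o by 3.28% on the exam metric.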

Original Abstract

With the rapid advancement of large language model (LLM) technologies, their application in the domain of autonomous driving has become increasingly widespread. However, existing methods suffer from unstructured reasoning, poor generalization, and misalignment with human driving intent. While Chain-of-Thought (CoT) reasoning enhances decision transparency, conventional supervised fine-tuning (SFT) fails to fully exploit its potential, and reinforcement learning (RL) approaches face instability and suboptimal reasoning depth. We propose ThinkDrive, a CoT-guided progressive RL fine-tuning framework for autonomous driving that synergizes explicit reasoning with difficulty-aware adaptive policy optimization. Our method employs a two-stage training strategy. First, we perform SFT using CoT explanations. Then, we apply progressive RL with a difficulty-aware adaptive policy optimizer that dynamically adjusts learning intensity based on sample complexity. We evaluate our approach on a public dataset. The results show that ThinkDrive outperforms strong RL baselines by 1.45%, 1.95%, and 1.01% on the exam, easy-exam, and accuracy metrics, respectively. Moreover, a 2B-parameter model trained with our method surpasses the much larger GPT-4o by 3.28% on the exam metric.
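
The abstract does not give the optimizer's formula, so the following is only an illustrative sketch of what "adjusting learning intensity based on sample complexity" could look like, not the authors' method. It assumes a group-sampling setup in which several responses are scored per prompt with rewards in [0, 1]. The difficulty estimate (one minus the mean reward), the linear weight, and the names `difficulty`, `weighted_advantages`, and `alpha` are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def difficulty(rewards):
    """Hypothetical difficulty score in [0, 1]: one minus the mean reward
    of the responses sampled for a single prompt (assumption, not from the paper)."""
    return 1.0 - mean(rewards)


def weighted_advantages(rewards, alpha=1.0, eps=1e-6):
    """Group-normalized advantages scaled by a difficulty-dependent weight.

    rewards: rewards in [0, 1] for several responses to one prompt.
    alpha:   hypothetical knob; alpha=0 disables the difficulty weighting.
    eps:     small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Standard within-group normalization of each reward.
    base = [(r - mu) / (sigma + eps) for r in rewards]
    # Assumed rule: harder prompts (lower mean reward) get a larger update.
    weight = 1.0 + alpha * difficulty(rewards)
    return [weight * a for a in base]


# An "easy" prompt (3 of 4 responses correct) and a "hard" one (1 of 4 correct).
easy = weighted_advantages([1.0, 1.0, 1.0, 0.0])
hard = weighted_advantages([0.0, 0.0, 0.0, 1.0])
```

Under these assumptions, the single correct response on the hard prompt receives a larger-magnitude advantage than the single incorrect response on the easy prompt, so the policy update concentrates on harder samples. Other weighting schedules, such as one that changes over the course of training, would fit the same interface.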
