2601.21288v1 Jan 29, 2026 cs.AI

Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Weitong Lian, Zecong Tang, Haoran Li, Zixu Wang, Lingyi Meng, Tengju Ru, Zhejun Cui, Yichen Zhu, Hangshuo Cao, Qi Kang, Tianxing Chen, Yusen Qin, Kaixuan Wang, Yu Zhang, Yifei Wang, Tianjian Gao

Abstract

Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.
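The abstract names two mechanisms but does not give their formulas: a layer-specific attention distillation signal and asymmetric gradient projection across the capability-specific teachers. What follows is a minimal sketch of one plausible reading, not the authors' implementation. It assumes the attention signal is a KL divergence between teacher and student attention distributions taken at one chosen layer, with matching shape (heads, queries, keys). It also assumes that "asymmetric" means a higher-priority capability's gradient is never modified. A lower-priority gradient instead has its conflicting component (negative dot product) projected out, in the spirit of PCGrad. The function names `attention_kd_loss`, `asymmetric_project` and `combine_capability_grads`, and the priority ordering, are hypothetical.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax over the key dimension.
    z = x - x.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def attention_kd_loss(student_scores, teacher_scores):
    """KL(teacher || student) between attention distributions (assumed form).

    Both inputs are pre-softmax attention scores from one chosen layer,
    shape (heads, queries, keys). The loss is averaged over heads and queries.
    """
    log_p_t = log_softmax(teacher_scores)
    log_p_s = log_softmax(student_scores)
    kl = np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=-1)
    return float(kl.mean())


def asymmetric_project(g_keep, g_adjust):
    """Return g_adjust with its component opposing g_keep removed.

    g_keep is never modified (the assumed asymmetry). g_adjust is projected
    onto the normal plane of g_keep only when the two conflict (dot < 0);
    dot < 0 implies g_keep is nonzero, so the division is safe.
    """
    dot = float(np.dot(g_adjust, g_keep))
    if dot < 0:
        g_adjust = g_adjust - dot / float(np.dot(g_keep, g_keep)) * g_keep
    return g_adjust


def combine_capability_grads(grads, priority):
    """Sum per-capability gradients after asymmetric projection.

    grads: dict mapping a capability name to a flat NumPy gradient vector.
    priority: capability names ordered from most to least protected
    (a hypothetical ordering). Each gradient is projected, in turn,
    against every higher-priority original gradient before summation.
    """
    total = np.zeros_like(grads[priority[0]], dtype=float)
    for i, name in enumerate(priority):
        g = np.asarray(grads[name], dtype=float).copy()
        for higher in priority[:i]:
            g = asymmetric_project(grads[higher], g)
        total += g
    return total


# Toy usage with random attention scores and 2-D gradients.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 4, 4))
student = rng.normal(size=(2, 4, 4))
print(attention_kd_loss(student, teacher))  # positive when the maps differ
print(attention_kd_loss(teacher, teacher))  # 0.0 for identical maps

grads = {
    "planning": np.array([1.0, 0.0]),
    "perception": np.array([-1.0, 1.0]),  # conflicts with planning
}
print(combine_capability_grads(grads, ["planning", "perception"]))
```

In this toy case the perception gradient [-1, 1] conflicts with the planning gradient [1, 0] (dot product -1). Its component along planning is removed, leaving [0, 1], while planning's gradient is untouched; that one-sided treatment is the asymmetry assumed here. Because each projection is applied sequentially, a later projection can partially reintroduce a conflict removed by an earlier one, as in PCGrad. Real teacher and student models may also differ in head count or sequence length, which would need an alignment step that this sketch does not cover.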
