2602.00993v1 Feb 01, 2026 cs.RO

HERMES: A Multimodal Integrated Autonomous Driving System Using Vision-Language Models: A Risk-Aware Integrated System for Long-Tail Environments

HERMES: A Holistic End-to-End Risk-Aware Multimodal Embodied System with Vision-Language Models for Long-Tail Autonomous Driving

Rui Gan

Citations: 98

h-index: 5

Weizhe Tang

Citations: 9

h-index: 2

Junwei You

Citations: 165

h-index: 6

Jiaxi Liu

Citations: 55

h-index: 3

Zhaoying Wang

Citations: 20

h-index: 3

Zilin Huang

Citations: 74

h-index: 4

Fengchen Wei

Citations: 35

h-index: 3

Bin Ran

Citations: 8

h-index: 2

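Recent autonomous driving models benefit considerably from large vision-language models for semantic understanding, but ensuring safe and accurate operation under long-tail conditions remains challenging. The difficulty is especially pronounced in long-tail mixed-traffic scenarios, where diverse road users interact with autonomous vehicles under complex and uncertain conditions. This paper proposes HERMES, a holistic risk-aware end-to-end multimodal autonomous driving framework designed to inject explicit long-tail risk information into trajectory planning. HERMES uses a foundation-model-based annotation pipeline to generate structured Long-Tail Scene Context and Long-Tail Planning Context, which include hazard-centric cues, maneuver intent, and safety preference; this information guides end-to-end planning. HERMES also introduces a Tri-Modal Driving Module that fuses multi-view perception, historical motion information, and semantic guidance to support risk-aware, accurate trajectory planning in long-tail scenarios. In experiments on a real-world long-tail dataset, HERMES consistently outperformed representative end-to-end and VLM-based models in long-tail mixed-traffic scenarios. Ablation studies confirmed the complementary contributions of the key components.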

Original Abstract

End-to-end autonomous driving models increasingly benefit from large vision-language models for semantic understanding, yet ensuring safe and accurate operation under long-tail conditions remains challenging. These challenges are particularly prominent in long-tail mixed-traffic scenarios, where autonomous vehicles must interact with heterogeneous road users, including human-driven vehicles and vulnerable road users, under complex and uncertain conditions. This paper proposes HERMES, a holistic risk-aware end-to-end multimodal driving framework designed to inject explicit long-tail risk cues into trajectory planning. HERMES employs a foundation-model-assisted annotation pipeline to produce structured Long-Tail Scene Context and Long-Tail Planning Context, capturing hazard-centric cues together with maneuver intent and safety preference, and uses these signals to guide end-to-end planning. HERMES further introduces a Tri-Modal Driving Module that fuses multi-view perception, historical motion cues, and semantic guidance, ensuring risk-aware accurate trajectory planning under long-tail scenarios. Experiments on the real-world long-tail dataset demonstrate that HERMES consistently outperforms representative end-to-end and VLM-driven baselines under long-tail mixed-traffic scenarios. Ablation studies verify the complementary contributions of key components.
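The abstract does not specify how the Tri-Modal Driving Module combines its three inputs. As a rough illustration of the general idea only, the sketch below concatenates three modality embeddings and applies a single linear map to produce planar waypoints. The function name `fuse_and_plan`, the concatenation-based fusion, the linear planning head, and all dimensions are assumptions made for this example, not the paper's architecture.

```python
def fuse_and_plan(vision, motion, semantic, weights, horizon):
    """Hypothetical late fusion: concatenate three embeddings, then map
    the fused vector to `horizon` (x, y) waypoints with one linear layer.

    vision, motion, semantic: lists of floats (assumed embeddings for
        multi-view perception, historical motion, and semantic guidance).
    weights: list of 2 * horizon rows, each as long as the fused vector.
    """
    fused = list(vision) + list(motion) + list(semantic)
    if len(weights) != 2 * horizon:
        raise ValueError("weights must have 2 * horizon rows")
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused length")
    # One dot product per output coordinate.
    flat = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    # Pair consecutive outputs into (x, y) waypoints.
    return [(flat[2 * t], flat[2 * t + 1]) for t in range(horizon)]


# Toy inputs with hand-picked weights (illustrative values only).
demo_weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
demo_plan = fuse_and_plan([1.0, 2.0], [0.5], [0.0, 1.0], demo_weights, horizon=2)
```

A learned model would replace the fixed weights with trained parameters and would likely use a richer fusion mechanism; the sketch only shows the data flow from three modality inputs to a trajectory.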

3 Citations

0 Influential

3 Altmetric

18.0 Score
