2604.02710v1 Apr 03, 2026 cs.RO

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

Rui Gan (Citations: 98, h-index: 5)
B. Ran (Citations: 116, h-index: 5)
Sikai Chen (Citations: 533, h-index: 14)
Weizhe Tang (Citations: 9, h-index: 2)
Junwei You (Citations: 165, h-index: 6)
Zilin Huang (Citations: 74, h-index: 4)
Pei Li (Citations: 31, h-index: 2)
Zhuoyu Jiang (Citations: 65, h-index: 3)
Jiaxin Liu (Citations: 2, h-index: 1)
Yan Zhao (Citations: 21, h-index: 3)

Abstract

Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, yet existing benchmarks remain largely ego-centric and therefore cannot systematically assess model performance in infrastructure-centric and cooperative driving conditions. In this work, we introduce V2X-QA, a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. V2X-QA is built around a view-decoupled evaluation protocol that enables controlled comparison under vehicle-only, infrastructure-only, and cooperative driving conditions within a unified multiple-choice question answering (MCQA) framework. The benchmark is organized into a twelve-task taxonomy spanning perception, prediction, and reasoning and planning, and is constructed through expert-verified MCQA annotation to enable fine-grained diagnosis of viewpoint-dependent capabilities. Benchmark results across ten representative state-of-the-art proprietary and open-source models show that viewpoint accessibility substantially affects performance, and infrastructure-side reasoning supports meaningful macroscopic traffic understanding. Results also indicate that cooperative reasoning remains challenging since it requires cross-view alignment and evidence integration rather than simply additional visual input. To address these challenges, we introduce V2X-MoE, a benchmark-aligned baseline with explicit view routing and viewpoint-specific LoRA experts. The strong performance of V2X-MoE further suggests that explicit viewpoint specialization is a promising direction for multi-view reasoning in autonomous driving. Overall, V2X-QA provides a foundation for studying multi-perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X-MoE resources are publicly available at: https://github.com/junwei0001/V2X-QA.
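The view-decoupled protocol described in the abstract scores the same MCQA format separately under vehicle-only, infrastructure-only, and cooperative conditions. The following is a minimal sketch of what such per-view scoring could look like. It is not the authors' evaluation code: the record keys (`view`, `answer`, `prediction`), the view labels, and the toy records are all assumptions made for illustration, and the actual V2X-QA schema may differ.

```python
from collections import defaultdict

# Hypothetical view labels; the real dataset may name these differently.
VIEWS = ("vehicle", "infrastructure", "cooperative")


def accuracy_by_view(records):
    """Compute MCQA accuracy separately for each viewpoint condition.

    `records` is an iterable of dicts with the assumed keys
    'view', 'answer' (gold option letter), and 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        view = record["view"]
        if view not in VIEWS:
            raise ValueError(f"unknown view: {view!r}")
        total[view] += 1
        correct[view] += int(record["prediction"] == record["answer"])
    # Report only the views that actually have questions.
    return {v: correct[v] / total[v] for v in VIEWS if total[v]}


# Toy records, invented purely to exercise the function.
sample = [
    {"view": "vehicle", "answer": "A", "prediction": "A"},
    {"view": "vehicle", "answer": "C", "prediction": "B"},
    {"view": "infrastructure", "answer": "D", "prediction": "D"},
    {"view": "cooperative", "answer": "B", "prediction": "A"},
]
scores = accuracy_by_view(sample)
for view, acc in scores.items():
    print(f"{view}: {acc:.2f}")
```

Keeping the three conditions in separate buckets, rather than pooling them into one accuracy figure, is what allows the kind of viewpoint-dependent comparison the abstract reports.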

