2604.12320v1 Apr 14, 2026 cs.CV

EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

Shan Chen

Citations: 37

h-index: 3

Zhong Cao

Citations: 856

h-index: 5

Yichen Xu

Citations: 50

h-index: 2

Wenxuan Wang

Citations: 138

h-index: 4

Qin Jin

Citations: 19

h-index: 3

Jianzhe Ma

Renmin University of China, School of Information

Citations: 6

h-index: 2

Original Abstract

While video large language models (Video-LLMs) excel at understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities and lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark that grounds perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model reaching only 71.58% accuracy. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.
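The two-axis taxonomy described in the abstract implies that each QA pair carries one label per dimension, so accuracy can be reported overall and separately along each axis. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The record type `QAResult`, its field names, and the labels in the toy data are all hypothetical; the abstract does not name the sub-tasks or describe the scoring format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAResult:
    """One scored QA pair. Every field name here is a hypothetical choice."""
    cognitive_task: str   # label on the cognitive capability axis
    knowledge_task: str   # label on the esports knowledge axis
    correct: bool         # whether the model's answer matched the reference


def accuracy_by_axis(results, axis):
    """Return {label: accuracy in percent} for one taxonomy axis."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        label = getattr(r, axis)
        totals[label] += 1
        hits[label] += int(r.correct)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}


def overall_accuracy(results):
    """Return overall accuracy in percent across all QA pairs."""
    return 100.0 * sum(r.correct for r in results) / len(results)


# Toy data with made-up labels; not drawn from the benchmark.
toy = [
    QAResult("perception", "macro", True),
    QAResult("perception", "micro", True),
    QAResult("reasoning", "macro", True),
    QAResult("reasoning", "micro", False),
]

print(overall_accuracy(toy))
print(accuracy_by_axis(toy, "cognitive_task"))
print(accuracy_by_axis(toy, "knowledge_task"))
```

Because the two axes are decoupled, the same set of results is simply grouped twice, once per axis, rather than being split into disjoint subsets.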

0 Citations

0 Influential

2.5 Altmetric

12.5 Score
