2603.18480v1 Mar 19, 2026 cs.CV

Do Vision Language Models Understand Human Engagement in Games?

Xiyang Hu (Citations: 3, h-index: 1)

Ziyi Wang (Citations: 15, h-index: 2)

Qi Guo (Citations: 40, h-index: 4)

Rishitosh Singh (Citations: 0, h-index: 0)

Abstract

Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision-language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception-understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.
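
The abstract names two evaluation ingredients that are easy to misread: a per-game majority-class baseline and pairwise labels between consecutive windows. The Python sketch below is illustrative only and is not the authors' code. It assumes, hypothetically, that each sample is a `(game, label)` tuple, that per-window engagement is available as a numeric score, and that engagement change is encoded as `"up"`, `"down"`, or `"same"`. The function names and the game identifiers `game_a` and `game_b` are invented for the example, and the paper's actual label definitions may differ.

```python
from collections import Counter, defaultdict


def per_game_majority_accuracy(samples):
    """Accuracy of always predicting each game's most frequent label.

    `samples` is a list of (game, label) tuples (a hypothetical format).
    """
    counts = defaultdict(Counter)
    for game, label in samples:
        counts[game][label] += 1
    # For each game, the majority class is correct exactly as often as it occurs.
    correct = sum(max(counter.values()) for counter in counts.values())
    return correct / len(samples)


def pairwise_labels(scores):
    """Label the engagement change between each pair of consecutive windows.

    `scores` is a sequence of per-window engagement values (assumed numeric).
    """
    labels = []
    for prev, curr in zip(scores, scores[1:]):
        if curr > prev:
            labels.append("up")
        elif curr < prev:
            labels.append("down")
        else:
            labels.append("same")
    return labels


if __name__ == "__main__":
    samples = [
        ("game_a", "high"), ("game_a", "high"), ("game_a", "low"),
        ("game_b", "low"), ("game_b", "low"), ("game_b", "high"), ("game_b", "low"),
    ]
    print(per_game_majority_accuracy(samples))
    print(pairwise_labels([0.2, 0.5, 0.5, 0.1]))
```

Under these assumptions, a model's pointwise accuracy is only meaningful relative to `per_game_majority_accuracy`: if one class dominates within a game, a predictor that ignores the video entirely can already score well, which is the comparison the abstract reports zero-shot VLMs often failing.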
