2602.11103v1 Feb 11, 2026 cs.AI

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Wayne Chi

Citations: 185

h-index: 3

Arnav Yayavaram

Citations: 5

h-index: 1

Siddharth Yayavaram

Citations: 5

h-index: 1

Ameet Talwalkar

Citations: 34,501

h-index: 53

Chris Donahue

Citations: 5

h-index: 1

Qi Wei

Citations: 852

h-index: 4

Alex Wang

Citations: 120

h-index: 5

Valerie Chen

Carnegie Mellon University

Citations: 982

h-index: 15

Yi Fang

Citations: 35

h-index: 3

Runkun Chen

Citations: 645

h-index: 13

Seth Karten

Citations: 40

h-index: 4

Original Abstract

Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed, as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex: the average solution requires over three times as many lines of code and file changes as prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image- and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
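The headline percentages can be turned into approximate task counts with a little arithmetic. The sketch below is only a back-of-the-envelope check: it assumes each reported rate is a single-run success rate over all 132 tasks, which the abstract does not state, and the helper names (`implied_solved`, `percentage_point_gain`) are hypothetical. The per-category rates (46.9% and 31.6%) are left out because the abstract does not give the category sizes.

```python
# Back-of-the-envelope check of the headline figures in the abstract.
# Assumption (not stated in the abstract): each percentage is a single-run
# success rate computed over the full set of 132 tasks.

TOTAL_TASKS = 132


def implied_solved(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Return the whole number of solved tasks closest to the reported rate."""
    return round(rate_percent / 100 * total)


def percentage_point_gain(before: float, after: float) -> float:
    """Return the absolute gain in percentage points, rounded to one decimal."""
    return round(after - before, 1)


best_agent = implied_solved(54.5)       # best agent overall
sonnet_before = implied_solved(33.3)    # Claude Sonnet 4.5 without feedback
sonnet_after = implied_solved(47.7)     # Claude Sonnet 4.5 with feedback

print(best_agent, sonnet_before, sonnet_after)
print(percentage_point_gain(33.3, 47.7))
```

Under that assumption, the best agent's 54.5% corresponds to roughly 72 of 132 tasks, and the Claude Sonnet 4.5 improvement corresponds to roughly 44 solved tasks rising to roughly 63.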
