2602.17594v1 Feb 19, 2026 cs.AI

AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

Phillip Isola

Citations: 54

h-index: 4

Samuel Gershman

Citations: 195

h-index: 3

Lance Ying

Citations: 249

h-index: 9

Prafull Sharma

Citations: 34

h-index: 4

Nathan Cloos

Citations: 5

h-index: 1

Thomas L. Griffiths

Citations: 2,311

h-index: 6

José Hernández-Orallo

Citations: 92

h-index: 5

Joshua B. Tenenbaum

Citations: 492

h-index: 10

Kelsey Allen

Citations: 14

h-index: 3

Kaiyan Zhao

Citations: 56

h-index: 5

Katherine M. Collins

Citations: 95

h-index: 5

Ryan Truong

Citations: 18

h-index: 2

Original Abstract

Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy, the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of the Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
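
The headline result compares model scores to the human average on each game. The abstract does not spell out the scoring procedure, so the following is only a minimal sketch, assuming the comparison is a plain ratio of model score to mean human score. The function names, game names, and numbers are all hypothetical and are not taken from the paper.

```python
def human_normalized(model_score: float, human_mean: float) -> float:
    """Model score as a fraction of the human average on one game (assumed metric)."""
    if human_mean <= 0:
        raise ValueError("human_mean must be positive")
    return model_score / human_mean


def summarize(scores: dict[str, tuple[float, float]], threshold: float = 0.10):
    """Return per-game normalized scores and the fraction of games below threshold.

    `scores` maps a game name to (model_score, human_mean).
    """
    normalized = {game: human_normalized(m, h) for game, (m, h) in scores.items()}
    below = sum(1 for value in normalized.values() if value < threshold)
    return normalized, below / len(normalized)


# Made-up toy data, for illustration only.
toy_scores = {
    "puzzle": (5.0, 100.0),
    "platformer": (2.0, 40.0),
    "card": (30.0, 60.0),
}
normalized, frac_below = summarize(toy_scores)
majority_below = frac_below > 0.5  # "majority of the games" read as more than half
```

Under this reading, a claim like "less than 10% of the human average score on the majority of the games" corresponds to `majority_below` being true at `threshold=0.10`.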
