2605.29512v1 May 28, 2026 cs.AI

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Leshem Choshen

Citations: 1,060

h-index: 14

Yihang Jiang

Citations: 1

h-index: 1

Yoram Bachrach

Citations: 6,399

h-index: 42

Yuhong Dai

Citations: 38

h-index: 3

Yan-Ru Ju

Citations: 15

h-index: 2

Mathieu Laurière

Citations: 64

h-index: 5

T. Kachman

Citations: 632

h-index: 9

Ilya Makarov

Citations: 0

h-index: 0

Jianzhu Yao

Citations: 87

h-index: 4

P. Viswanath

Citations: 28,394

h-index: 55

Yitian Huang

Citations: 54

h-index: 3

Bobby Cheng

Citations: 21

h-index: 3

Cheston Tan

Citations: 75

h-index: 3

I-Chen Wu

Citations: 6

h-index: 1

M. S. Arya

Citations: 0

h-index: 0

A. Anish

Citations: 0

h-index: 0

Aditya Ranjan

Citations: 3

h-index: 1

Yuan Lu

Citations: 44

h-index: 3

A. Thoni

Citations: 0

h-index: 0

Benjamin Kempinski

Citations: 15

h-index: 2

Ben Finch

Citations: 21

h-index: 1

Leon Guertler

Citations: 77

h-index: 2

Viraj Nadkarni

Citations: 77

h-index: 6

Aliaksei Korshuk

Citations: 82

h-index: 2

Alexander Buyantuev

Citations: 1,052

h-index: 16

Siyuan Wu

Citations: 941

h-index: 12

Yu Cheng

Citations: 93

h-index: 4

I-Hsuan Chu

Citations: 7

h-index: 1

Yu-Yu Yang

Citations: 10

h-index: 2

Qi Cao

Citations: 0

h-index: 0

Yiheng Sun

Citations: 201

h-index: 7

Hongkun Yao

Citations: 154

h-index: 8

Jingxuan Fu

Citations: 8

h-index: 2

Hao Liao

Citations: 15

h-index: 2

Mossimo Ebeling

Citations: 0

h-index: 0

Govind Arun

Citations: 30

h-index: 3

Sadhvik Bathini

Citations: 4

h-index: 1

K. Phatnani

Citations: 11

h-index: 1

Ks Paval

Citations: 7

h-index: 1

V. Mehta

Citations: 21

h-index: 1

S. Aravind

Citations: 21

h-index: 2

Nikhil Arora

Citations: 6

h-index: 1

Tanya Upadhyay

Citations: 8

h-index: 1

Amol Bandagale

Citations: 0

h-index: 0

Chun-Pao Hsiao

Citations: 2

h-index: 1

Yuting Lin

Citations: 52

h-index: 4

A. Chung

Citations: 0

h-index: 0

Jeremiah Thomas

Citations: 0

h-index: 0

Maria Polukarov

Citations: 4

h-index: 1

Atlas Wang

Citations: 52

h-index: 3

K. Wang

Citations: 79

h-index: 5

Tiru Wu

Citations: 0

h-index: 0

Jiwei Zhang

Citations: 4

h-index: 1

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.
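The error-survival confound described in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the paper's error-attribution method or the MG-Ref protocol: the `GameRecord` type, its fields, the `win_breakdown` helper, and the toy records are all hypothetical, and games are simplified to a single winner and loser. It separates an agent's wins into those that ended on an opponent's invalid move and those reached through completed play.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameRecord:
    """Hypothetical, simplified record of one game outcome (not the released schema)."""
    winner: str           # agent credited with the win
    loser: str            # agent credited with the loss
    ended_by_error: bool  # True if the game ended on the loser's invalid move


def win_breakdown(records, agent):
    """Split an agent's wins into error-driven wins and wins from completed play."""
    wins = [r for r in records if r.winner == agent]
    error_wins = sum(1 for r in wins if r.ended_by_error)
    return {
        "wins": len(wins),
        "error_wins": error_wins,
        "played_out_wins": len(wins) - error_wins,
        # Share of wins that came from the opponent's failure rather than play.
        "error_win_share": error_wins / len(wins) if wins else 0.0,
    }


# Toy records invented for illustration; they are not drawn from the dataset.
toy_records = [
    GameRecord("agent_a", "agent_b", ended_by_error=True),
    GameRecord("agent_a", "agent_c", ended_by_error=True),
    GameRecord("agent_a", "agent_b", ended_by_error=False),
    GameRecord("agent_b", "agent_a", ended_by_error=False),
]

summary = win_breakdown(toy_records, "agent_a")
print(summary)
```

A high `error_win_share` for a top-ranked agent would suggest that its rating reflects surviving opponent errors at least as much as strategic play, which is the kind of leaderboard-validity concern the abstract raises for failure-heavy environments.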

