2605.29512v1 May 28, 2026 cs.AI

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Leshem Choshen
Yihang Jiang
Yoram Bachrach
Yuhong Dai
Yan-Ru Ju
Mathieu Laurière
T. Kachman
Ilya Makarov
Jianzhu Yao
P. Viswanath
Yitian Huang
Bobby Cheng
Cheston Tan
I-Chen Wu
M. S. Arya
A. Anish
Aditya Ranjan
Yuan Lu
A. Thoni
Benjamin Kempinski
Ben Finch
Leon Guertler
Viraj Nadkarni
Aliaksei Korshuk
Alexander Buyantuev
Siyuan Wu
Yu Cheng
I-Hsuan Chu
Yu-Yu Yang
Qi Cao
Yiheng Sun
Hongkun Yao
Jingxuan Fu
Hao Liao
Mossimo Ebeling
Govind Arun
Sadhvik Bathini
K. Phatnani
Ks Paval
V. Mehta
S. Aravind
Nikhil Arora
Tanya Upadhyay
Amol Bandagale
Chun-Pao Hsiao
Yuting Lin
A. Chung
Jeremiah Thomas
Maria Polukarov
Atlas Wang
K. Wang
Tiru Wu
Jiwei Zhang

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.
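
The error-survival confound mentioned in the abstract can be illustrated with a small sketch. This is not the paper's error-attribution procedure or the MG-Ref protocol; the record fields (`players`, `winner`, `error_by`) and the function `win_rates` are hypothetical names chosen for illustration, and the toy games are invented. The idea is simply to compare an agent's overall win rate with its win rate in games where no player made an invalid move.

```python
from collections import defaultdict

def win_rates(games):
    """Return {agent: (overall_win_rate, clean_win_rate)}.

    Hypothetical record schema (an assumption, not the released dataset format):
      "players":  list of agent names in the game
      "winner":   name of the winning agent
      "error_by": agent that made an invalid move, or None for a clean game
    clean_win_rate uses only games with no invalid move and is None
    when the agent played no clean games.
    """
    played = defaultdict(int)
    won = defaultdict(int)
    clean_played = defaultdict(int)
    clean_won = defaultdict(int)
    for g in games:
        clean = g["error_by"] is None
        for p in g["players"]:
            played[p] += 1
            if clean:
                clean_played[p] += 1
        won[g["winner"]] += 1
        if clean:
            clean_won[g["winner"]] += 1
    return {
        p: (won[p] / played[p],
            clean_won[p] / clean_played[p] if clean_played[p] else None)
        for p in played
    }

# Toy data: "careful" never errs, "clever" wins every clean game but often errs.
games = [
    {"players": ["careful", "clever"], "winner": "careful", "error_by": "clever"},
    {"players": ["careful", "clever"], "winner": "careful", "error_by": "clever"},
    {"players": ["careful", "clever"], "winner": "clever", "error_by": None},
    {"players": ["careful", "clever"], "winner": "clever", "error_by": None},
    {"players": ["careful", "clever"], "winner": "careful", "error_by": "clever"},
]
rates = win_rates(games)
```

In this toy data the agent that never errs leads on overall win rate while losing every clean game, which is the kind of gap between robustness and strategic ability that an error-aware analysis is meant to expose.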
