2604.16022v1 Apr 17, 2026 cs.AI

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

P. Schramowski

Hanzhao Lin

Kristian Kersting

Lukas Helff

Hikaru Shindo

Original Abstract

As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
