2605.27068v1 May 26, 2026 cs.CL

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Hao Liu
Zeyu Li
Fuyuan Lyu
McGill University
Xue Liu
Bowei He
Ye Yuan
Haolun Wu
Yonghan Yang
Ruiqi Song
Weien Li
Xiangyun Kong
Chang-Gyoung Han
Zicheng Zhao
Zixuan Dong
Jikun Kang

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier vision-language models (VLMs) in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.
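
The idea of checking discussion claims against a trajectory reconstructed from engine logs can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the QUACK codebase: the `LogEvent` and `SpatialClaim` record types, the tick-based time window, and the three labels are hypothetical, and the sketch covers only spatial claims. It also does not try to separate deliberate deception from hallucination, which the real pipeline may handle differently.

```python
from dataclasses import dataclass


# Hypothetical record types; the abstract does not describe QUACK's log schema.
@dataclass(frozen=True)
class LogEvent:
    tick: int    # engine time step
    agent: str   # agent identifier
    room: str    # room the agent occupied at this tick


@dataclass(frozen=True)
class SpatialClaim:
    speaker: str     # agent making the claim about itself
    room: str        # room the speaker says it was in
    start_tick: int  # inclusive start of the claimed time window
    end_tick: int    # inclusive end of the claimed time window


def build_trajectory(events):
    """Group log events into a per-agent list of (tick, room), ordered by tick."""
    trajectory = {}
    for event in sorted(events, key=lambda e: e.tick):
        trajectory.setdefault(event.agent, []).append((event.tick, event.room))
    return trajectory


def check_spatial_claim(claim, trajectory):
    """Label one claim as 'supported', 'hallucinated', or 'unverifiable'."""
    rooms_in_window = [
        room
        for tick, room in trajectory.get(claim.speaker, [])
        if claim.start_tick <= tick <= claim.end_tick
    ]
    if not rooms_in_window:
        # No logged positions for this speaker in the window: nothing to check.
        return "unverifiable"
    return "supported" if claim.room in rooms_in_window else "hallucinated"


def hallucination_rate(claims, trajectory):
    """Fraction of verifiable claims that contradict the logged trajectory."""
    labels = [check_spatial_claim(c, trajectory) for c in claims]
    verifiable = [label for label in labels if label != "unverifiable"]
    if not verifiable:
        return 0.0
    return verifiable.count("hallucinated") / len(verifiable)


events = [
    LogEvent(1, "red", "cafeteria"),
    LogEvent(2, "red", "hallway"),
    LogEvent(3, "red", "engine"),
]
claims = [
    SpatialClaim("red", "engine", 2, 3),    # matches the log
    SpatialClaim("red", "medbay", 1, 3),    # contradicts the log
    SpatialClaim("blue", "hallway", 1, 3),  # no log data for this agent
]
trajectory = build_trajectory(events)
labels = [check_spatial_claim(c, trajectory) for c in claims]
rate = hallucination_rate(claims, trajectory)
```

In this toy run the rate is computed over the two verifiable claims only, mirroring the abstract's phrasing of a hallucination percentage over "verifiable spatial claims".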
