2601.09503v1 Jan 14, 2026 cs.AI

What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding

Siyuan Liu

Xinze Li

Ziyue Zhu

Yixin Cao

Yu-Gang Jiang

Hongbang Yuan

Abstract

Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks, yet their ability to generalize across varying environments remains an under-examined concern. Current evaluation paradigms predominantly rely on trajectory-based metrics that measure task success, while failing to assess whether agents possess a grounded, transferable model of the environment. To address this gap, we propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. We instantiate this paradigm in T2QBench, a suite comprising 30 environments and 1,967 grounded QA pairs across multiple difficulty levels. Our extensive experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment. These findings identify proactive exploration and fine-grained state representation as primary bottlenecks, offering a robust foundation for developing more generalizable autonomous agents.
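The idea of decoupling task execution from world-state understanding can be illustrated with a small sketch that reports the two signals separately. This is not the paper's implementation: `QuizItem`, `quiz_accuracy`, `EpisodeReport`, the toy questions, and the normalized exact-match scoring rule are all hypothetical, chosen only to show how a deterministic quiz score might sit next to a task-success flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuizItem:
    """Hypothetical QA pair; the answer is assumed to come from ground-truth environment state."""
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so that matching is deterministic."""
    return " ".join(text.lower().split())


def quiz_accuracy(items: list[QuizItem], responses: list[str]) -> float:
    """Fraction of items whose response matches the ground truth after normalization.

    Exact matching is an assumption made for this sketch, not the paper's scoring rule.
    """
    if len(items) != len(responses):
        raise ValueError("one response is required per quiz item")
    if not items:
        return 0.0
    correct = sum(
        normalize(item.answer) == normalize(response)
        for item, response in zip(items, responses)
    )
    return correct / len(items)


@dataclass(frozen=True)
class EpisodeReport:
    """Keeps the two signals apart instead of folding them into one score."""
    task_success: bool
    quiz_accuracy: float


# Toy episode: the agent finished the task but answered one of two questions wrongly.
items = [
    QuizItem("Where is the key?", "kitchen drawer"),
    QuizItem("Is the cellar door locked?", "yes"),
]
responses = ["Kitchen  drawer", "no"]
report = EpisodeReport(task_success=True, quiz_accuracy=quiz_accuracy(items, responses))
print(report)
```

In this toy episode the task succeeds while only half of the quiz is answered correctly, which is the kind of gap between task success and environment understanding that the abstract describes.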
