2602.02196v1 Feb 02, 2026 cs.AI

TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Xinyu Che

Citations: 14

h-index: 3

Qiushi Sun

Citations: 103

h-index: 5

Qika Lin

Citations: 1,790

h-index: 24

Hang Yan

Citations: 63

h-index: 5

Fangzhi Xu

Citations: 666

h-index: 10

Zichen Ding

Citations: 1,785

h-index: 11

Kanzhi Cheng

Nanjing University

Citations: 1,274

h-index: 12

Jian Zhang

Citations: 26

h-index: 3

Tao Qin

Citations: 29

h-index: 3

Jun Liu

Citations: 149

h-index: 7

Original Abstract

Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms underlying how and why TTI succeeds or fails remain poorly understood, and existing evaluation metrics fail to capture agents' task optimization efficiency, their behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps, we propose Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework (1) measures the overall temporal dynamics of task completion and identifies whether performance is primarily constrained by (2) recursive looping behaviors or (3) burdensome accumulated memory. Through extensive experiments across diverse agents and environments, TIDE highlights that improving agent performance requires more than scaling internal reasoning: it also calls for explicitly optimizing the interaction dynamics between the agent and the environment.

6 Citations

0 Influential

12 Altmetric

66.0 Score
