2602.02475v1 Feb 02, 2026 cs.AI

AgentRx: Diagnosing AI Agent Failures from Execution Trajectories

Shraddha Barke (Citations: 741, h-index: 7)
Alind Khare (Citations: 301, h-index: 8)
Suman Nath (Citations: 107, h-index: 5)
Chetan Bansal (Citations: 794, h-index: 15)
Arnav Goyal (Citations: 66, h-index: 3)
Avaljot Singh (Citations: 39, h-index: 3)

Abstract

AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and release a novel benchmark of 115 failed trajectories spanning structured API workflows, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory derived, cross-domain failure taxonomy. To mitigate the human cost of failure attribution, we present AGENTRX, an automated domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints, evaluates them step-by-step, and produces an auditable validation log of constraint violations with associated evidence; an LLM-based judge uses this log to localize the critical step and category. Our framework improves step localization and failure attribution over existing baselines across three domains.
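The pipeline described in the abstract (evaluate constraints step by step, record violations with evidence, then localize the critical step) can be illustrated with a minimal sketch. This is not the AGENTRX implementation: the `Step`, `Constraint`, and `Violation` types, the hand-written constraint, and the toy trajectory are all hypothetical, and the "earliest violation" rule is a simple stand-in for the LLM-based judge that the paper uses to read the log.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One step of an agent trajectory (hypothetical schema)."""
    index: int
    agent: str
    action: str
    output: str


@dataclass
class Constraint:
    """A named check; `check` returns True when the step satisfies it."""
    name: str
    check: Callable[[Step], bool]


@dataclass
class Violation:
    """One entry of the validation log: which step broke which constraint."""
    step_index: int
    constraint: str
    evidence: str


def build_validation_log(trajectory: List[Step],
                         constraints: List[Constraint]) -> List[Violation]:
    """Evaluate every constraint on every step and record violations."""
    log = []
    for step in trajectory:
        for constraint in constraints:
            if not constraint.check(step):
                evidence = f"{step.agent}.{step.action} -> {step.output}"
                log.append(Violation(step.index, constraint.name, evidence))
    return log


def earliest_violation(log: List[Violation]) -> Optional[Violation]:
    """Stand-in for the judge: pick the first violating step.

    Assumption for illustration only; the paper uses an LLM-based judge.
    """
    return min(log, key=lambda v: v.step_index) if log else None


# Toy trajectory and a single hand-written constraint (both hypothetical).
trajectory = [
    Step(0, "planner", "search_flights", "3 results"),
    Step(1, "booker", "book_flight", "ERROR: missing passenger_id"),
    Step(2, "booker", "confirm", "ERROR: no booking"),
]
constraints = [
    Constraint("tool_output_has_no_error",
               lambda s: not s.output.startswith("ERROR")),
]

log = build_validation_log(trajectory, constraints)
critical = earliest_violation(log)
for violation in log:
    print(violation)
print("critical step:", critical.step_index if critical else None)
```

Keeping the log as plain records with an evidence string is what makes it auditable: a human or a judge model can inspect each violation without re-running the trajectory. In the framework itself the constraints are synthesized rather than hand-written, and the final decision also assigns a failure category.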

Citations: 19 (2 influential), Altmetric: 7.5, Score: 60.5
