2601.16280v1 Jan 22, 2026 cs.AI

When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems

Donghao Huang

Gauri Malwe

Zhaoxia Wang

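Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation, but systematic methodologies for evaluating tool-use reliability remain underdeveloped. This study introduces a comprehensive diagnostic framework that uses big data analytics to evaluate the procedural reliability of intelligent agent systems, addressing key requirements for SME-centric deployment in privacy-sensitive environments. The proposed approach features an error taxonomy that classifies failure modes arising across tool initialization, parameter handling, execution, and result interpretation into 12 categories. Through systematic evaluation of 1,980 deterministic test cases across diverse hardware configurations, covering open-weight models (Qwen2.5 series, Functionary) and proprietary models (GPT-4, Claude 3.5/3.7), the study presents reliability thresholds for real-world deployment. The analysis shows that procedural reliability, particularly tool initialization failures, is the primary bottleneck for smaller models, while qwen2.5:32b achieves flawless performance on par with GPT-4.1. In addition, a mid-sized model (qwen2.5:14b) offers a practical accuracy-efficiency trade-off on commodity hardware (96.6% success rate, 7.3 s latency), enabling cost-effective intelligent agent deployment for resource-constrained organizations. This work establishes foundational infrastructure for systematic reliability evaluation of tool-augmented multi-agent AI systems.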

Original Abstract

Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation, yet systematic evaluation methodologies for assessing tool-use reliability remain underdeveloped. We introduce a comprehensive diagnostic framework that leverages big data analytics to evaluate procedural reliability in intelligent agent systems, addressing critical needs for SME-centric deployment in privacy-sensitive environments. Our approach features a 12-category error taxonomy capturing failure modes across tool initialization, parameter handling, execution, and result interpretation. Through systematic evaluation of 1,980 deterministic test instances spanning both open-weight models (Qwen2.5 series, Functionary) and proprietary alternatives (GPT-4, Claude 3.5/3.7) across diverse edge hardware configurations, we identify actionable reliability thresholds for production deployment. Our analysis reveals that procedural reliability, particularly tool initialization failures, constitutes the primary bottleneck for smaller models, while qwen2.5:32b achieves flawless performance matching GPT-4.1. The framework demonstrates that mid-sized models (qwen2.5:14b) offer practical accuracy-efficiency trade-offs on commodity hardware (96.6% success rate, 7.3 s latency), enabling cost-effective intelligent agent deployment for resource-constrained organizations. This work establishes foundational infrastructure for systematic reliability evaluation of tool-augmented multi-agent AI systems.
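
The headline metrics in the abstract (per-model success rate, latency, and failures by stage) could be aggregated roughly as in the sketch below. This is an illustration only, not the authors' implementation: the names `Stage`, `RunRecord`, and `summarize` are hypothetical, the model names and numbers in the sample data are made up, and only the four stages named in the abstract are modeled because the 12 individual categories are not listed there.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Stage(Enum):
    """The four stages named in the abstract (hypothetical encoding)."""
    TOOL_INITIALIZATION = "tool_initialization"
    PARAMETER_HANDLING = "parameter_handling"
    EXECUTION = "execution"
    RESULT_INTERPRETATION = "result_interpretation"


@dataclass(frozen=True)
class RunRecord:
    """One test instance; failed_stage is None when the run succeeded."""
    model: str
    latency_s: float
    failed_stage: Optional[Stage] = None


def summarize(records: List[RunRecord]) -> Dict[str, dict]:
    """Compute success rate, mean latency, and failure counts per model."""
    by_model: Dict[str, List[RunRecord]] = {}
    for rec in records:
        by_model.setdefault(rec.model, []).append(rec)

    summary: Dict[str, dict] = {}
    for model, runs in by_model.items():
        failures = Counter(r.failed_stage for r in runs if r.failed_stage is not None)
        successes = sum(1 for r in runs if r.failed_stage is None)
        summary[model] = {
            "success_rate": successes / len(runs),
            "mean_latency_s": sum(r.latency_s for r in runs) / len(runs),
            "failures_by_stage": dict(failures),
        }
    return summary


# Made-up sample data, purely to show the call shape.
demo = [
    RunRecord("model-a", 2.0),
    RunRecord("model-a", 4.0, Stage.TOOL_INITIALIZATION),
    RunRecord("model-b", 1.0),
    RunRecord("model-b", 3.0),
]
print(summarize(demo))
```

Keeping the failing stage on each record, rather than a bare pass/fail flag, is what allows a bottleneck such as tool initialization to show up in the per-stage counts.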
