2602.03128v1 Feb 03, 2026 cs.AI

Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis

Abdelghny Orogat

Ana Rostam

Essam Mansour

Abstract

Multi-agent LLM frameworks are widely used to accelerate the development of agent systems powered by large language models (LLMs). These frameworks impose distinct architectural structures that govern how agents interact, store information, and coordinate tasks. However, their impact on system performance remains poorly understood. This gap is critical, as architectural choices alone can induce order-of-magnitude differences in latency and throughput, as well as substantial variation in accuracy and scalability. Addressing this challenge requires (i) jointly evaluating multiple capabilities, such as orchestration overhead, memory behavior, planning, specialization, and coordination, and (ii) conducting these evaluations under controlled, framework-level conditions to isolate architectural effects. Existing benchmarks focus on individual capabilities and lack standardized framework-level evaluation. We address these limitations by (i) introducing an architectural taxonomy for systematically comparing multi-agent LLM frameworks along fundamental dimensions, and (ii) developing MAFBench, a unified evaluation suite that integrates existing benchmarks under a standardized execution pipeline. Using MAFBench, we conduct a controlled empirical study across several widely used frameworks. Our results show that framework-level design choices alone can increase latency by over 100x, reduce planning accuracy by up to 30%, and lower coordination success from above 90% to below 30%. Finally, we translate our findings into concrete architectural design principles and framework selection guidance, and outline promising future research directions.
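The idea of evaluating frameworks "under controlled, framework-level conditions" can be illustrated with a minimal sketch. This is not MAFBench's actual pipeline or API, which the abstract does not describe; every name below (`stub_llm`, `direct_call`, `chained_agents`, `measure`, `run_benchmark`) is hypothetical. The sketch assumes the model call is replaced by a deterministic stub, so that any timing difference between adapters comes from the orchestration layer rather than from the model.

```python
import statistics
import time
from typing import Callable, Dict, List

LLM = Callable[[str], str]
Adapter = Callable[[str, LLM], str]


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: deterministic and near-instant."""
    return f"echo:{prompt}"


def direct_call(task: str, llm: LLM) -> str:
    """Baseline adapter: a single model call with no orchestration."""
    return llm(task)


def chained_agents(task: str, llm: LLM, hops: int = 3) -> str:
    """Toy orchestration pattern: each hop re-sends the full message history."""
    history: List[str] = [task]
    for _ in range(hops):
        history.append(llm(" | ".join(history)))
    return history[-1]


def measure(adapter: Adapter, task: str, repeats: int = 200) -> float:
    """Median wall-clock seconds for one run of `adapter` on `task`."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        adapter(task, stub_llm)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def run_benchmark(adapters: Dict[str, Adapter], task: str) -> Dict[str, float]:
    """Run every adapter on the same task with the same stubbed model."""
    return {name: measure(fn, task) for name, fn in adapters.items()}


if __name__ == "__main__":
    results = run_benchmark(
        {"direct": direct_call, "chained": chained_agents},
        task="summarize the report",
    )
    for name, seconds in results.items():
        print(f"{name}: {seconds * 1e6:.2f} microseconds (median)")
```

Holding the task and the model fixed while varying only the adapter is what lets the measurement be attributed to the architecture. A real study would also need to control for network latency, token counts, and caching, none of which this sketch models.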
