2605.14866v1 May 14, 2026 cs.SE

Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought

Gong Zhang (citations: 64, h-index: 4)
Tong Jia (citations: 432, h-index: 11)
Chiming Duan (citations: 291, h-index: 10)
Minghua He (citations: 125, h-index: 7)
Rongqian Wang (citations: 8, h-index: 2)
Meiling Wang (citations: 47, h-index: 3)
Renhai Chen (citations: 67, h-index: 4)
Ying Li (citations: 23, h-index: 2)
Lingzhe Zhang (citations: 328, h-index: 10)
Kang Wang (citations: 0, h-index: 0)
Xiang Peng (citations: 7, h-index: 1)

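Modern microservice systems are growing more complex because of dynamic interactions and changing runtime environments, and failures are becoming more frequent as a result. Accurate root cause localization (RCL) is therefore essential to system reliability. Various machine learning and deep learning methods have been applied to this problem, but they often offer limited interpretability and transfer poorly across environments. Large language model (LLM)-based methods have recently been proposed to address these issues, yet they still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and hurt inference efficiency. In this paper, we conduct a comprehensive study of how human site reliability engineers (SREs) perform root cause localization in practice and why existing LLM-based methods fall short. Based on these findings, we propose RCLAgent, an in-depth root cause localization framework for microservice systems that implements multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph, assigns each span to a dedicated agent, and organizes the agents recursively and in parallel according to the graph topology. The final diagnosis is obtained by synthesizing the root-level diagnosis report and the global evidence graph. Extensive experiments on multiple public benchmarks show that RCLAgent consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.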

Original Abstract

As modern microservice systems grow increasingly complex due to dynamic interactions and evolving runtime environments, they experience failures with rising frequency. Ensuring system reliability therefore critically depends on accurate root cause localization (RCL). While numerous traditional machine learning and deep learning approaches have been explored for this task, they often suffer from limited interpretability and poor transferability across deployments. More recently, large language model (LLM)-based methods have been proposed to address these issues. However, existing LLM-based approaches still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and impair inference efficiency. In this paper, we conduct a comprehensive study of both how human SREs perform root cause localization in practice and why existing LLM-based methods fall short. Motivated by these findings, we introduce RCLAgent, an in-depth root cause localization framework for microservice systems that realizes multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root-Level Diagnosis Report and the Global Evidence Graph. Extensive experiments on multiple public benchmarks demonstrate that RCLAgent consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
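The recursive, parallel decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Span` and `Report` schemas, the `span_agent` function, and the "blame the deepest failing downstream span" rule are all assumptions made for illustration, and a simple rule stands in for what would be an LLM call. The sketch only shows the shape of the idea: each span gets its own agent with a small local context, sibling subtrees are diagnosed concurrently, and the root-level report is combined with the collected per-span evidence.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """One node of a trace graph (hypothetical schema, not from the paper)."""
    span_id: str
    service: str
    error: bool = False
    children: List["Span"] = field(default_factory=list)


@dataclass
class Report:
    """Diagnosis report that an agent hands back to its parent."""
    span_id: str
    suspect: Optional[str]  # service blamed for the failure, if any
    depth: int              # hops from this span down to the blamed span


async def span_agent(span: Span, child_reports: List[Report]) -> Report:
    """Rule-based stand-in for an LLM-backed agent.

    It sees only its own span and its children's reports, which keeps the
    per-agent context small regardless of the size of the whole trace.
    """
    await asyncio.sleep(0)  # placeholder for an asynchronous model call
    blamed = [r for r in child_reports if r.suspect is not None]
    if blamed:
        # Assumed heuristic: errors propagate upward, so trust the deepest suspect.
        deepest = max(blamed, key=lambda r: r.depth)
        return Report(span.span_id, deepest.suspect, deepest.depth + 1)
    if span.error:
        return Report(span.span_id, span.service, 0)
    return Report(span.span_id, None, 0)


async def diagnose(span: Span, evidence: Dict[str, Report]) -> Report:
    """Recurse along the trace graph; sibling subtrees run concurrently."""
    child_reports = await asyncio.gather(
        *(diagnose(child, evidence) for child in span.children)
    )
    report = await span_agent(span, list(child_reports))
    evidence[span.span_id] = report  # shared store standing in for an evidence graph
    return report


def localize(root: Span) -> Tuple[Optional[str], List[str]]:
    """Combine the root-level report with the per-span evidence."""
    evidence: Dict[str, Report] = {}
    root_report = asyncio.run(diagnose(root, evidence))
    if root_report.suspect is None:
        return None, []
    support = sorted(
        sid for sid, rep in evidence.items() if rep.suspect == root_report.suspect
    )
    return root_report.suspect, support


def example_trace() -> Span:
    """frontend -> checkout -> payment (failing), frontend -> catalog (healthy)."""
    payment = Span("s3", "payment", error=True)
    checkout = Span("s2", "checkout", error=True, children=[payment])
    catalog = Span("s4", "catalog")
    return Span("s1", "frontend", error=True, children=[checkout, catalog])


if __name__ == "__main__":
    print(localize(example_trace()))
```

In this toy trace the error surfaces at `frontend` and `checkout`, but the recursion attributes it to `payment`, the deepest failing span. `asyncio.gather` is used instead of a shared thread pool because nested blocking waits on a bounded pool can deadlock when parents wait on their own children.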
