2601.02732v1 Jan 06, 2026 cs.SE

Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices

Lingzhe Zhang (Citations: 73, h-index: 5)

Tong Jia (Citations: 432, h-index: 11)

Yunpeng Zhai (Citations: 197, h-index: 8)

Leyi Pan (Citations: 12, h-index: 2)

Chiming Duan (Citations: 291, h-index: 10)

Minghua He (Citations: 125, h-index: 7)

Ying Li (Citations: 483, h-index: 12)

Mengxi Jia (Citations: 201, h-index: 8)

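Modern microservice systems are becoming increasingly widespread and complex, often consisting of hundreds or thousands of fine-grained, interdependent subsystems. These systems experience frequent failures, and accurate root cause localization is essential to keeping them reliable. Traditional graph-based and deep learning approaches have been studied extensively for this problem, but they often depend heavily on predefined schemas and struggle to adapt to changing operational environments. As a result, LLM-based methods have recently been proposed. These methods still have two major limitations: first, shallow, symptom-centric reasoning that undermines accuracy; second, a lack of reuse across alerts, which causes redundant reasoning and increases latency. In this paper, we conduct a comprehensive study of how Site Reliability Engineers (SREs) analyze the root causes of system failures, drawing on professionals from multiple organizations. Our findings show that expert root cause analysis exhibits three key characteristics: recursiveness, multi-dimensional expansion, and cross-modal reasoning. Building on these findings, we introduce AMER-RCL, an agentic memory enhanced recursive reasoning framework for root cause localization in microservices. AMER-RCL uses the Recursive Reasoning RCL engine, a multi-agent framework that performs recursive reasoning on each alert to progressively refine candidate causes. In addition, Agentic Memory accumulates and reuses reasoning from prior alerts within a time window, reducing redundant exploration and shortening inference latency. Experimental results show that AMER-RCL consistently outperforms state-of-the-art methods in both root cause localization accuracy and inference efficiency.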

Original Abstract

As contemporary microservice systems become increasingly popular and complex (often comprising hundreds or even thousands of fine-grained, interdependent subsystems), they are experiencing more frequent failures. Ensuring system reliability thus demands accurate root cause localization. While many traditional graph-based and deep learning approaches have been explored for this task, they often rely heavily on pre-defined schemas that struggle to adapt to evolving operational contexts. Consequently, a number of LLM-based methods have recently been proposed. However, these methods still face two major limitations: shallow, symptom-centric reasoning that undermines accuracy, and a lack of cross-alert reuse that leads to redundant reasoning and high latency. In this paper, we conduct a comprehensive study of how Site Reliability Engineers (SREs) localize the root causes of failures, drawing insights from professionals across multiple organizations. Our investigation reveals that expert root cause analysis exhibits three key characteristics: recursiveness, multi-dimensional expansion, and cross-modal reasoning. Motivated by these findings, we introduce AMER-RCL, an agentic memory enhanced recursive reasoning framework for root cause localization in microservices. AMER-RCL employs the Recursive Reasoning RCL engine, a multi-agent framework that performs recursive reasoning on each alert to progressively refine candidate causes, while Agentic Memory incrementally accumulates and reuses reasoning from prior alerts within a time window to reduce redundant exploration and lower inference latency. Experimental results demonstrate that AMER-RCL consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
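
The abstract gives no implementation details, so the following is only a toy sketch of the two ideas it names: recursive refinement of a candidate cause, and reuse of earlier conclusions within a time window. Everything here is an assumption made for illustration: the names `WindowedMemory` and `localize`, the dependency map, the set of anomalous services, and the rule "follow the first anomalous dependency". AMER-RCL itself is described as a multi-agent LLM framework; this sketch replaces that reasoning with a simple graph walk and should not be read as the authors' method.

```python
from dataclasses import dataclass, field


@dataclass
class WindowedMemory:
    """Hypothetical stand-in for Agentic Memory: keeps conclusions for a time window."""
    window_seconds: float
    _entries: dict = field(default_factory=dict)  # service -> (timestamp, root_cause)

    def recall(self, service, now):
        # Return a stored conclusion only if it is still inside the window.
        entry = self._entries.get(service)
        if entry is not None and now - entry[0] <= self.window_seconds:
            return entry[1]
        return None

    def remember(self, service, root_cause, now):
        self._entries[service] = (now, root_cause)


def localize(service, deps, anomalous, memory, now, stats, visited=None):
    """Recursively follow anomalous dependencies until a service has none left."""
    visited = set() if visited is None else visited
    cached = memory.recall(service, now)
    if cached is not None:
        stats["reused"] += 1  # an earlier alert already resolved this service
        return cached
    stats["explored"] += 1
    visited.add(service)
    root = service  # assumption: blame this service if no dependency is anomalous
    for dep in deps.get(service, []):
        if dep in anomalous and dep not in visited:
            root = localize(dep, deps, anomalous, memory, now, stats, visited)
            break
    memory.remember(service, root, now)
    return root


# Hypothetical topology: two alerts that share the same failing database.
deps = {"frontend": ["cart", "checkout"], "cart": ["db"], "checkout": ["db"], "db": []}
anomalous = {"frontend", "cart", "checkout", "db"}
memory = WindowedMemory(window_seconds=300)
stats = {"explored": 0, "reused": 0}

first = localize("frontend", deps, anomalous, memory, now=0.0, stats=stats)
second = localize("checkout", deps, anomalous, memory, now=60.0, stats=stats)
print(first, second, stats)  # db db {'explored': 4, 'reused': 1}
```

In this toy run the second alert reaches `db` and stops there, because the conclusion stored while handling the first alert is still inside the 300-second window. That is the kind of cross-alert reuse the abstract credits with reducing redundant exploration.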
