2602.21611v1 Feb 25, 2026 cs.SE

Structurally Aligned Subtask-Level Memory for Software Engineering Agents

Wencong Zeng (Citations: 37, h-index: 3)

Kangning Shen (Citations: 1, h-index: 1)

Jingyuan Zhang (Citations: 117, h-index: 5)

Chenxi Sun (Citations: 25, h-index: 2)

Yang Yue (Citations: 149, h-index: 4)

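Large language models (LLMs) have shown considerable potential as autonomous software engineering (SWE) agents. Recent work has explored adding memory mechanisms to these agents to support long-horizon reasoning. These approaches, however, typically operate at a coarse instance granularity, treating the entire problem-solving episode as the basic unit of storage and retrieval. This work shows empirically that instance-level memory suffers from a fundamental granularity mismatch, which leads to misguided retrieval when tasks with similar surface descriptions require different reasoning logic at specific stages. To address this, we propose Structurally Aligned Subtask-Level Memory, a method that aligns memory storage, retrieval, and updating with the agent's functional decomposition. In extensive experiments on SWE-bench Verified, the method consistently outperforms both vanilla agents and strong instance-level memory baselines across a range of backbone models, raising mean Pass@1 over the vanilla agent by 4.7 percentage points on average (for example, 6.8 percentage points on Gemini 2.5 Pro). The gains grow as the number of interaction steps increases, indicating that leveraging past experience helps long-horizon reasoning in complex software engineering tasks.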

Original Abstract

Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents. Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning. However, these approaches typically operate at a coarse instance granularity, treating the entire problem-solving episode as the atomic unit of storage and retrieval. We empirically demonstrate that instance-level memory suffers from a fundamental granularity mismatch, resulting in misguided retrieval when tasks with similar surface descriptions require distinct reasoning logic at specific stages. To address this, we propose Structurally Aligned Subtask-Level Memory, a method that aligns memory storage, retrieval, and updating with the agent's functional decomposition. Extensive experiments on SWE-bench Verified demonstrate that our method consistently outperforms both vanilla agents and strong instance-level memory baselines across diverse backbones, improving mean Pass@1 over the vanilla agent by +4.7 pp on average (e.g., +6.8 pp on Gemini 2.5 Pro). Performance gains grow with more interaction steps, showing that leveraging past experience benefits long-horizon reasoning in complex software engineering tasks.
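The abstract does not give implementation details, so the following is only a rough sketch of what keying memory by subtask rather than by whole episode might look like. It is not the authors' implementation. The stage labels (`"localize"`, `"patch"`), the `MemoryEntry` and `SubtaskMemory` names, and the word-overlap similarity are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    subtask_type: str  # hypothetical stage label, e.g. "localize" or "patch"
    context: str       # short description of the situation at that stage
    lesson: str        # distilled experience to reuse later


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for whatever retriever is really used."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class SubtaskMemory:
    """Stores, retrieves, and updates entries per subtask type (illustrative only)."""

    def __init__(self) -> None:
        self._store: dict[str, list[MemoryEntry]] = defaultdict(list)

    def add(self, entry: MemoryEntry) -> None:
        self._store[entry.subtask_type].append(entry)

    def retrieve(self, subtask_type: str, query: str, k: int = 1) -> list[MemoryEntry]:
        # Only entries recorded for the same subtask type are candidates, so a
        # similar issue description cannot pull in a lesson from another stage.
        candidates = self._store.get(subtask_type, [])
        ranked = sorted(candidates, key=lambda e: jaccard(e.context, query), reverse=True)
        return ranked[:k]

    def update(self, subtask_type: str, context: str, lesson: str) -> None:
        # Overwrite the lesson if this exact context was seen before, else add it.
        for entry in self._store.get(subtask_type, []):
            if entry.context == context:
                entry.lesson = lesson
                return
        self.add(MemoryEntry(subtask_type, context, lesson))


memory = SubtaskMemory()
memory.add(MemoryEntry("localize", "TypeError in date parser",
                       "search for the parsing helper before reading tests"))
memory.add(MemoryEntry("patch", "TypeError in date parser",
                       "guard against None before calling strptime"))

hits = memory.retrieve("patch", "TypeError raised by date parser")
print(hits[0].lesson)
```

Both stored entries share the same surface description, yet a query issued during the hypothetical `"patch"` stage only sees the patch-stage lesson. An instance-level store, which keeps one record per episode, would have no way to make that distinction.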

1 Citation

0 Influential

2.5 Altmetric

13.5 Score
