2603.20667v1 Mar 21, 2026 cs.SE

REVERE: Reflective Evolving Research Engineer for Scientific Workflows

Manasi S. Patwardhan

Aniketh Garikaparthi

Balaji Dinesh Gangireddi

A. Cohan

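Existing prompt-optimization techniques update behavior mainly from local signals and often overlook broader, recurring patterns across tasks, which tends to hurt generalization. They also depend on full-prompt rewrites or unstructured merges, which can cause knowledge loss. These limitations are more pronounced in research-coding workflows, which involve heterogeneous repositories, underspecified environments, and weak feedback, and where reproducing results from public codebases is an established evaluation regime. This paper introduces Reflective Evolving Research Engineer (REVERE), a framework that learns continuously from a Global Training Context, recognizes recurring failure modes in execution trajectories across repositories, distills them into reusable heuristics, and makes targeted edits to three configurable fields: the system prompt, a task-prompt template, and a cumulative cheatsheet. With this reflective optimization framework, REVERE improves on prior state-of-the-art expert-crafted instructions for research-coding tasks by 4.50% on SUPER, 3.51% on ResearchCodeBench, and 4.89% on ScienceAgentBench, each measured on the benchmark's own metric. These results show that agents with mechanisms for continual learning and global memory consolidation can meaningfully evolve their capabilities over time.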

Original Abstract

Existing prompt-optimization techniques rely on local signals to update behavior, often neglecting broader and recurring patterns across tasks, leading to poor generalization; they further rely on full-prompt rewrites or unstructured merges, resulting in knowledge loss. These limitations are magnified in research-coding workflows, which involve heterogeneous repositories, underspecified environments, and weak feedback, where reproducing results from public codebases is an established evaluation regime. We introduce Reflective Evolving Research Engineer (REVERE), a framework that continuously learns from Global Training Context, recognizes recurring failure modes in cross-repository execution trajectories, distills them into reusable heuristics, and performs targeted edits across three configurable fields: the system prompt, a task-prompt template, and a cumulative cheatsheet. REVERE, via this reflective optimization framework, improves performance over prior state-of-the-art expert-crafted instructions on research coding tasks by 4.50% on SUPER, 3.51% on ResearchCodeBench, and 4.89% on ScienceAgentBench across their respective metrics. These results demonstrate that agents equipped with mechanisms for continual learning and global memory consolidation can meaningfully evolve their capabilities over time.
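
The abstract contrasts targeted edits to three fields with full-prompt rewrites. The sketch below is one hypothetical way to picture that idea in Python; it is not the authors' implementation. The names AgentConfig, Edit, and apply_edit, the "append" and "replace" operations, and the deduplication of cheatsheet entries are all assumptions made for illustration. Each edit touches exactly one field and leaves the other two unchanged, which is how a scheme of this kind could avoid losing previously accumulated knowledge.

```python
from dataclasses import dataclass, replace
from typing import Tuple

# The three configurable fields named in the abstract.
FIELDS = ("system_prompt", "task_template", "cheatsheet")


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical container for the three editable fields."""
    system_prompt: str
    task_template: str
    cheatsheet: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Edit:
    """Hypothetical targeted edit: one field, one small change."""
    target: str      # one of FIELDS
    op: str          # "append" or "replace" (assumed operations)
    text: str        # new text to add or substitute in
    old: str = ""    # substring to replace when op == "replace"


def apply_edit(cfg: AgentConfig, edit: Edit) -> AgentConfig:
    """Return a new config with only the targeted field changed."""
    if edit.target not in FIELDS:
        raise ValueError(f"unknown field: {edit.target}")

    if edit.target == "cheatsheet":
        # Assumption: the cheatsheet is cumulative, so it only grows.
        if edit.op != "append":
            raise ValueError("cheatsheet only supports 'append'")
        if edit.text in cfg.cheatsheet:
            return cfg  # skip duplicate heuristics
        return replace(cfg, cheatsheet=cfg.cheatsheet + (edit.text,))

    current = getattr(cfg, edit.target)
    if edit.op == "append":
        updated = current + "\n" + edit.text
    elif edit.op == "replace":
        if edit.old not in current:
            raise ValueError("text to replace not found")
        updated = current.replace(edit.old, edit.text, 1)
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return replace(cfg, **{edit.target: updated})


if __name__ == "__main__":
    cfg = AgentConfig("You are a research engineer.", "Reproduce: {task}")
    tip = "Pin dependency versions before running."
    cfg = apply_edit(cfg, Edit("cheatsheet", "append", tip))
    print(cfg.cheatsheet)
```

Because the config is immutable and every edit returns a new object, earlier versions stay available for comparison or rollback, a design choice of this sketch rather than something the abstract states.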
