2602.14697v1 Feb 16, 2026 cs.AI

Evolutionary System Prompt Learning can Facilitate Reinforcement Learning for LLMs

Bradly C. Stadie (citations: 2,378; h-index: 13)

Ryan Chen (citations: 104; h-index: 5)

Lunjun Zhang (citations: 835; h-index: 8)

Abstract

Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI. Large language models (LLMs) today primarily self-improve via two mechanisms: self-reflection for context updates, and reinforcement learning (RL) for weight updates. In this work, we propose Evolutionary System Prompt Learning (E-SPL), a method for jointly improving model contexts and model weights. In each RL iteration, E-SPL selects multiple system prompts and runs rollouts with each in parallel. It applies RL updates to model weights conditioned on each system prompt, and evolutionary updates to the system prompt population via LLM-driven mutation and crossover. Each system prompt has a TrueSkill rating for evolutionary selection, updated from relative performance within each RL iteration batch. E-SPL encourages a natural division between declarative knowledge encoded in prompts and procedural knowledge encoded in weights, resulting in improved performance across reasoning and agentic tasks. For instance, in an easy-to-hard (AIME $\rightarrow$ BeyondAIME) generalization setting, E-SPL improves the RL success rate from 38.8% to 45.1%, while also outperforming reflective prompt evolution (40.0%). Overall, our results show that coupling reinforcement learning with system prompt evolution yields consistent gains in sample efficiency and generalization. Code: https://github.com/LunjunZhang/E-SPL
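The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PromptEntry`, `elo_update`, and `espl_iteration` are hypothetical, the top-k selection rule and the population cap are assumptions, and a simple pairwise Elo-style update stands in for TrueSkill. The rollout, mutation, and crossover steps are passed in as plain callables, and the RL weight update is only marked by a comment.

```python
from dataclasses import dataclass


@dataclass
class PromptEntry:
    text: str
    rating: float = 1000.0  # Elo-style score; a simple stand-in for TrueSkill


def elo_update(entries, scores, k=16.0):
    """Pairwise Elo-style update from relative scores within one batch."""
    deltas = [0.0] * len(entries)
    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            gap = entries[j].rating - entries[i].rating
            expected_i = 1.0 / (1.0 + 10 ** (gap / 400.0))
            if scores[i] > scores[j]:
                outcome_i = 1.0
            elif scores[i] < scores[j]:
                outcome_i = 0.0
            else:
                outcome_i = 0.5
            deltas[i] += k * (outcome_i - expected_i)
            deltas[j] -= k * (outcome_i - expected_i)
    for entry, delta in zip(entries, deltas):
        entry.rating += delta


def espl_iteration(population, rollout_fn, mutate_fn, crossover_fn,
                   num_selected=2, rollouts_per_prompt=4, max_size=8):
    """One iteration: select prompts, roll out, re-rate, evolve the population."""
    # Select the highest-rated prompts (this selection rule is an assumption).
    selected = sorted(population, key=lambda e: e.rating, reverse=True)[:num_selected]
    # Mean reward per selected prompt; real rollouts would run in parallel.
    scores = [
        sum(rollout_fn(e.text) for _ in range(rollouts_per_prompt)) / rollouts_per_prompt
        for e in selected
    ]
    # The RL weight update conditioned on each prompt would go here (omitted).
    elo_update(selected, scores)
    # Evolve: mutate the best prompt of the batch, cross over the top two selected.
    best = max(range(len(selected)), key=lambda i: scores[i])
    children = [PromptEntry(mutate_fn(selected[best].text))]
    if len(selected) >= 2:
        children.append(PromptEntry(crossover_fn(selected[0].text, selected[1].text)))
    population.extend(children)
    # Keep the population bounded by dropping the lowest-rated entries.
    population.sort(key=lambda e: e.rating, reverse=True)
    del population[max_size:]
    return scores
```

In a real system, `rollout_fn` would run the policy under the given system prompt and return a task reward, while `mutate_fn` and `crossover_fn` would call an LLM to rewrite or merge prompts. Here they are left as parameters so the control flow can be exercised with simple stubs.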

