2603.28386v1 Mar 30, 2026 cs.AI

COvolve: Adversarial Co-Evolution of Large-Language-Model-Generated Policies and Environments via Two-Player Zero-Sum Game

Pedro Zuidberg Dos Martires

Citations: 383

h-index: 12

Rishi Hazra

Doctoral student, AASS Research Centre, Örebro University (WASP graduate school)

Citations: 211

h-index: 8

Alkis Sygkounas

Citations: 36

h-index: 3

Amy Loutfi

Citations: 49

h-index: 3

A. Persson

Citations: 115

h-index: 5

Abstract

A central challenge in building continually improving agents is that training environments are typically static or manually constructed. This restricts continual learning and generalization beyond the training distribution. We address this with COvolve, a co-evolutionary framework that leverages large language models (LLMs) to generate both environments and agent policies, expressed as executable Python code. We model the interaction between environment and policy designers as a two-player zero-sum game, ensuring adversarial co-evolution in which environments expose policy weaknesses and policies adapt in response. This process induces an automated curriculum in which environments and policies co-evolve toward increasing complexity. To guarantee robustness and prevent forgetting as the curriculum progresses, we compute the mixed-strategy Nash equilibrium (MSNE) of the zero-sum game, thereby yielding a meta-policy. This MSNE meta-policy ensures that the agent does not forget to solve previously seen environments while learning to solve previously unseen ones. Experiments in urban driving, symbolic maze-solving, and geometric navigation showcase that COvolve produces progressively more complex environments. Our results demonstrate the potential of LLM-driven co-evolution to achieve open-ended learning without predefined task distributions or manual intervention.
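The abstract does not say how the MSNE is computed, so the following is only a minimal sketch of the idea, not the authors' implementation. It assumes a small, hypothetical payoff matrix in which entry `payoff[i][j]` is the return of policy `i` on environment `j`, and it approximates the equilibrium with fictitious play, a classical technique swapped in here for illustration. The function name `approx_msne` and the `iters` parameter are likewise assumptions.

```python
from typing import List, Tuple


def approx_msne(
    payoff: List[List[float]], iters: int = 20000
) -> Tuple[List[float], List[float], float]:
    """Approximate the MSNE of a zero-sum matrix game by fictitious play.

    payoff[i][j] is the (hypothetical) return of policy i on environment j.
    The policy designer (rows) maximizes it; the environment designer
    (columns) minimizes it.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows  # how often each policy has been played
    col_counts = [0] * n_cols  # how often each environment has been played
    # Cumulative payoff of each pure strategy against the opponent's history.
    row_scores = [0.0] * n_rows
    col_scores = [0.0] * n_cols

    i, j = 0, 0  # arbitrary initial pure strategies
    for _ in range(iters):
        row_counts[i] += 1
        col_counts[j] += 1
        for r in range(n_rows):
            row_scores[r] += payoff[r][j]
        for c in range(n_cols):
            col_scores[c] += payoff[i][c]
        # Each player best-responds to the opponent's empirical play so far.
        i = max(range(n_rows), key=lambda r: row_scores[r])
        j = min(range(n_cols), key=lambda c: col_scores[c])

    # Empirical frequencies approximate the equilibrium mixed strategies.
    policy_mix = [k / iters for k in row_counts]
    env_mix = [k / iters for k in col_counts]
    value = sum(
        policy_mix[r] * env_mix[c] * payoff[r][c]
        for r in range(n_rows)
        for c in range(n_cols)
    )
    return policy_mix, env_mix, value


if __name__ == "__main__":
    # Hypothetical example: each policy solves exactly one of two environments.
    payoff = [[1.0, 0.0],
              [0.0, 1.0]]
    policy_mix, env_mix, value = approx_msne(payoff)
    print("meta-policy weights:", policy_mix)
    print("environment weights:", env_mix)
    print("approximate game value:", value)
```

In this toy game neither policy alone is robust, so the meta-policy weights should land near an even mix of the two. Fictitious play converges slowly; for larger payoff matrices an exact linear-programming solver would likely be the better choice.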

0 Citations

0 Influential

6 Altmetric

30.0 Score
