2602.16928v2 Feb 18, 2026 cs.GT

Discovering Multiagent Learning Algorithms with Large Language Models

Marc Lanctot (Citations: 28, h-index: 3)

John Schultz (Citations: 47, h-index: 3)

Daniel Hennes (Citations: 34, h-index: 2)

Zun Li (Citations: 65, h-index: 5)

Original Abstract

Much of the advancement of Multi-Agent Reinforcement Learning (MARL) in imperfect-information games has historically depended on manual, iterative refinement of baselines. While foundational families like Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid theoretical ground, the design of their most effective variants often relies on human intuition to navigate a vast algorithmic design space. In this work, we propose the use of AlphaEvolve, an evolutionary coding agent powered by large language models, to automatically discover new multiagent learning algorithms. We demonstrate the generality of this framework by evolving novel variants for two distinct paradigms of game-theoretic learning. First, in the domain of iterative regret minimization, we evolve the logic governing regret accumulation and policy derivation, discovering a new algorithm, Volatility-Adaptive Discounted (VAD-)CFR. VAD-CFR employs novel, non-intuitive mechanisms (volatility-sensitive discounting, consistency-enforced optimism, and a hard warm-start policy accumulation schedule) to outperform state-of-the-art baselines like Discounted Predictive CFR+. Second, in the regime of population-based training algorithms, we evolve training-time and evaluation-time meta-strategy solvers for PSRO, discovering a new variant, Smoothed Hybrid Optimistic Regret (SHOR-)PSRO. SHOR-PSRO introduces a hybrid meta-solver that linearly blends Optimistic Regret Matching with a smoothed, temperature-controlled distribution over best pure strategies. By dynamically annealing this blending factor and diversity bonuses during training, the algorithm automates the transition from population diversity to rigorous equilibrium finding, yielding superior empirical convergence compared to standard static meta-solvers.
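To make the hybrid meta-solver idea more concrete, the following is a minimal sketch of a linear blend between a regret-matching distribution and a temperature-controlled softmax over pure-strategy payoffs. It is not the SHOR-PSRO implementation from the paper. The function names, the way optimism is modelled (adding the most recent instantaneous regret to the cumulative regret as a prediction), the linear annealing schedule and its direction, and all numeric values are assumptions chosen only for illustration.

```python
import math


def regret_matching(regrets):
    """Play each strategy in proportion to its positive regret (uniform if none)."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    n = len(regrets)
    if total <= 0.0:
        return [1.0 / n] * n
    return [p / total for p in positive]


def softmax(values, temperature):
    """Softmax with a temperature; a low temperature concentrates mass on the best entry."""
    best = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - best) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def linear_anneal(start, end, step, total_steps):
    """Hypothetical schedule: move linearly from `start` to `end` over `total_steps`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def hybrid_meta_strategy(cum_regrets, last_regrets, payoffs, blend, temperature):
    """Linearly blend optimistic regret matching with a softmax over payoffs.

    Assumption: optimism is modelled by adding the latest instantaneous regret
    to the cumulative regret before applying regret matching.
    """
    optimistic = [c + p for c, p in zip(cum_regrets, last_regrets)]
    rm = regret_matching(optimistic)
    sm = softmax(payoffs, temperature)
    return [(1.0 - blend) * a + blend * b for a, b in zip(rm, sm)]


# Illustrative values only; they are not taken from the paper.
cum_regrets = [1.0, -0.5, 2.0]
last_regrets = [0.0, 0.0, 1.0]
payoffs = [0.1, 0.3, 0.2]

for step in (0, 50, 100):
    blend = linear_anneal(0.5, 0.0, step, 100)
    sigma = hybrid_meta_strategy(cum_regrets, last_regrets, payoffs, blend, temperature=0.1)
    print(step, [round(x, 3) for x in sigma])
```

With `blend` at 0 the output reduces to pure (optimistic) regret matching, and with `blend` at 1 it reduces to the softmax over payoffs. The abstract states that the blending factor is annealed during training but does not give the schedule, so the schedule used here is a placeholder.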
