2604.20987v1 Apr 22, 2026 cs.AI

Co-Evolution of LLM-Based Decision Agents and Skill Banks for Long-Horizon Tasks

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Tyler Marques

Citations: 8

h-index: 2

Matthew Lyle Olson

Citations: 8

h-index: 2

Alexander Duffy

Citations: 12

h-index: 3

Zongxia Li

Citations: 912

h-index: 7

Xiyang Wu

Citations: 322

h-index: 8

Tianyi Zhou

Citations: 8

h-index: 2

Guangyao Shi

Citations: 269

h-index: 5

Dinesh Manocha

Citations: 16

h-index: 3

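Long-horizon interactive environments are a useful tool for evaluating how well agents use skills. They demand multi-step reasoning, chaining skills across many timesteps, and robust decision-making under delayed rewards and partial observability, and games are a well-suited instance. Large language models (LLMs) are a promising alternative as game-playing agents, but they often struggle to make consistent long-horizon decisions because they lack a mechanism to discover, retain, and reuse structured skills across episodes. This work proposes COSPLAY, a co-evolution framework in which an LLM-based decision agent retrieves skills from a learnable skill bank to guide its actions, while an agent-managed skill pipeline analyzes the agent's unlabeled rollouts to discover reusable skills and build the skill bank. The framework improves the decision agent's skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills and maintains a contract for each one. In experiments across six game environments, COSPLAY with an 8B-parameter base model achieves an average reward improvement of over 25.1% relative to four frontier LLM baselines on single-player game benchmarks, and remains competitive on multi-player social reasoning games.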

Original Abstract

Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision-making under delayed rewards and partial observability. Games are a good example of such a testbed. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision-making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form a skill bank. Our framework trains the decision agent toward better skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1% average reward improvement against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.
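The abstract does not specify how skills are stored or retrieved, so the following is only a minimal sketch of the general idea of a skill bank that a decision agent queries before acting. Everything in it is an assumption made for illustration: the `Skill` and `SkillBank` names, the free-text `contract` field, and especially the token-overlap (Jaccard) ranking, which stands in for whatever learned retrieval COSPLAY actually uses.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical skill record; the field names are illustrative only."""
    name: str
    description: str
    contract: str  # assumed: free-text note on when the skill applies and what it achieves


def _tokens(text: str) -> set[str]:
    # Lowercased whitespace tokens; a crude stand-in for a learned representation.
    return set(text.lower().split())


class SkillBank:
    """Toy in-memory skill store keyed by skill name."""

    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def __len__(self) -> int:
        return len(self._skills)

    def upsert(self, skill: Skill) -> None:
        # Adding a skill under an existing name replaces it, which models a
        # "refine and update" step in the simplest possible way.
        self._skills[skill.name] = skill

    def retrieve(self, observation: str, k: int = 1) -> list[Skill]:
        # Rank skills by Jaccard overlap between observation and description tokens.
        obs = _tokens(observation)

        def score(skill: Skill) -> float:
            desc = _tokens(skill.description)
            union = obs | desc
            return len(obs & desc) / len(union) if union else 0.0

        ranked = sorted(self._skills.values(), key=score, reverse=True)
        return ranked[:k]


# Usage: two hand-written skills, then a query from a made-up observation.
bank = SkillBank()
bank.upsert(Skill("open_door", "use key to open locked door",
                  "requires a key; afterwards the door is open"))
bank.upsert(Skill("collect_coin", "move to coin and collect it",
                  "requires a visible coin; afterwards the score increases"))

top = bank.retrieve("locked door ahead and a key on the floor", k=1)
```

In a full system the retrieved skill and its contract would be placed in the decision agent's prompt, and a separate process would call `upsert` with skills mined from rollouts. Both of those steps are left out here because the abstract gives no detail on them.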

4 Citations

0 Influential

4 Altmetric

24.0 Score
