2602.14857v1 Feb 16, 2026 cs.AI

World Models for Policy Refinement in StarCraft II

Yixin Zhang (Citations: 6, h-index: 1)
Yiming Rong (Citations: 7, h-index: 1)
Ziyi Wang (Citations: 12, h-index: 2)
Jinling Jiang (Citations: 5, h-index: 1)
Shuang Xu (Citations: 15, h-index: 2)
Shiyu Zhou (Citations: 994, h-index: 17)
Bo Xu (Citations: 7, h-index: 1)
Hao Wu (Citations: 66, h-index: 2)
Haoxi Wang (Citations: 25, h-index: 1)

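Large language models (LLMs) have recently demonstrated strong reasoning and generalization abilities and are being used as decision-making policies in complex environments. StarCraft II (SC2), with its vast state-action space and partial observability, is a demanding environment for testing them. However, existing LLM-based SC2 agents focus mainly on improving the policy itself and overlook integrating a learnable, action-conditioned transition model into the decision loop. To fill this gap, we propose StarWM, the first world model for SC2, which predicts future observations under partial observability. To help learn SC2's hybrid dynamics, we introduce a structured textual representation that decomposes observations into five semantic modules, and we build SC2-Dynamics-50k, the first instruction-tuning dataset for SC2 dynamics prediction. We also develop a multi-dimensional offline evaluation framework for predicted structured observations. In offline evaluation, StarWM shows substantial gains over zero-shot baselines, including improvements of roughly 60% in resource prediction accuracy and in self-side macro-situation consistency. Finally, we propose StarWM-Agent, a world-model-augmented decision system that integrates StarWM into a Generate-Simulate-Refine decision loop to perform foresight-driven policy refinement. Online evaluation against SC2's built-in AI shows consistent improvements, with win-rate gains of 30%, 15%, and 30% against the Hard (LV5), Harder (LV6), and VeryHard (LV7) difficulty levels, respectively, together with improved macro-management stability and tactical risk assessment.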

Original Abstract

Large Language Models (LLMs) have recently shown strong reasoning and generalization capabilities, motivating their use as decision-making policies in complex environments. StarCraft II (SC2), with its massive state-action space and partial observability, is a challenging testbed. However, existing LLM-based SC2 agents primarily focus on improving the policy itself and overlook integrating a learnable, action-conditioned transition model into the decision loop. To bridge this gap, we propose StarWM, the first world model for SC2 that predicts future observations under partial observability. To facilitate learning SC2's hybrid dynamics, we introduce a structured textual representation that factorizes observations into five semantic modules, and construct SC2-Dynamics-50k, the first instruction-tuning dataset for SC2 dynamics prediction. We further develop a multi-dimensional offline evaluation framework for predicted structured observations. Offline results show StarWM's substantial gains over zero-shot baselines, including nearly 60% improvements in resource prediction accuracy and self-side macro-situation consistency. Finally, we propose StarWM-Agent, a world-model-augmented decision system that integrates StarWM into a Generate-Simulate-Refine decision loop for foresight-driven policy refinement. Online evaluation against SC2's built-in AI demonstrates consistent improvements, yielding win-rate gains of 30%, 15%, and 30% against Hard (LV5), Harder (LV6), and VeryHard (LV7), respectively, alongside improved macro-management stability and tactical risk assessment.
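
The abstract names the Generate-Simulate-Refine loop but gives no implementation details, so the following is only a minimal sketch of how such a loop could be wired together. Every name here (`generate_simulate_refine`, `propose`, `world_model`, `refine`, and the toy stubs) is a hypothetical placeholder, not the paper's API, and the toy observations are plain strings rather than the five-module structured representation the paper describes.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases: the paper uses structured text, modeled here as plain strings.
Observation = str
Action = str


@dataclass
class RefineResult:
    draft: Action            # action proposed before consulting the world model
    predicted: Observation   # world model's predicted future observation
    final: Action            # action after refinement


def generate_simulate_refine(
    obs: Observation,
    propose: Callable[[Observation], Action],
    world_model: Callable[[Observation, Action], Observation],
    refine: Callable[[Observation, Action, Observation], Action],
) -> RefineResult:
    """One pass of a Generate-Simulate-Refine loop (illustrative only)."""
    draft = propose(obs)                   # Generate: policy drafts an action
    predicted = world_model(obs, draft)    # Simulate: action-conditioned prediction
    final = refine(obs, draft, predicted)  # Refine: revise the draft using the prediction
    return RefineResult(draft, predicted, final)


# Toy stand-ins (assumptions) for the LLM policy and the learned world model.
def toy_propose(obs: Observation) -> Action:
    return "train 4 units"


def toy_world_model(obs: Observation, action: Action) -> Observation:
    # Pretend the drafted action would overspend the available resources.
    return "resources: -50" if action == "train 4 units" else "resources: 50"


def toy_refine(obs: Observation, draft: Action, predicted: Observation) -> Action:
    # Scale the plan back when the predicted resources go negative.
    return "train 2 units" if "-" in predicted else draft


result = generate_simulate_refine(
    "resources: 150", toy_propose, toy_world_model, toy_refine
)
```

In this sketch the policy and the world model are passed in as plain callables, so the loop itself stays independent of whichever model produces the draft, the prediction, or the refined action.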

0 Citations

0 Influential

8.5 Altmetric

42.5 Score
