2605.00347v1 May 01, 2026 cs.LG

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Ziran Yang (Citations: 123, h-index: 3)

Seth Karten (Citations: 62, h-index: 4)

Chi Jin (Citations: 357, h-index: 5)

Gabriel Sarch (Citations: 332, h-index: 8)

Chengshuai Shi (Citations: 580, h-index: 11)

Wenzhe Li (Citations: 198, h-index: 5)

Xin Liang (Citations: 6, h-index: 2)

Yizhou Lu (Citations: 45, h-index: 2)

Wenjia Yang (Citations: 3, h-index: 1)

Ruirong Feng (Citations: 3, h-index: 1)

Zihan Ding (Citations: 406, h-index: 7)

Danqi Chen (Citations: 116, h-index: 1)

Karthik Narasimhan (Citations: 287, h-index: 3)

Original Abstract

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20-30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, which achieves substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.
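
The abstract contrasts a turn-level critic with critic-free methods but does not give the exact formulation. As a rough illustration only, the sketch below shows one common way such a contrast can be set up: generalized advantage estimation (GAE) computed per turn from critic values, next to a GRPO-style group-normalized episode return. The function names, the one-reward-per-turn setup, and the default `gamma` and `lam` values are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """Per-turn advantages via generalized advantage estimation.

    rewards[t] is the reward received after turn t, and values[t] is a
    (hypothetical) critic's estimate of the return from turn t onward.
    The episode is assumed to end after the last turn, so the bootstrap
    value beyond it is 0.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # value after the final turn (terminal)
    running = 0.0     # discounted sum of TD errors seen so far
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def group_relative_advantages(episode_returns, eps=1e-8):
    """Critic-free, GRPO-style signal: one scalar per rollout, obtained by
    normalizing episode returns within a group of rollouts."""
    mu = mean(episode_returns)
    sigma = pstdev(episode_returns)
    return [(r - mu) / (sigma + eps) for r in episode_returns]
```

In this sketch the critic-based estimate assigns a separate advantage to every turn, whereas the group-relative estimate gives every turn in a rollout the same scalar. With episodes of 100+ turns, that difference in credit assignment is one plausible reason a turn-level critic could help, though the abstract itself only reports the improvement in stability and sample efficiency.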
