2603.12145v1 Mar 12, 2026 cs.LG

Automatic Generation of High-Performance RL Environments

Seth Karten

Rahul Dev Appapogu

Chi Jin

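Implementing complex reinforcement learning (RL) environments at high performance has traditionally required months of specialized engineering. We present a reusable recipe, consisting of a generic prompt template, hierarchical verification, and iterative agent-assisted repair, that produces semantically equivalent high-performance environments for under $10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation, where no prior performance implementation exists: EmuRust (a 1.5x PPO speedup from Rust parallelism for a Game Boy emulator) and PokeJAX, a GPU-parallel Pokemon battle simulator (500M SPS with random actions, 15.2M SPS under PPO; 22,320x over the TypeScript reference). Translation verified against existing performance implementations: throughput parity with MJX (1.04x) and 5x the throughput of Brax at matched GPU batch sizes (HalfCheetah JAX), plus a 42x PPO speedup (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine, built from a web-extracted specification (717K SPS with random actions, 153K SPS under PPO; 6.6x over the Python reference). At 200M parameters, environment overhead falls below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments, and cross-backend policy transfer confirms zero sim-to-sim gap for all five. TCGJax, generated from a private reference that is absent from public repositories, serves as a control against possible contamination from agent pretraining data. The paper includes representative prompts, the verification methodology, and complete results, in enough detail that a coding agent could reproduce the translations directly from the manuscript.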

Original Abstract

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe - a generic prompt template, hierarchical verification, and iterative agent-assisted repair - that produces semantically equivalent high-performance environments for <$10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator (500M SPS random action, 15.2M SPS PPO; 22,320x over the TypeScript reference). Translation verified against existing performance implementations: throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes (HalfCheetah JAX); 42x PPO (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS random action, 153K SPS PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments; cross-backend policy transfer confirms zero sim-to-sim gap for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns. The paper contains sufficient detail - including representative prompts, verification methodology, and complete results - that a coding agent could reproduce the translations directly from the manuscript.
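
The abstract names rollout tests as one layer of the hierarchical verification but does not show one. As a rough illustration only, the sketch below drives two implementations of the same toy environment with an identical seeded action stream and reports the first step at which their outputs differ. Everything here is an assumption made for illustration and is not taken from the paper: the toy dynamics, the names `reference_step`, `fast_step`, and `first_divergence`, and the choice of exact equality as the comparison.

```python
import random
from typing import Callable, Optional, Tuple

State = Tuple[int, int]                 # (position, velocity) of a toy 1-D task
StepOut = Tuple[State, float, bool]     # (next_state, reward, done)
StepFn = Callable[[State, int], StepOut]

START: State = (0, 0)


def reference_step(state: State, action: int) -> StepOut:
    """Hypothetical 'reference' implementation, written for readability."""
    pos, vel = state
    # action in {0, 1, 2} maps to acceleration in {-1, 0, +1}
    vel = max(-3, min(3, vel + (action - 1)))
    pos = pos + vel
    done = abs(pos) >= 10
    reward = 1.0 if pos >= 10 else 0.0
    return (pos, vel), reward, done


def fast_step(state: State, action: int) -> StepOut:
    """Hypothetical 'translated' implementation using branch-free arithmetic."""
    pos, vel = state
    vel = vel + action - 1
    # Clamp to [-3, 3] without branching (bools act as 0/1 in arithmetic).
    vel = vel - (vel > 3) * (vel - 3) - (vel < -3) * (vel + 3)
    pos = pos + vel
    return (pos, vel), float(pos >= 10), abs(pos) >= 10


def first_divergence(step_a: StepFn, step_b: StepFn,
                     seed: int, horizon: int = 200) -> Optional[int]:
    """Return the first step index where the two outputs differ, else None."""
    rng = random.Random(seed)           # same action stream for both backends
    state_a = state_b = START
    for t in range(horizon):
        action = rng.randrange(3)
        out_a = step_a(state_a, action)
        out_b = step_b(state_b, action)
        if out_a != out_b:
            return t
        state_a, state_b = out_a[0], out_b[0]
        if out_a[2]:                    # episode ended: reset both copies
            state_a = state_b = START
    return None


if __name__ == "__main__":
    mismatches = {s: first_divergence(reference_step, fast_step, s)
                  for s in range(20)}
    print("all seeds agree:", all(v is None for v in mismatches.values()))
```

Returning the step index rather than a plain pass/fail is a deliberate choice in this sketch: a first-divergence location is the kind of signal an iterative repair loop could act on. Floating-point backends would likely need a tolerance-based comparison instead of exact equality.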
