2602.02488v1 Feb 02, 2026 cs.LG

RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Ke Shen

Citations: 446

h-index: 8

Yinjie Wang

Citations: 222

h-index: 6

Tian Xie

Citations: 252

h-index: 5

Mengdi Wang

Citations: 745

h-index: 13

Ling Yang

Citations: 538

h-index: 8

Original Abstract

We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenario. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also find that optimized reward-model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL
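The abstract says the policy is trained with integrated feedback from step-wise and outcome signals, but it does not give the formula. The sketch below shows one plausible way such an integration could look, assuming a simple weighted sum. The function name `integrate_rewards`, the weight `lam`, and the additive form are all hypothetical illustrations, not the method from the paper or the Open-AgentRL code.

```python
from typing import List


def integrate_rewards(step_scores: List[float], outcome: float, lam: float = 1.0) -> List[float]:
    """Blend per-step scores with a trajectory-level outcome reward.

    Assumption (not from the paper): each step's training signal is the
    final outcome reward plus `lam` times that step's reward-model score.
    """
    return [outcome + lam * score for score in step_scores]


# Hypothetical two-step trajectory: the reward model liked step 1 (+1.0),
# disliked step 2 (-1.0), and the task ultimately succeeded (outcome = 1.0).
per_step = integrate_rewards([1.0, -1.0], outcome=1.0, lam=0.5)
print(per_step)
```

Under this assumed form, every step inherits the outcome signal, while the step-wise scores separate helpful steps from unhelpful ones within the same trajectory.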

8 Citations

0 Influential

55.182861487396 Altmetric

283.9 Score
