2604.18578v1 Apr 20, 2026 cs.LG

Bounded Ratio Reinforcement Learning

Yu Ao, Le Chen, Bruce D. Lee, A. Wahd, Aline Czarnobai, Philipp Furnstahl, Bernhard Scholkopf, Andreas Krause

Abstract

Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.
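For readers unfamiliar with the baselines named in the abstract, the following is a minimal sketch of the standard PPO clipped surrogate loss and of GRPO-style group-relative advantage normalization. It is not the BPO or GBPO loss, which the abstract does not specify. The function names and the defaults `eps=0.2` and `tol=1e-8` are illustrative assumptions, not values taken from the paper.

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate, returned as a loss to minimize.

    logp_new / logp_old: per-sample log-probabilities of the taken actions
    under the current and the data-collecting policy.
    eps: clipping range (0.2 is a common default, assumed here).
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Pessimistic (minimum) of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


def group_relative_advantages(rewards, tol=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + tol) for r in rewards]
```

With a positive advantage, the clipped term stops rewarding ratios above `1 + eps`; with a negative advantage, the minimum keeps the unclipped term, so large ratios are still penalized. This asymmetric, heuristic behavior is the part of PPO that the abstract contrasts with trust region foundations.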

