2605.06500v1 May 07, 2026 cs.LG

Operator-Guided Invariance Learning for Continuous Reinforcement Learning

Zuyuan Zhang

Citations: 130

h-index: 7

Tian Lan

Citations: 63

h-index: 6

F. Yu

Citations: 15

h-index: 2

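Reinforcement learning (RL) with continuous time and continuous state/action spaces is often data-hungry and brittle under nuisance variability and shift, which motivates methods that exploit value-preserving structure to stabilize and improve learning. Most existing approaches address special cases such as prescribed symmetries or exact equivariance; there is little work on discovering more general structures, in which nonlinear operators map between continuous state/action systems with isomorphic value functions. This paper proposes **VPSD-RL (Value-Preserving Structure Discovery for Reinforcement Learning)**. VPSD-RL models continuous RL as a controlled diffusion process with value-preserving mappings defined through Lie-group actions and the associated pullback operators. The authors show that a value-preserving structure exists exactly when pulling back the value function and pushing forward actions commutes with the controlled generator and the reward function. They also show that approximate value-preserving structures with rigorous guarantees can be found when the Hamilton-Jacobi-Bellman mismatch is small. The framework discovers exact and approximate value-preserving structures by searching over the associated Lie-group operators. VPSD-RL fits differentiable drift, diffusion, and reward models; learns infinitesimal generators by minimizing determining-equation residuals; exponentiates them with ODE flows to obtain finite transformations; and integrates them into continuous RL through transition augmentation and transformation-consistency regularization. Bounded generator/reward mismatch implies quantitative stability of the optimal value function along approximate orbits, with sensitivity governed by the effective horizon. The authors also report improved data efficiency and robustness on continuous-control benchmarks.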
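
The commutation condition can be made concrete with a short worked sketch. The notation below is an assumption chosen for illustration and is not taken from the paper: a state map Φ, an action map Ψ(x, ·) that is a bijection on actions for each x, a discount rate ρ > 0, and drift b, diffusion σ, and reward r. It shows only the "if" direction, in the standard textbook form.

```latex
% Assumed notation (not the paper's): \Phi state map, \Psi action map,
% \rho discount rate, b drift, \sigma diffusion, r reward.
\begin{align*}
(\mathcal{L}^{a} f)(x) &= b(x,a)^{\top}\nabla f(x)
  + \tfrac12 \operatorname{tr}\!\big(\sigma\sigma^{\top}(x,a)\,\nabla^{2} f(x)\big)
  && \text{(controlled generator)} \\
\rho\,V(x) &= \sup_{a}\big\{\, r(x,a) + (\mathcal{L}^{a}V)(x) \,\big\}
  && \text{(HJB equation)} \\
\mathcal{L}^{a}(f\circ\Phi)(x) &= (\mathcal{L}^{\Psi(x,a)} f)(\Phi(x)),
  \qquad r(\Phi(x),\Psi(x,a)) = r(x,a)
  && \text{(commutation)} \\
\rho\,(V\circ\Phi)(x)
  &= \sup_{a'}\big\{\, r(\Phi(x),a') + (\mathcal{L}^{a'}V)(\Phi(x)) \,\big\} \\
  &= \sup_{a}\big\{\, r(x,a) + \mathcal{L}^{a}(V\circ\Phi)(x) \,\big\}
  && \text{(substitute } a' = \Psi(x,a)\text{)}
\end{align*}
```

Under these assumptions, the pulled-back function V∘Φ solves the same HJB equation as V, so if that equation has a unique solution in the class considered, V(Φ(x)) = V(x) and the transformation is value-preserving.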

Original Abstract

Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve learning. Most existing approaches focus on special cases, such as prescribed symmetries and exact equivariance, without addressing how to discover more general structures that require nonlinear operators to transform and map between continuous state/action systems with isomorphic value functions. We propose **VPSD-RL** (Value-Preserving Structure Discovery for Reinforcement Learning). It models continuous RL as a controlled diffusion with value-preserving mappings defined through Lie-group actions and associated pullback operators. We show that a value-preserving structure exists exactly when pulling back the value function and pushing forward actions commute with the controlled generator and reward functional. Further, approximate value-preserving structures with rigorous guarantees can be found when the Hamilton-Jacobi-Bellman mismatch is small. This framework discovers exact and approximate value-preserving structures by searching for the associated Lie group operators. VPSD-RL fits differentiable drift, diffusion, and reward models; learns infinitesimal generators via determining-equation residual minimization; exponentiates them with ODE flows to obtain finite transformations; and integrates them into continuous RL through transition augmentation and transformation-consistency regularization. We show that bounded generator/reward mismatch implies quantitative stability of the optimal value function along approximate orbits, with sensitivity governed by the effective horizon, and observe improved data efficiency and robustness on continuous-control benchmarks.
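
The pipeline described in the abstract (score a candidate generator by its determining-equation residual, exponentiate it with an ODE flow, then augment transitions) can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the 2-D system, the hand-written `drift` and `reward`, the restriction to linear generators, the finite-difference residual, and the fixed-step RK4 integrator are all hypothetical stand-ins for the learned models and ODE flows the paper refers to. The diffusion condition is omitted.

```python
import math

# Toy controlled system (an assumption for illustration, not from the paper):
#   drift b(x, a) = -x + a,  reward r(x, a) = -|x|^2 - |a|^2,  x, a in R^2.
def drift(x, a):
    return (-x[0] + a[0], -x[1] + a[1])

def reward(x, a):
    return -(x[0] ** 2 + x[1] ** 2) - (a[0] ** 2 + a[1] ** 2)

def matvec(M, v):
    return (M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1])

def axpy(v, h, w):
    """Return v + h * w for 2-vectors."""
    return (v[0] + h * w[0], v[1] + h * w[1])

def residuals(M, x, a, h=1e-4):
    """Infinitesimal invariance residuals for the linear generator v -> M v,
    applied to both state and action, via central differences."""
    xi, eta = matvec(M, x), matvec(M, a)
    bp = drift(axpy(x, h, xi), axpy(a, h, eta))
    bm = drift(axpy(x, -h, xi), axpy(a, -h, eta))
    Mb = matvec(M, drift(x, a))  # Jacobian of the state field applied to b
    drift_res = max(abs((bp[i] - bm[i]) / (2 * h) - Mb[i]) for i in range(2))
    rp = reward(axpy(x, h, xi), axpy(a, h, eta))
    rm = reward(axpy(x, -h, xi), axpy(a, -h, eta))
    return drift_res, abs((rp - rm) / (2 * h))

def flow(M, v, eps, steps=200):
    """Exponentiate the vector field v -> M v up to parameter eps with RK4."""
    h = eps / steps
    for _ in range(steps):
        k1 = matvec(M, v)
        k2 = matvec(M, axpy(v, 0.5 * h, k1))
        k3 = matvec(M, axpy(v, 0.5 * h, k2))
        k4 = matvec(M, axpy(v, h, k3))
        v = (v[0] + h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6,
             v[1] + h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6)
    return v

def augment(transition, M, eps):
    """Map (x, a, r, x_next) along the flow; the reward is kept unchanged,
    which is only justified when the residuals above are (near) zero."""
    x, a, r, x_next = transition
    return (flow(M, x, eps), flow(M, a, eps), r, flow(M, x_next, eps))

ROT = ((0.0, -1.0), (1.0, 0.0))    # rotation generator: a symmetry of the toy system
SCALE = ((1.0, 0.0), (0.0, 1.0))   # scaling generator: changes the reward

if __name__ == "__main__":
    x, a = (0.7, -0.2), (0.3, 0.5)
    print("rotation residuals:", residuals(ROT, x, a))
    print("scaling residuals: ", residuals(SCALE, x, a))
```

In this toy setting the rotation generator leaves both residuals at numerical zero, while the scaling generator satisfies the drift condition but not the reward condition, so only rotations would be accepted for augmentation. A learned version would replace the fixed matrices with a parameterized vector field and minimize the residuals over sampled states and actions.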

2 Citations

0 Influential

3.5 Altmetric

19.5 Score

