2601.23010v1 Jan 30, 2026 cs.LG

Automatic Constraint Policy Optimization based on Continuous Constraint Interpolation Framework for Offline Reinforcement Learning

Xinchen Han (Citations: 28, h-index: 3)

Hossam Afifi (Citations: 66, h-index: 4)

M. Marot (Citations: 452, h-index: 12)

Qiuyang Fang (Citations: 0, h-index: 0)

Original Abstract

Offline Reinforcement Learning (RL) relies on policy constraints to mitigate extrapolation error, where both the constraint form and constraint strength critically shape performance. However, most existing methods commit to a single constraint family: weighted behavior cloning, density regularization, or support constraints, without a unified principle that explains their connections or trade-offs. In this work, we propose Continuous Constraint Interpolation (CCI), a unified optimization framework in which these three constraint families arise as special cases along a common constraint spectrum. The CCI framework introduces a single interpolation parameter that enables smooth transitions and principled combinations across constraint types. Building on CCI, we develop Automatic Constraint Policy Optimization (ACPO), a practical primal--dual algorithm that adapts the interpolation parameter via a Lagrangian dual update. Moreover, we establish a maximum-entropy performance difference lemma and derive performance lower bounds for both the closed-form optimal policy and its parametric projection. Experiments on D4RL and NeoRL2 demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.
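
The abstract does not give the CCI objective or the exact ACPO update, so the following is only a generic illustration of what a primal-dual scheme with a Lagrangian dual update looks like, not the authors' algorithm. It uses a toy scalar problem (maximize -(x - 2)^2 subject to x <= 1); the function name, the toy objective, and the step sizes are all assumptions made for this sketch.

```python
def primal_dual_toy(steps=2000, lr_primal=0.05, lr_dual=0.05):
    """Projected primal-dual gradient iteration on a toy constrained problem.

    Toy problem (hypothetical, not from the paper):
        maximize  f(x) = -(x - 2)^2   subject to  g(x) = x - 1 <= 0
    Lagrangian: L(x, lam) = f(x) - lam * g(x), with multiplier lam >= 0.
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: gradient ascent on L with respect to x.
        grad_x = -2.0 * (x - 2.0) - lam
        x += lr_primal * grad_x
        # Dual step: raise lam when the constraint is violated (x > 1),
        # lower it otherwise, and project back onto lam >= 0.
        lam = max(0.0, lam + lr_dual * (x - 1.0))
    return x, lam


if __name__ == "__main__":
    x_star, lam_star = primal_dual_toy()
    print(f"x = {x_star:.4f}, lambda = {lam_star:.4f}")
```

In this toy setting the multiplier settles at the value that makes the constraint exactly active. By analogy, a dual update of this kind could tune a constraint-strength or interpolation parameter automatically instead of by hand, though how ACPO does so is specified in the paper itself rather than here.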

0 Citations

0 Influential

6 Altmetric

30.0 Score
