2602.07441v1 Feb 07, 2026 cs.LG

Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

Jinzong Dong

Citations: 41

h-index: 5

Wei Huang

Citations: 4

h-index: 1

Jianshu Zhang

Citations: 27

h-index: 3

Zhuo Chen

Citations: 18

h-index: 2

Xin Yuan

Citations: 57

h-index: 2

Qinying Gu

Citations: 194

h-index: 9

Zhaohui (Zoey) Jiang

Citations: 82

h-index: 3

Nan Ye

Citations: 3

h-index: 1

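Offline reinforcement learning (RL), an important branch of RL, optimizes policies from a previously collected static dataset. One widely used approach regularizes actor-critic methods with behavior cloning (BC). This yields realistic policies and mitigates bias from out-of-distribution actions, but it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation keeps the actor from fully exploiting the high-value regions suggested by the critic, and the problem worsens in later training, when imitation is already dominant. We formalize this limitation by analyzing the convergence properties of BC-regularized actor-critic optimization and verify it on a controlled continuous bandit task. To overcome it, we propose proximal action replacement (PAR), a plug-and-play training-sample replacement technique that progressively replaces low-value actions with high-value actions generated by a stable actor. PAR is compatible with a range of BC regularization paradigms. Extensive experiments on offline RL benchmarks show that PAR consistently improves performance and, combined with the basic TD3+BC, approaches state-of-the-art results.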

Original Abstract

Offline reinforcement learning (RL) optimizes policies from a previously collected static dataset and is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which yields realistic policies and mitigates bias from out-of-distribution actions, but can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting high-value regions suggested by the critic, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), a plug-and-play training sample replacer that progressively replaces low-value actions with high-value actions generated by a stable actor, broadening the action exploration space while reducing the impact of low-value data. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance and approaches state-of-the-art when combined with the basic TD3+BC.
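
The performance ceiling and the effect of action replacement described in the abstract can be illustrated on a one-state continuous bandit. The sketch below is not the paper's algorithm or its experimental setup. It assumes a hypothetical exact critic Q(a) = -(a - A_STAR)^2, a TD3+BC-style objective lam * Q(a) - mean_i (a - a_i)^2 whose maximizer has a closed form, and a simplified replacement rule (swap in the actor's action wherever the critic scores it higher than the stored action). All names and constants are illustrative assumptions.

```python
import numpy as np

A_STAR = 1.0  # hypothetical optimal action of the toy bandit (assumption)


def critic(a):
    """Toy critic, assumed exact: Q(a) = -(a - A_STAR)^2."""
    return -(np.asarray(a, dtype=float) - A_STAR) ** 2


def bc_regularized_actor(dataset_actions, lam):
    """Closed-form maximizer of lam * Q(a) - mean_i (a - a_i)^2 for the toy critic.

    Setting the derivative to zero gives a = (lam * A_STAR + mean(a_i)) / (1 + lam),
    which for any finite lam lies between the dataset mean and A_STAR.
    """
    return (lam * A_STAR + np.mean(dataset_actions)) / (1.0 + lam)


def replace_low_value_actions(dataset_actions, actor_action):
    """Simplified replacement rule (assumption, not the paper's exact procedure):
    keep a stored action unless the critic rates the actor's action strictly higher."""
    q_data = critic(dataset_actions)
    q_actor = critic(actor_action)
    return np.where(q_actor > q_data, actor_action, dataset_actions)


def run(dataset_actions, lam, rounds, replace):
    """Alternate actor fitting and (optionally) action replacement."""
    actions = np.asarray(dataset_actions, dtype=float)
    history = []
    for _ in range(rounds):
        a = bc_regularized_actor(actions, lam)
        history.append(float(a))
        if replace:
            actions = replace_low_value_actions(actions, a)
    return history


suboptimal_data = [-0.5, 0.0, 0.5]
baseline = run(suboptimal_data, lam=1.0, rounds=5, replace=False)
with_replacement = run(suboptimal_data, lam=1.0, rounds=5, replace=True)
print("BC-regularized only:", baseline)
print("with replacement:   ", with_replacement)
```

In this toy setting the BC-regularized actor stays at the same compromise between the dataset mean and A_STAR in every round, while the replacement variant halves its distance to A_STAR each round because the imitation target itself moves toward higher-value actions. With a learned, imperfect critic the behavior would depend on how reliable its value estimates are near the data.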

1 Citation

0 Influential

4.5 Altmetric

23.5 Score
