2605.06149v1 May 07, 2026 cs.LG

AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

Yu Zhang (Citations: 40, h-index: 3)

Jianting Pan (Citations: 5, h-index: 1)

Ran Tian (Citations: 83, h-index: 5)

Hengle Qin (Citations: 58, h-index: 1)

Xiaoyang Li (Citations: 110, h-index: 3)

Tianshu Yu (Citations: 41, h-index: 2)

Yaomin Wang (Citations: 2, h-index: 1)

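In reinforcement learning, the discount factor controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value for every state. State-dependent discounting is conceptually appealing, but naive deep actor-critic implementations can become unstable and lead to TD-error collapse. This paper proposes AdaGamma, a practical deep actor-critic method for state-dependent discounting. AdaGamma learns a state-dependent discount function and pairs it with a return-consistency objective that regularizes the induced backup structure. On the theory side, the paper analyzes the Bellman operator induced by state-dependent discounting and establishes its basic well-posedness properties under suitable conditions. Empirically, AdaGamma is integrated into SAC and PPO, shows consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be applied effectively in deep RL when combined with a return-consistency objective, which prevents degenerate target manipulation.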

Original Abstract

The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor-critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor-critic method for state-dependent discounting that learns a state-dependent discount function together with a return-consistency objective to regularize the induced backup structure. On the theory side, we analyze the Bellman operator induced by state-dependent discounting and establish its basic well-posedness properties under suitable conditions. Empirically, AdaGamma integrates into both SAC and PPO, yielding consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be made effective in deep RL when coupled with a return-consistency objective that prevents degenerate target manipulation.
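
To make the idea of a state-dependent backup concrete, here is a minimal tabular sketch in Python. It is not AdaGamma: the abstract does not give the exact form of the operator, the learned discount network, or the return-consistency objective, so everything below is an assumption. The sketch assumes one common convention, V(s) = r(s) + γ(s) · Σ P(s'|s) V(s'), with a hand-picked discount per state; the names `td_target` and `evaluate_policy` are hypothetical.

```python
from typing import List


def td_target(reward: float, next_value: float, gamma_s: float) -> float:
    """One-step target using a per-state discount gamma_s (assumed in [0, 1))."""
    return reward + gamma_s * next_value


def evaluate_policy(
    P: List[List[float]],   # P[s][t]: transition probability s -> t under a fixed policy
    r: List[float],         # r[s]: expected reward in state s
    gamma: List[float],     # gamma[s]: hand-picked discount for state s (not learned here)
    sweeps: int = 500,
) -> List[float]:
    """Repeatedly apply V(s) <- r(s) + gamma(s) * sum_t P[s][t] * V(t)."""
    n = len(r)
    v = [0.0] * n
    for _ in range(sweeps):
        # Synchronous sweep: every state is updated from the previous estimate.
        v = [
            td_target(r[s], sum(P[s][t] * v[t] for t in range(n)), gamma[s])
            for s in range(n)
        ]
    return v


# Toy chain: state 0 moves to state 1, and state 1 loops on itself with reward 1.
P = [[0.0, 1.0], [0.0, 1.0]]
r = [0.0, 1.0]
gamma = [0.5, 0.9]  # short horizon in state 0, longer horizon in state 1
values = evaluate_policy(P, r, gamma)
```

For this assumed form, the backup is a sup-norm contraction with modulus max γ(s) whenever every γ(s) is strictly below 1, which is why the iteration settles here: V(1) solves V = 1 + 0.9·V, giving 10, and V(0) = 0.5 · 10 = 5. The conditions the paper actually uses for well-posedness may differ.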

0 Citations

0 Influential

2.5 Altmetric

12.5 Score
