2605.05812v1 May 07, 2026 cs.AI

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

L. Shi

Armaan A. Abraham

Chelsea Finn

Abstract

Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a practical stabilization mechanism for Q-learning by using a hinge loss to penalize violations of these bounds. Importantly, LQL computes these penalties using network outputs already produced for the TD error, requiring no auxiliary networks and no additional forward passes relative to Q-learning. When combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step TD and n-step TD learning at similar runtime.
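The inequality in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the linear form of the hinge, the `hinge_weight` parameter, and the way the two terms are combined are all assumptions made for illustration. The idea is that Q(s_t, a_t) should be at least the discounted sum of the n observed rewards plus the discounted greedy value at s_{t+n}, and only violations of that bound are penalized. The greedy values passed in are assumed to be quantities a training step has already computed for its TD targets.

```python
def n_step_lower_bound(rewards, bootstrap_value, gamma):
    """Return sum_k gamma^k * r_k + gamma^n * bootstrap_value for n observed rewards."""
    n = len(rewards)
    discounted_rewards = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted_rewards + (gamma ** n) * bootstrap_value


def hinge_penalty(q_sa, lower_bound):
    """Penalize only when the estimate falls below the bound (assumed linear hinge)."""
    return max(0.0, lower_bound - q_sa)


def lql_style_loss(q_sa, rewards, max_q_next, max_q_n, gamma, hinge_weight=1.0):
    """Hypothetical combination: squared 1-step TD error plus a weighted hinge term.

    q_sa        -- current estimate Q(s_t, a_t)
    rewards     -- observed rewards r_t, ..., r_{t+n-1}
    max_q_next  -- max_a Q(s_{t+1}, a), used for the 1-step TD target
    max_q_n     -- max_a Q(s_{t+n}, a), used for the n-step lower bound
    """
    td_target = rewards[0] + gamma * max_q_next
    td_loss = (q_sa - td_target) ** 2
    bound = n_step_lower_bound(rewards, max_q_n, gamma)
    return td_loss + hinge_weight * hinge_penalty(q_sa, bound)
```

Because the hinge is one-sided, an estimate that already exceeds the bound contributes nothing, so the 1-step TD term alone drives the update in that case. The bound holds in expectation, so in a stochastic environment a single observed trajectory gives only a sample of it.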
