2601.14234v1 Jan 20, 2026 cs.LG

Q-learning with Adjoint Matching

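This paper proposes QAM (Q-learning with Adjoint Matching), a new TD-based reinforcement learning algorithm that addresses a long-standing challenge in continuous-action reinforcement learning (RL): efficiently optimizing an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires using the critic's first-order information, but for flow or diffusion policies this is difficult because direct optimization by backpropagating through the multi-step denoising process is numerically unstable. Existing methods work around the problem either by discarding the gradient information and using only the value, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM addresses both issues by using adjoint matching, a technique recently proposed in generative modeling. Adjoint matching transforms the critic's action gradient into a step-wise objective that needs no unstable backpropagation and yields an unbiased, expressive policy at the optimum. Combined with temporal-difference (TD) backups for critic learning, QAM consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL settings.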

Original Abstract

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
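The ingredients the abstract names (a multi-step flow policy, the critic's action gradient, and the TD backup) can be illustrated with a small NumPy sketch. This is not the QAM algorithm and does not implement adjoint matching; it only shows the pieces that such a method has to connect. The functions toy_velocity, toy_q and toy_q_action_grad, and the step count, are hypothetical stand-ins chosen for illustration, not anything taken from the paper.

```python
import numpy as np


def sample_action(velocity, state, noise, num_steps=10):
    """Euler-integrate da/dt = velocity(state, a, t) from t=0 (noise) to t=1.

    This is the multi-step denoising process: the final action depends on
    every intermediate step, so naive backpropagation has to pass through
    all num_steps updates.
    """
    a = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        a = a + dt * velocity(state, a, k * dt)
    return a


def toy_velocity(state, a, t):
    # Hypothetical velocity field: pulls the sample toward `state`.
    # A learned flow policy would be a neural network here.
    return state - a


def toy_q(state, a):
    # Hypothetical critic: a quadratic peaked at a = 0.5 * state.
    return -np.sum((a - 0.5 * state) ** 2, axis=-1)


def toy_q_action_grad(state, a):
    # First-order information of the critic: dQ/da for toy_q above.
    return -2.0 * (a - 0.5 * state)


def td_target(reward, discount, next_q, done):
    # Standard one-step temporal-difference backup target for the critic.
    return reward + discount * (1.0 - done) * next_q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state = np.array([1.0, -2.0])
    noise = rng.standard_normal(2)
    action = sample_action(toy_velocity, state, noise)
    print("sampled action:", action)
    print("critic value:", toy_q(state, action))
    print("critic action gradient:", toy_q_action_grad(state, action))
    print("TD target:", td_target(1.0, 0.99, toy_q(state, action), 0.0))
```

In this toy setting the action gradient is available in closed form and the sampler is only ten linear steps, so nothing is unstable. The difficulty the abstract describes arises when the velocity field is a deep network and the gradient of the critic has to be propagated back through every denoising step.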
