2602.12829v1 Feb 13, 2026 cs.LG

FLAC: Maximum Entropy Reinforcement Learning via Kinetic Energy Regularization

FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching

Lei Lv (Citations: 29, h-index: 2)

Yunfei Li (Citations: 34, h-index: 3)

Yu Luo (Citations: 97, h-index: 5)

Fuchun Sun (Citations: 352, h-index: 9)

Xiao Ma (Citations: 861, h-index: 13)

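Iterative generative policies such as diffusion models and flow matching offer greater expressivity for continuous control, but they complicate maximum entropy reinforcement learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field. Our key idea is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., a uniform distribution). From this perspective, the maximum-entropy principle arises naturally as optimizing return while staying close to a high-entropy reference process, without requiring explicit action densities. Within this framework, kinetic energy serves as a physically grounded proxy for deviation from the reference process: minimizing the path-space energy bounds the deviation of the induced terminal action distribution. Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy through a Lagrangian dual mechanism. Empirically, FLAC achieves superior or comparable performance to strong baselines on high-dimensional benchmarks while avoiding explicit density estimation.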

Original Abstract

Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field. Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to a high-entropy reference while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution. Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy via a Lagrangian dual mechanism. Empirically, FLAC achieves superior or comparable performance on high-dimensional benchmarks relative to strong baselines, while avoiding explicit density estimation.
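The abstract gives no equations or pseudocode, so the following is only a minimal sketch of the two mechanisms it names: a kinetic-energy penalty on the velocity field, and a Lagrangian dual update that tunes the penalty weight. Everything here is an assumption made for illustration rather than the paper's implementation: the deterministic Euler rollout, the uniform initial sample standing in for the high-entropy reference, the constant placeholder velocity field, and the names `rollout_with_energy`, `actor_loss`, `dual_update`, and `target_energy`.

```python
import numpy as np

def rollout_with_energy(velocity_fn, a0, n_steps=10):
    """Euler-integrate da/dt = v(a, t) on [0, 1] and accumulate the
    kinetic energy 0.5 * integral of ||v||^2 dt along the path.
    (Assumed discretization; the paper may use a stochastic bridge.)"""
    a = np.asarray(a0, dtype=float).copy()
    dt = 1.0 / n_steps
    energy = 0.0
    for k in range(n_steps):
        v = velocity_fn(a, k * dt)
        energy += 0.5 * float(np.dot(v, v)) * dt
        a = a + v * dt
    return a, energy

def actor_loss(q_value, energy, lam):
    """Hypothetical actor objective: maximize Q while penalizing path energy."""
    return -q_value + lam * energy

def dual_update(lam, energy, target_energy, lr=0.1):
    """Projected dual ascent on the multiplier: lam grows when the measured
    energy exceeds the (hypothetical) target and shrinks otherwise."""
    return max(0.0, lam + lr * (energy - target_energy))

# Toy usage: the initial action is drawn from a uniform reference and
# transported by a constant velocity field (placeholder for a learned network).
rng = np.random.default_rng(0)
a0 = rng.uniform(-1.0, 1.0, size=2)
drift = np.array([1.0, -2.0])
action, energy = rollout_with_energy(lambda a, t: drift, a0)
lam = dual_update(1.0, energy, target_energy=1.0)
```

With a zero velocity field the action stays at its uniform sample and the energy is zero, which is the sense in which a small energy keeps the policy close to the high-entropy reference in this toy setting. The dual step plays a role loosely analogous to automatic temperature tuning in entropy-regularized actor-critic methods, but the exact constraint used by FLAC is not stated in the abstract.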

4 Citations

0 Influential

6.5 Altmetric

36.5 Score

