2603.14245v1 Mar 15, 2026 cs.LG

GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies

Ying Sun (Citations: 181, h-index: 6)
He Zhang (Citations: 966, h-index: 7)
Hui Xiong (Citations: 15, h-index: 2)

Abstract

Flow-matching policies hold great promise for reinforcement learning (RL) because they can capture complex, multi-modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one-step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor with significant untapped potential. This factor, together with the challenge of controlling policy stochasticity, defines two critical areas for advancing distilled flow-matching policies. To address both, we propose GoldenStart (GSFlow), a policy distillation method with Q-guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q-guided prior modeled by a conditional VAE. This state-conditioned prior repositions the starting points of the one-step generation process into high-Q regions, effectively providing a "golden start" that shortcuts the policy to promising actions. Furthermore, for effective online exploration, we enable our distilled actor to output a stochastic distribution instead of a deterministic point. This is governed by entropy regularization, allowing the policy to shift from pure exploitation to principled exploration. Our integrated framework demonstrates that by designing the generative starting point and explicitly controlling policy entropy, it is possible to achieve efficient and exploratory policies, bridging generative models and practical actor-critic methods. We conduct extensive experiments on offline and online continuous control benchmarks, where our method significantly outperforms prior state-of-the-art approaches. Code will be available at https://github.com/ZhHe11/GSFlow-RL.
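
The two ideas in the abstract can be illustrated with a toy, one-dimensional sketch. It is not the authors' implementation: the paper models the Q-guided prior with a conditional VAE, whereas this sketch swaps in a simple best-of-K selection over Gaussian noise to show the effect of starting generation in a high-Q region. The critic `toy_q`, the actor `one_step_actor`, and every other name and constant here are hypothetical and chosen only for illustration.

```python
import math
import random


def toy_q(state, action):
    """Hypothetical critic: largest when the action equals state + 1."""
    return -(action - (state + 1.0)) ** 2


def one_step_actor(state, z):
    """Hypothetical distilled actor: maps a start point z to an action mean in one step."""
    return state + 0.5 * z


def uninformed_start(rng):
    """Baseline: draw the start point from a standard Gaussian."""
    return rng.gauss(0.0, 1.0)


def q_guided_start(state, rng, num_candidates=8):
    """Best-of-K stand-in (an assumption) for a learned state-conditioned prior.

    Draws K Gaussian start points and keeps the one whose one-step action
    scores highest under the critic.
    """
    candidates = [rng.gauss(0.0, 1.0) for _ in range(num_candidates)]
    return max(candidates, key=lambda z: toy_q(state, one_step_actor(state, z)))


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian with standard deviation std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def entropy_regularized_objective(state, z, std, alpha):
    """Q at the actor's mean action plus an entropy bonus weighted by alpha."""
    return toy_q(state, one_step_actor(state, z)) + alpha * gaussian_entropy(std)


def average_q(start_fn, state, trials=1000):
    """Average critic value of one-step actions generated from start_fn()."""
    total = 0.0
    for _ in range(trials):
        total += toy_q(state, one_step_actor(state, start_fn()))
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    state = 0.0
    print("uninformed start:", average_q(lambda: uninformed_start(rng), state))
    print("Q-guided start:  ", average_q(lambda: q_guided_start(state, rng), state))
    print("objective, alpha=0.1:", entropy_regularized_objective(state, 2.0, 1.0, 0.1))
```

In this toy setting, a larger `alpha` rewards a wider action distribution (more exploration), while `alpha = 0` reduces the objective to pure exploitation of the critic.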
