A

Andy Peng

Total Citations
102
h-index
2
Papers
2

Publications

#1 2606.11087v1 Jun 09, 2026

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.

Qiyang Li Sergey Levine Andy Peng Zhiyuan Zhou Charles Xu +2
0 Citations
#2 2601.12816v2 Jan 19, 2026

Fisher-Orthogonal Projected Natural Gradient Descent for Continual Learning

Continual learning aims to enable neural networks to acquire new knowledge on sequential tasks. However, the key challenge in such settings is to learn new tasks without catastrophically forgetting previously learned tasks. We propose the Fisher-Orthogonal Projected Natural Gradient Descent (FOPNG) optimizer, which enforces Fisher-orthogonal constraints on parameter updates to preserve old task performance while learning new tasks. Unlike existing methods that operate in Euclidean parameter space, FOPNG projects gradients onto the Fisher-orthogonal complement of previous task gradients. This approach unifies natural gradient descent with orthogonal gradient methods within an information-geometric framework. We provide theoretical analysis deriving the projected update, describe efficient and practical implementations using the diagonal Fisher, and demonstrate strong results on standard continual learning benchmarks such as Permuted-MNIST, Split-MNIST, Rotated-MNIST, Split-CIFAR10, and Split-CIFAR100. Our code is available at https://github.com/ishirgarg/FOPNG.

Ishir Garg Neel Kolhe Andy Peng Rohan Gopalam
1 Citations