V. Lai

Total Citations: 161
h-index: 5
Papers: 2

Publications

#1 2605.26013v1 May 25, 2026

AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models

We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes the reverse process, we optimize an advantage-weighted forward-process prediction loss. This optimization problem is unstable when advantages are negative and the loss becomes non-convex. We stabilize it by rollout policy regularization, which reduces variance and arises from fitting a local reward-improving target distribution. We evaluate AdvantageFlow on image generation tasks with Stable Diffusion 3.5 Medium. It outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning.

B. Kveton, Subhojyoti Mukherjee, V. Lai, K. Singh, Anup Rao
0 Citations
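
As a rough illustration of the advantage-weighted forward-process loss described in the AdvantageFlow abstract, the sketch below weights a per-sample least-squares velocity error by an advantage. It is not the authors' implementation. The function name, the convention that `x0` is noise and `x1` is the generated sample, and the omission of the rollout policy regularization are all assumptions made for this example.

```python
import numpy as np

def advantage_weighted_flow_loss(v_pred, x0, x1, advantages):
    """Advantage-weighted least-squares loss on a rectified-flow velocity target.

    Hypothetical helper, not taken from the paper.
    v_pred:     (n, d) predicted velocities at interpolated points x_t
    x0:         (n, d) noise samples (assumed convention)
    x1:         (n, d) generated samples from rollouts (assumed convention)
    advantages: (n,)   per-sample advantages, e.g. reward minus batch mean
    """
    # For the straight-line path x_t = (1 - t) * x0 + t * x1,
    # the velocity target is constant in t.
    target = x1 - x0
    # Squared prediction error per sample.
    per_sample = np.sum((v_pred - target) ** 2, axis=1)
    # Weight each sample's error by its advantage and average.
    return float(np.mean(advantages * per_sample))


# With negative advantages the loss decreases as the prediction error grows,
# so it is unbounded below. This is one way to see the instability the
# abstract mentions.
x0 = np.zeros((2, 2))
x1 = np.ones((2, 2))
neg = np.array([-1.0, -1.0])
small_error = advantage_weighted_flow_loss(np.zeros((2, 2)), x0, x1, neg)
large_error = advantage_weighted_flow_loss(10 * np.ones((2, 2)), x0, x1, neg)
print(small_error, large_error)
```

A practical version would add a regularizer toward the rollout policy, which the abstract names as the stabilizing ingredient; its exact form is not given here, so it is left out.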
#2 2603.06138v1 Mar 06, 2026

Partial Policy Gradients for RL in LLMs

Reinforcement learning is a framework for learning to act sequentially in an unknown environment. We propose a natural approach for modeling policy structure in policy gradients. The key idea is to optimize for a subset of future rewards: smaller subsets represent simpler policies, which can be learned more reliably because their empirical gradient estimates are more accurate. Our approach allows for modeling and comparison of different policy classes, including full planning, greedy, K-step lookahead, and segment policies. We evaluate the policies empirically on multiple persona-alignment conversational problems. Different policies excel in different problems, reflecting their different characteristics and highlighting the importance of our studied policy class.

Puneet Mathur, B. Kveton, Subhojyoti Mukherjee, V. Lai
0 Citations
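
The idea in the Partial Policy Gradients abstract of optimizing for a subset of future rewards can be sketched with windowed reward sums. The example below is a minimal illustration, not the authors' method. The helper name `partial_returns` and the mapping of "greedy" to a horizon of 1 and "full planning" to a horizon equal to the episode length are assumptions; segment policies are not covered.

```python
import numpy as np

def partial_returns(rewards, horizon):
    """Sum of the next `horizon` rewards starting at each step t (inclusive).

    Hypothetical helper: in a policy gradient estimate, these values would
    replace the full reward-to-go that multiplies grad log pi(a_t | s_t).
    """
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    # Prefix sums make each window sum a difference of two entries.
    prefix = np.concatenate([[0.0], np.cumsum(rewards)])
    # Windows are truncated at the end of the episode.
    ends = np.minimum(np.arange(T) + horizon, T)
    return prefix[ends] - prefix[:T]


rewards = [1.0, 2.0, 3.0, 4.0]
print(partial_returns(rewards, 1))  # greedy: immediate reward only
print(partial_returns(rewards, 2))  # 2-step lookahead
print(partial_returns(rewards, 4))  # full planning: entire reward-to-go
```

Shorter horizons sum fewer noisy reward terms, which matches the abstract's claim that simpler policies have more accurate empirical gradient estimates, at the cost of ignoring later rewards.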