2606.06096v1 Jun 04, 2026 cs.LG

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

Takeshi Kojima
Yusuke Iwasawa
Yutaka Matsuo
Shota Takashiro
Soichiro Nishimori
Paavo Parmas
Kohsei Matsutani
Yongmin Kim

Policy-gradient methods usually optimize expected return, but many real-world applications depend on distributional properties of the return, such as tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, and recovers objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can be used in an otherwise standard policy-gradient or reparameterized update. We study the variance behavior of the resulting estimator and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad
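To make the phrase "weighted averages of sorted rewards" concrete, the following is a minimal sketch of a finite-sample L-statistic and a few rank-weight vectors of the kind the abstract lists. It illustrates only the objective, not the OrderGrad gradient estimator or its reward transformation, which the abstract does not specify. All function names here are hypothetical and are not taken from the linked repository, and the lower-tail average is one common finite-sample convention for a CVaR-style criterion on rewards, assumed here for illustration.

```python
import numpy as np

def l_statistic(rewards, weights):
    """Return sum_k weights[k] * r_(k), where r_(1) <= ... <= r_(K) are the sorted rewards."""
    r = np.sort(np.asarray(rewards, dtype=float))  # ascending order statistics
    w = np.asarray(weights, dtype=float)
    if r.shape != w.shape:
        raise ValueError("need exactly one rank weight per sample")
    return float(np.dot(w, r))

# Hypothetical helpers: each builds a rank-weight vector for a sample of size k.

def best_of_k_weights(k):
    """All weight on the largest reward."""
    w = np.zeros(k)
    w[-1] = 1.0
    return w

def median_weights(k):
    """Weight on the middle order statistic (or the two middle ones if k is even)."""
    w = np.zeros(k)
    if k % 2 == 1:
        w[k // 2] = 1.0
    else:
        w[k // 2 - 1] = 0.5
        w[k // 2] = 0.5
    return w

def lower_tail_weights(k, m):
    """Uniform weight on the m smallest rewards (assumed CVaR-style convention)."""
    w = np.zeros(k)
    w[:m] = 1.0 / m
    return w

def trimmed_mean_weights(k, trim):
    """Uniform weight after discarding `trim` samples from each end."""
    w = np.zeros(k)
    w[trim:k - trim] = 1.0 / (k - 2 * trim)
    return w

rewards = [3.0, -1.0, 10.0, 2.0, 0.0]  # sorted: -1, 0, 2, 3, 10
k = len(rewards)

best = l_statistic(rewards, best_of_k_weights(k))           # largest reward
median = l_statistic(rewards, median_weights(k))            # middle reward
tail = l_statistic(rewards, lower_tail_weights(k, 2))       # mean of the two worst
trimmed = l_statistic(rewards, trimmed_mean_weights(k, 1))  # mean without extremes
mean = l_statistic(rewards, np.full(k, 1.0 / k))            # ordinary sample mean
```

With uniform weights the L-statistic reduces to the ordinary sample mean, so the usual expected-return objective is one member of the same family; changing only the weight vector switches between the other criteria.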
