2606.06096v1 Jun 04, 2026 cs.LG

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

Takeshi Kojima

Citations: 7,530

h-index: 7

Yusuke Iwasawa

Citations: 10,847

h-index: 23

Yutaka Matsuo

Citations: 1,565

h-index: 13

Shota Takashiro

Citations: 33

h-index: 2

Soichiro Nishimori

Citations: 89

h-index: 4

Paavo Parmas

Citations: 247

h-index: 6

Kohsei Matsutani

Citations: 33

h-index: 3

Yongmin Kim

Citations: 8

h-index: 1

Policy-gradient methods usually optimize expected return, but many real-world applications care about distributional properties of the return: tail risk, robustness to outliers, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, and recovers objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator of the corresponding order-statistic objective. The method is implemented as a simple reward transformation whose output can be used in an otherwise standard policy-gradient or reparameterized update. We study the variance behavior of the resulting estimator and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad
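To make the "reward transformation" idea concrete, here is a minimal sketch of a likelihood-ratio version. It is an illustration under stated assumptions and not the estimator from the paper or its repository. The function names (`l_statistic`, `transformed_rewards`, and the weight helpers) are hypothetical, and the leave-one-out baseline is a standard variance-reduction device chosen for this example. The sketch assumes K independent samples from the policy and ranks sorted in ascending order. Each sample is credited with the L-statistic of the whole batch minus a baseline computed from the other K-1 samples; because that baseline does not depend on the sample it is applied to, the score-function estimator remains unbiased for the fixed-K objective.

```python
import numpy as np


def l_statistic(rewards, weights):
    """Weighted sum of rewards sorted ascending: sum_i w_i * R_(i)."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if rewards.shape != weights.shape:
        raise ValueError("rewards and weights must have the same shape")
    return float(np.sort(rewards) @ weights)


def cvar_weights(k, m):
    """Average of the m lowest of k rewards (a lower-tail, CVaR-style objective)."""
    w = np.zeros(k)
    w[:m] = 1.0 / m
    return w


def best_of_k_weights(k):
    """All weight on the largest of k rewards."""
    w = np.zeros(k)
    w[-1] = 1.0
    return w


def median_weights(k):
    """Sample median: middle rank, or the mean of the two middle ranks."""
    w = np.zeros(k)
    if k % 2 == 1:
        w[k // 2] = 1.0
    else:
        w[k // 2 - 1] = w[k // 2] = 0.5
    return w


def transformed_rewards(rewards, weights, baseline_weights=None):
    """Per-sample signal for a score-function (likelihood-ratio) update.

    Every sample receives the batch L-statistic. If baseline_weights
    (length k-1) is given, sample i is additionally offset by the
    L-statistic of the other k-1 rewards. This baseline is an assumption
    of this sketch, not taken from the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    k = rewards.size
    value = l_statistic(rewards, weights)
    if baseline_weights is None:
        return np.full(k, value)
    baselines = np.array(
        [l_statistic(np.delete(rewards, i), baseline_weights) for i in range(k)]
    )
    return value - baselines


# Best-of-3 objective with a "best of the other two" baseline.
adv = transformed_rewards([3.0, 1.0, 2.0], best_of_k_weights(3), best_of_k_weights(2))
```

In a standard policy-gradient loop the returned vector would take the place of the usual advantages, for example in a surrogate loss of the form `-(adv * log_probs).sum()`, where `log_probs` holds the log-probabilities of the K sampled outputs. In the best-of-3 example only the top sample receives a nonzero signal, equal to its margin over the runner-up.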

