2605.29398v1 May 28, 2026 cs.LG

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Keyue Jiang
Xiaohang Tang
Ilija Bogunovic
Qifang Zhao
Sangwoong Yoon
Xiaoxiao Xu
Che Liu

Reinforcement learning (RL) can improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Although well aligned with pre-training, these approaches use the ELBO as a likelihood surrogate, which introduces bias through training--inference mismatch (TIM) and can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD), which directly distills the denoiser of a dLLM from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of this framework under different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to $+19.6\%$. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
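
For readers unfamiliar with the "closed-form optimum of reverse-KL regularized RL" mentioned above, the standard sequence-level result is sketched below. The notation is an assumption made here for illustration and is not taken from the paper: $\pi_{\mathrm{ref}}$ is a reference policy, $A(x,y)$ an advantage, $\beta > 0$ a regularization strength, and $Z(x)$ a normalizing constant. How GDSD instantiates this at the level of denoiser logits is specified in the paper itself, not here.

$$
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[A(x,y)\big] - \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
\quad\Longrightarrow\quad
\pi^{*}(y \mid x) = \frac{\pi_{\mathrm{ref}}(y \mid x)\,\exp\big(A(x,y)/\beta\big)}{Z(x)}
$$

Taking logarithms gives $\log \pi^{*}(y \mid x) = \log \pi_{\mathrm{ref}}(y \mid x) + A(x,y)/\beta - \log Z(x)$. Because $\log Z(x)$ does not depend on $y$, an objective that is invariant to such a constant shift would not need to compute it, which is one plausible reading of "normalization-free" in the abstract.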
