2604.24357v1 Apr 27, 2026 cs.LG


DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models

Andi Han

Citations: 107

h-index: 6

Taiji Suzuki

Citations: 110

h-index: 5

Dake Bu

City University of Hong Kong

Citations: 23

h-index: 3

Wei Huang

Citations: 52

h-index: 5

Hau-San Wong

Citations: 21

h-index: 3

Atsushi Nitanda

Citations: 703

h-index: 15

Qingfu Zhang

Citations: 50

h-index: 4


Original Abstract

Diffusion language models generate without a fixed left-to-right order, making token ordering a central algorithmic choice: which tokens should be revealed, retained, revised, or verified at each step? Existing systems mainly use random masking or confidence-driven ordering. Random masking creates a train-test mismatch, while confidence-only rules are efficient but can be myopic and suppress useful exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. DPRM keeps the host architecture, denoising objective, and supervision unchanged, and changes only the ordering policy. It starts from confidence-driven progressive ordering and gradually shifts to Doob h-transform process-reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove O(1/N) convergence of the stagewise Soft-BoN approximation, and show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates. Under tractable optimization assumptions, DPRM also yields a sample-complexity advantage over random and confidence-only ordering. DPRM improves over confidence-based baselines in pretraining, post-training, test-time scaling, and single-cell masked diffusion, with particularly strong gains on harder reasoning subsets. In protein, molecular generation, and DNA design, the effect is more multi-objective: ordering-aware variants significantly improve selected structural or fragment-constrained metrics while not uniformly dominating the host baseline on every quality metric. These results identify token ordering as a fundamental control axis in diffusion language models and establish DPRM as a general-purpose module for improving it. Code is available at https://github.com/DakeBU/DPRM-DLLM.
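The abstract names a reward-tilted Gibbs reveal law and a Soft-BoN approximation but does not spell out either construction. The sketch below is only an illustration of the general idea, not the authors' implementation. It assumes that the base reveal policy is a softmax over per-position confidence scores, that the tilted law is proportional to the base probability times exp(reward / temperature), and that Soft-BoN means drawing N candidates from the base policy and resampling one of them with weights exp(reward / temperature). All function names, the `temperature` parameter, and the toy `confidence` and `reward` values are hypothetical; in particular, `reward` is a placeholder array, not the paper's Doob h-transform process reward.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gibbs_reveal_law(confidence, reward, temperature=1.0):
    """Assumed exact tilted law: p(i) proportional to base(i) * exp(reward[i] / temperature)."""
    base = softmax(confidence)
    weights = [b * math.exp(r / temperature) for b, r in zip(base, reward)]
    total = sum(weights)
    return [w / total for w in weights]


def soft_bon_pick(confidence, reward, n, temperature=1.0, rng=random):
    """Draw n candidate positions from the base policy, then resample one
    candidate with probability proportional to exp(reward / temperature)."""
    base = softmax(confidence)
    positions = list(range(len(base)))
    candidates = rng.choices(positions, weights=base, k=n)
    tilt = [math.exp(reward[i] / temperature) for i in candidates]
    return rng.choices(candidates, weights=tilt, k=1)[0]


def empirical_law(confidence, reward, n, trials, seed=0):
    """Monte Carlo estimate of the distribution induced by soft_bon_pick."""
    rng = random.Random(seed)
    counts = [0] * len(confidence)
    for _ in range(trials):
        counts[soft_bon_pick(confidence, reward, n, rng=rng)] += 1
    return [c / trials for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    # Toy inputs: position 0 is the most confident, positions 2 and 3 carry reward.
    confidence = [2.0, 1.0, 0.0, -1.0]
    reward = [0.0, 0.0, 2.0, 2.0]
    exact = gibbs_reveal_law(confidence, reward)
    for n in (1, 4, 16, 64):
        approx = empirical_law(confidence, reward, n=n, trials=5000)
        print(n, round(total_variation(approx, exact), 3))
```

With N = 1 the resampling step has a single candidate, so the sketch reduces to pure confidence-based ordering; larger N moves the induced distribution toward the assumed tilted law. This is the behavior of standard sampling-importance-resampling and is meant only to make the O(1/N) statement concrete under the assumptions above.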

0 Citations

0 Influential

30.9657359028 Altmetric

154.8 Score
