2603.18806v1 Mar 19, 2026 cs.AI


dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models

Lemeng Wu


Changsheng Zhao

Meta


Ernie Chang


Mingchen Zhuge


Zechun Liu


Andy Su


Hanxian Huang


Chong Zhou


Raghuraman Krishnamoorthi


Vikas Chandra


Mohamed Elhoseiny


Wei Wen


Wenxuan Zhang


Jun Chen


Original Abstract

Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass of a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks. Results show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Moreover, dTRPO exhibits strong training efficiency due to its offline, single-forward nature, and achieves improved generation efficiency through high-quality outputs.
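
The two reductions named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract states only that the reductions are integrated into a policy optimization objective under reference policy regularization, so the DPO-style logistic loss, the `MASK` id, the uniform random re-masking, and every function name below are assumptions made for illustration. The per-position log-probabilities stand in for what a single forward pass on the re-masked final state might return.

```python
import math
import random

MASK = -1  # hypothetical mask token id, chosen only for this sketch


def remask(tokens, mask_ratio, rng):
    """Re-mask a random subset of a finished sequence (assumed uniform sampling)."""
    n = len(tokens)
    k = max(1, round(mask_ratio * n))
    masked_positions = sorted(rng.sample(range(n), k))
    noisy = list(tokens)
    for i in masked_positions:
        noisy[i] = MASK
    return noisy, masked_positions


def masked_log_ratio(policy_logprobs, ref_logprobs, masked_positions):
    """Sum log pi(x_i) - log pi_ref(x_i) over the masked positions only.

    Both inputs are per-position log-probabilities of the true token, as one
    forward pass of each model on the re-masked sequence would provide.
    """
    return sum(policy_logprobs[i] - ref_logprobs[i] for i in masked_positions)


def preference_loss(ratio_chosen, ratio_rejected, beta=0.1):
    """DPO-style logistic loss on the log-ratio margin (an assumed objective)."""
    margin = beta * (ratio_chosen - ratio_rejected)
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    rng = random.Random(0)
    noisy, positions = remask([11, 12, 13, 14, 15, 16, 17, 18], 0.5, rng)
    print(noisy, positions)
    # Toy log-probabilities; a real model would produce these.
    chosen = masked_log_ratio([-1.0, -2.0, -3.0], [-1.5, -2.0, -3.5], [0, 2])
    rejected = masked_log_ratio([-1.0, -2.0, -3.0], [-1.0, -2.0, -2.0], [0, 2])
    print(preference_loss(chosen, rejected))
```

In this sketch only the masked positions contribute to the ratio, so each response costs one forward pass per model rather than one pass per denoising step, which is the cost saving the abstract attributes to the single-forward design.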
