2601.22450v1 Jan 30, 2026 cs.LG

Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity

Abstract

Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain understudied compared to their auto-regressive counterparts. In this work, we investigate these properties in the setting of the $k$-parity problem (computing the XOR sum of $k$ relevant bits), where neural networks typically exhibit grokking: a prolonged plateau of chance-level performance followed by sudden generalization. We theoretically decompose the Masked Diffusion (MD) objective into a Signal regime, which drives feature learning, and a Noise regime, which serves as an implicit regularizer. By training nanoGPT with the MD objective on the $k$-parity problem, we demonstrate that the MD objective fundamentally alters the learning landscape, enabling rapid and simultaneous generalization without grokking. Furthermore, we leverage our theoretical insights to optimize the distribution of the mask probability in the MD objective. Our method significantly improves perplexity for 50M-parameter models and achieves superior results in both pre-training from scratch and supervised fine-tuning. Specifically, on 8B-parameter models we observe performance gains peaking at $8.8\%$ and $5.8\%$, respectively, confirming the scalability and effectiveness of our framework in large-scale masked diffusion language model regimes.
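To make the setting concrete, the following is a minimal sketch of a $k$-parity example and of the random masking step that a masked diffusion objective trains against. It is not the paper's implementation: the sequence layout (input bits followed by the parity label), the `MASK` sentinel, the helper names, and the uniform draw of the mask probability are all illustrative assumptions. The abstract states only that the distribution of the mask probability is optimized, not which distribution is used.

```python
import random

MASK = -1  # hypothetical sentinel marking a masked position


def make_parity_example(n_bits, relevant, rng):
    """Sample n_bits uniform bits; the label is the XOR of the bits at `relevant`."""
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    label = 0
    for i in relevant:
        label ^= bits[i]
    return bits, label


def sample_mask_prob(rng, low=0.0, high=1.0):
    """Draw a mask probability. Uniform on [low, high] is an assumption;
    this is the distribution one would tune in the paper's framing."""
    return rng.uniform(low, high)


def mask_sequence(tokens, mask_prob, rng):
    """Independently replace each token with MASK with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


rng = random.Random(0)
relevant = [0, 3, 5]                      # k = 3 relevant positions (assumed)
bits, label = make_parity_example(8, relevant, rng)
sequence = bits + [label]                 # assumed layout: bits, then parity
mask_prob = sample_mask_prob(rng)
noisy = mask_sequence(sequence, mask_prob, rng)
```

In this sketch a model would be trained to predict the original tokens at the positions where `noisy` holds `MASK`. Narrowing `low` and `high` in `sample_mask_prob` is one simple way to experiment with how much of the sequence is hidden during training.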
