2605.05206v1 May 06, 2026 cs.CV

Taming Outlier Tokens in Diffusion Transformers

Chen Wei (Citations: 265, h-index: 2)

Xiaoyu Wu (Citations: 0, h-index: 0)

Yifei Wang (Citations: 45, h-index: 3)

Tsu-Jui Fu (Citations: 66, h-index: 4)

Liang-Chieh Chen (Citations: 69, h-index: 3)

Zhe Gan (Citations: 1, h-index: 1)

Original Abstract

We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
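As a rough illustration of what "high-norm tokens" means in practice, the sketch below flags tokens whose L2 norm is far above the median norm of the sequence. The function names, the median-ratio rule, and the `ratio` threshold are hypothetical choices made for this example; the abstract does not state the detection criterion the authors use, and this is not their method.

```python
import math
import statistics


def token_norms(tokens):
    """Return the L2 norm of each token (a token is a sequence of floats)."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def find_high_norm_tokens(tokens, ratio=3.0):
    """Return indices of tokens whose L2 norm exceeds `ratio` times the median norm.

    The median-ratio rule and the default `ratio` are illustrative assumptions,
    not the criterion used in the paper.
    """
    norms = token_norms(tokens)
    median = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > ratio * median]


# Toy example: five 4-dimensional patch tokens, one with an inflated norm.
tokens = [
    [0.5, -0.2, 0.1, 0.3],
    [0.4, 0.1, -0.3, 0.2],
    [12.0, -9.0, 7.5, 10.0],
    [0.3, 0.3, -0.1, 0.4],
    [-0.2, 0.5, 0.2, 0.1],
]
outliers = find_high_norm_tokens(tokens)
print(outliers)  # [2]
```

Note that, according to the abstract, simply masking tokens found this way does not improve generation, which is why the authors turn to register-based interventions rather than norm filtering.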
