2602.02493v1 Feb 02, 2026 cs.CV

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Ruihan Xu

Zehong Ma

Shiliang Zhang

Abstract

Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks that VAEs introduce in two-stage latent diffusion. However, high-dimensional pixel manifolds contain many perceptually irrelevant signals and are hard to optimize, so existing pixel diffusion methods lag behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses that guide the diffusion model toward learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and it scales favorably to large-scale text-to-image generation, reaching a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Code is publicly available at https://github.com/Zehong-Ma/PixelGen.
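As a rough illustration of the idea of adding two complementary feature-space losses to a pixel-space objective, here is a minimal toy sketch in plain Python. It is not the PixelGen implementation. The abstract does not give the exact objective or loss weights, so the weighted-sum form, the names `combined_loss`, `w_local`, and `w_global`, and the two stand-in feature functions are all assumptions made for illustration. Real LPIPS and DINO features come from pretrained networks, which this sketch replaces with trivial hand-written functions.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
FeatureFn = Callable[[Vector], Vector]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def feature_loss(pred: Vector, target: Vector, features: FeatureFn) -> float:
    """Distance measured in a feature space instead of raw pixel space."""
    return mse(features(pred), features(target))


def local_features(x: Vector) -> Vector:
    # Hypothetical stand-in for an LPIPS-style network: differences between
    # neighbouring values, which respond to local patterns such as edges.
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def global_features(x: Vector) -> Vector:
    # Hypothetical stand-in for a DINO-style encoder: one global summary value.
    return [sum(x) / len(x)]


def combined_loss(
    pred: Vector,
    target: Vector,
    w_local: float = 1.0,
    w_global: float = 1.0,
) -> float:
    """Pixel-space term plus two weighted feature-space terms.

    The weighted-sum form and the default weights are assumptions; the
    pixel-space MSE merely stands in for the base diffusion objective.
    """
    return (
        mse(pred, target)
        + w_local * feature_loss(pred, target, local_features)
        + w_global * feature_loss(pred, target, global_features)
    )


if __name__ == "__main__":
    pred = [0.0, 1.0, 0.0, 1.0]
    target = [0.0, 0.0, 1.0, 1.0]
    print(combined_loss(pred, target))
```

In this toy input the two signals have the same mean, so the global term is zero, while the local term is large because their neighbouring differences disagree. This shows in miniature why a local and a global term can be complementary.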
