2604.01545v1 Apr 02, 2026 cs.AI

RAE-AR: A Study on Improving Autoregressive Models with Representation Autoencoders

RAE-AR: Taming Autoregressive Models with Representation Autoencoders

Nan Duan (Citations: 24, h-index: 2)

Zeyue Xue (Citations: 294, h-index: 5)

Haoyang Huang (Citations: 344, h-index: 5)

Hang Xu (Citations: 8, h-index: 2)

Jie Huang (Citations: 24, h-index: 2)

Feng Zhao (Citations: 1, h-index: 1)

Hu Yu (Citations: 160, h-index: 5)

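Latent spaces for generative modeling have long been dominated by VAE encoders. Latents extracted from pretrained representation encoders (e.g., DINO, SigLIP, MAE) were previously considered unsuitable for generative modeling. The recent RAE method offers hope, showing that representation autoencoders can achieve performance competitive with VAE encoders. However, integrating representation autoencoders into continuous autoregressive (AR) models has not yet been widely studied. In this work, we investigate the challenges of using high-dimensional representation autoencoders within the autoregressive paradigm, a setting we call RAE-AR. Analyzing the distinctive properties of AR models, we identify two main hurdles: complex token-wise distribution modeling, and a large training-inference gap (exposure bias) caused by high-dimensional data. To address these, we introduce token simplification via distribution normalization, which reduces modeling difficulty and improves convergence. We also inject Gaussian noise during training to make predictions more robust and to mitigate exposure bias. Experiments show that these modifications substantially narrow the performance gap, allowing representation autoencoders to achieve results comparable to conventional VAEs in AR models. This work lays the groundwork for a more unified architecture across visual understanding and generative modeling.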

Original Abstract

The latent space of generative modeling has long been dominated by the VAE encoder. Latents from pretrained representation encoders (e.g., DINO, SigLIP, MAE) were previously considered inappropriate for generative modeling. Recently, the RAE method has offered hope, revealing that a representation autoencoder can achieve performance competitive with the VAE encoder. However, the integration of representation autoencoders into continuous autoregressive (AR) models remains largely unexplored. In this work, we investigate the challenges of employing high-dimensional representation autoencoders within the AR paradigm, denoted as RAE-AR. We focus on the unique properties of AR models and identify two primary hurdles: complex token-wise distribution modeling and a training-inference gap (exposure bias) that is amplified by high dimensionality. To address these, we introduce token simplification via distribution normalization to ease modeling difficulty and improve convergence. Furthermore, we enhance prediction robustness by incorporating Gaussian noise injection during training to mitigate exposure bias. Our empirical results demonstrate that these modifications substantially bridge the performance gap, enabling representation autoencoders to achieve results comparable to traditional VAEs on AR models. This work paves the way for a more unified architecture across visual understanding and generative modeling.
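
The two remedies named in the abstract, distribution normalization and Gaussian noise injection, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which statistics are normalized, what noise scale is used, or where the noise is applied. The sketch assumes per-channel standardization across tokens and noise added to the context tokens an AR model conditions on during training. The function names `normalize_tokens` and `inject_noise`, the value `sigma=0.1`, and the toy latent shape are all hypothetical.

```python
import random
import statistics


def normalize_tokens(tokens, eps=1e-6):
    """Standardize each latent channel to roughly zero mean and unit variance.

    Assumption: statistics are computed per channel across all tokens.
    """
    channels = list(zip(*tokens))  # transpose: one tuple of values per channel
    means = [statistics.fmean(c) for c in channels]
    stds = [statistics.pstdev(c) for c in channels]
    normalized = [
        [(x - m) / (s + eps) for x, m, s in zip(tok, means, stds)]
        for tok in tokens
    ]
    return normalized, means, stds


def inject_noise(tokens, sigma, rng):
    """Add i.i.d. Gaussian noise to every latent value (training-time only)."""
    return [[x + rng.gauss(0.0, sigma) for x in tok] for tok in tokens]


# Toy stand-in for encoder latents: 256 tokens with 4 channels each.
rng = random.Random(0)
latents = [[rng.gauss(5.0, 3.0) for _ in range(4)] for _ in range(256)]

normalized, means, stds = normalize_tokens(latents)

# Hypothetical noise scale; the model would be trained to predict the clean
# next token while conditioning on this perturbed context.
noisy_context = inject_noise(normalized, sigma=0.1, rng=rng)
```

Perturbing the conditioning tokens during training is one common way to expose a model to imperfect inputs like those it sees at inference time, when it must condition on its own earlier predictions.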

0 Citations

0 Influential

2.5 Altmetric

12.5 Score
