2605.05331v1 May 06, 2026 cs.CV

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

Animesh Sinha

Meta

Citations: 911

h-index: 10

Felix Juefei-Xu

Citations: 1,200

h-index: 12

Jingren Hou

Citations: 3

h-index: 1

S. Vishwanath

Citations: 0

h-index: 0

Philippe Hansen-Estruch

Citations: 1,215

h-index: 7

Jiahui Chen

Citations: 7

h-index: 1

V. Ramanujan

Citations: 2,193

h-index: 12

Orr Zohar

Citations: 773

h-index: 8

Yan Ping

Citations: 0

h-index: 0

Markos Georgopoulos

Citations: 1,418

h-index: 15

Edgar Schoenfeld

Citations: 0

h-index: 0

Ali K. Thabet

Citations: 690

h-index: 6

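Vision Transformer (ViT) autoencoders have emerged as an attractive approach to image tokenization, offering better reconstruction than convolutional tokenizers. However, existing ViT tokenizers degrade outside the resolutions they were trained on, and their reliance on adversarial losses makes stable scaling difficult. ViTok (Hansen-Estruch et al., 2025) found that the compression ratio r governs the trade-off between reconstruction and generation: lower r yields better reconstructions but makes generation harder. Improving tokenizer reconstruction is therefore key to developing more Pareto-optimal tokenizers. To address these limitations, this paper introduces ViTok-v2, which uses NaFlex for native resolution support to generalize across resolutions and aspect ratios, and a new DINOv3 perceptual loss that replaces both the LPIPS and GAN objectives to enable stable training at any scale. ViTok-v2 is trained on about 2 billion images and, at 5 billion parameters, is the largest image autoencoder to date. It matches or exceeds state-of-the-art reconstruction at 256p and outperforms all baselines at 512p and above. Joint scaling experiments with flow matching generators show that scaling the autoencoder and the generator together advances the Pareto frontier of the reconstruction-generation trade-off.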

Original Abstract

Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. However, existing ViT tokenizers cannot explore this landscape as performance degrades outside training resolutions, and reliance on adversarial losses prevents stable scaling. ViTok (Hansen-Estruch et al., 2025) found that the compression ratio r mediates a reconstruction-generation trade-off where lower r means better reconstructions but harder generations, so improving tokenizer reconstruction is key to more Pareto-optimal tokenizers. We introduce ViTok-v2, which addresses these limitations with native resolution support via NaFlex for generalization across resolutions and aspect ratios, and a novel DINOv3 perceptual loss that replaces both LPIPS and GAN objectives for stable training at any scale. ViTok-v2 is trained on about 2B images and scaled to 5B parameters, the largest image autoencoder to date. ViTok-v2 matches or exceeds state-of-the-art reconstruction at 256p and outperforms all baselines at 512p and above. In joint scaling experiments with flow matching generators, we show that scaling both the autoencoder and the generator advances the Pareto frontier of this trade-off.
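To make the role of the compression ratio r concrete, here is a minimal Python sketch. The abstract does not define r, so the definition used here (input pixel values divided by latent values) is an assumption, as are the function name `compression_ratio` and the parameters `patch_size` and `latent_channels`, which are hypothetical and not taken from the paper. Under this assumed definition, a lower r means the latent keeps more values per pixel, which is consistent with the abstract's claim that lower r gives better reconstructions.

```python
def compression_ratio(height, width, patch_size, latent_channels, image_channels=3):
    """Assumed definition of r: number of input values / number of latent values.

    All parameter names are illustrative; the paper's exact definition may differ.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size in this sketch")
    # One token per non-overlapping patch.
    num_tokens = (height // patch_size) * (width // patch_size)
    input_values = height * width * image_channels
    latent_values = num_tokens * latent_channels
    return input_values / latent_values


# Same patch size and latent width at two resolutions and aspect ratios.
r_square = compression_ratio(256, 256, patch_size=16, latent_channels=64)
r_wide = compression_ratio(512, 256, patch_size=16, latent_channels=64)
# A narrower latent compresses more aggressively.
r_narrow = compression_ratio(256, 256, patch_size=16, latent_channels=16)
print(r_square, r_wide, r_narrow)
```

With these assumptions r reduces to `image_channels * patch_size**2 / latent_channels`, so it stays fixed as resolution or aspect ratio changes while the token count grows with image area. That is one way to read why a tokenizer with native resolution support can hold r constant across image sizes.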

0 Citations

0 Influential

7.5 Altmetric

37.5 Score
