2602.14041v1 Feb 15, 2026 cs.CV

BitDance: A Scalable Autoregressive Generative Model Using Binary Tokens

BitDance: Scaling Autoregressive Generative Models with Binary Tokens

Shaobin Zhuang (Citations: 468, h-index: 9)
Yuang Ai (Citations: 416, h-index: 10)
Jiaming Han (Citations: 194, h-index: 7)
Weijia Mao (Citations: 4, h-index: 1)
Zhenheng Yang (Citations: 49, h-index: 3)
Huaibo Huang (Citations: 22, h-index: 2)
Xiangyu Yue (Citations: 650, h-index: 9)
Hao Chen (Citations: 8, h-index: 2)
Xuefeng Hu (Citations: 49, h-index: 3)
Ziyan Yang (Citations: 559, h-index: 12)

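This paper presents BitDance, a scalable autoregressive (AR) image generation model that predicts binary visual tokens. BitDance uses high-entropy binary latents so that each token can represent up to $2^{256}$ states, giving a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To address this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it uses continuous-space diffusion to generate the binary tokens. The paper also proposes next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance outperforms state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and running 8.7x faster. For text-to-image generation, BitDance is trained on large-scale multimodal tokens and efficiently generates high-resolution, photorealistic images, showing strong performance and scalability. When generating 1024x1024 images, BitDance is more than 30x faster than prior AR models. The authors release code and models to encourage further research on AR foundation models. Code and models are available at: https://github.com/shallowdream204/BitDance.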

Original Abstract

We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to $2^{256}$ states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: https://github.com/shallowdream204/BitDance.
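
To make the numbers in the abstract concrete, the following Python sketch works through the size of a 256-bit token space and the quoted parameter ratio. It is only an illustration under stated assumptions: the sign-based `binarize` rule, the Gaussian stand-in for a diffusion-head output, and the helper names `num_states`, `binarize`, and `token_to_index` are hypothetical and are not taken from the BitDance code.

```python
import math
import random

# From the abstract: each token can represent up to 2**256 states.
BITS_PER_TOKEN = 256


def num_states(bits: int) -> int:
    """Number of distinct values a token with `bits` binary channels can take."""
    return 2 ** bits


def binarize(latent: list[float]) -> list[int]:
    """Map a continuous vector to a binary token with entries in {-1, +1}.

    Assumption: quantization by sign. The abstract does not state the rule.
    """
    return [1 if x >= 0 else -1 for x in latent]


def token_to_index(token: list[int]) -> int:
    """Integer index this token would have in an explicit codebook."""
    index = 0
    for bit in token:
        index = (index << 1) | (1 if bit > 0 else 0)
    return index


# Stand-in for a continuous-space sample; a real diffusion head would produce this.
rng = random.Random(0)
continuous = [rng.gauss(0.0, 1.0) for _ in range(BITS_PER_TOKEN)]

token = binarize(continuous)
index = token_to_index(token)

# A softmax classifier would need one logit per state, which is infeasible here.
exponent = math.floor(BITS_PER_TOKEN * math.log10(2))
print(f"states per token: on the order of 10^{exponent}")

# Parameter ratio quoted in the abstract: 1.4B versus 260M.
param_ratio = 1.4e9 / 260e6
print(f"parameter ratio: {param_ratio:.1f}x")
```

The point of the sketch is that an explicit codebook index for such a token is an integer with up to 256 bits, so a softmax over all indices cannot be materialized, which is the motivation the abstract gives for generating the bits in continuous space instead.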

4 Citations

0 Influential

55.674470978098 Altmetric

282.4 Score

