
BitDance: Scalable Binary AR Image Synthesis

Updated 17 February 2026
  • BitDance is a scalable autoregressive generative model that leverages high-entropy binary tokens to overcome representational and computational bottlenecks in image synthesis.
  • It employs a continuous-space binary diffusion head, parameterized by a lightweight DiT-style U-Net, in place of an intractable softmax, enabling accurate token prediction and significant speed improvements.
  • Next-patch parallel decoding allows for simultaneous generation of image patches, enhancing inference efficiency and producing state-of-the-art image quality.

BitDance is a scalable autoregressive (AR) generative model for images that operates on binary visual tokens instead of traditional discrete codebook indices. Designed to address the representational and computational bottlenecks of prior AR architectures, BitDance employs a binary latent space of extremely high entropy, a continuous-space binary diffusion head for token prediction, and a novel next-patch parallel decoding method to enable efficient and expressive image synthesis across class-conditional and text-to-image regimes. These innovations lead to state-of-the-art performance in both quality and inference speed on high-resolution benchmarks, with notable parameter efficiency (Ai et al., 15 Feb 2026).

1. Binary Latent Representation

BitDance encodes images into grids of high-entropy binary tokens. For an input image $I \in \mathbb{R}^{H \times W \times 3}$, a convolutional encoder $E_\phi$ produces latents:

$$X = E_\phi(I) \in \mathbb{R}^{\frac{H}{p}\times \frac{W}{p}\times d},$$

where $p$ (e.g., 16 or 32) is the spatial downsampling factor and $d=256$ is the binary channel width. Each spatial cell thus represents a $d$-bit vector $x \in \mathbb{R}^d$.

A lookup-free quantization (LFQ) procedure applies a channel-wise sign operation:

$$x_q = \mathrm{sign}(x) \in \{-1, 1\}^d,$$

mapping values to binary codes (interpreting $-1$ as 0 and $+1$ as 1), so a single token space has $2^d = 2^{256}$ distinct possible patterns. This exponentially expands the effective vocabulary size over codebook-index tokenizers. To avoid codebook collapse, BitDance maximizes code entropy across training batches with a group-wise entropy loss, promoting uniform usage of available codes and maximizing representational expressivity.
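
The sign quantization above can be sketched in a few lines. This is a minimal illustration of the LFQ binarization step, not the paper's implementation; the function names are hypothetical, and the straight-through gradient trick used in training is omitted:

```python
def lfq_quantize(x):
    """Lookup-free quantization: channel-wise sign, mapping each
    latent value to {-1.0, +1.0} (illustrative sketch)."""
    return [1.0 if v >= 0 else -1.0 for v in x]

def to_bits(xq):
    """Interpret -1 as bit 0 and +1 as bit 1."""
    return [(1 + int(v)) // 2 for v in xq]

# A d-dimensional latent becomes a d-bit binary code.
code = to_bits(lfq_quantize([0.3, -1.2, 0.07, -0.5]))
```

With $d = 256$ channels, each token is one of $2^{256}$ such bit patterns, which is why no explicit codebook lookup is needed.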

2. Binary Diffusion Head

Standard softmax classification is intractable for such a token space (a $2^{256}$-way categorical). BitDance replaces index classification with a continuous-space diffusion process over the vertices of a $d$-dimensional hypercube.

Forward (Noising) Process:

$$x_t = t x_0 + (1-t)\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0,1].$$

Reverse (Denoising) Process:

A learnable velocity field $v_\theta(x_t, t, z)$, conditioned on transformer hidden states $z$, matches the instantaneous flow toward binary codes:

$$\mathcal{L}_{\mathrm{denoise}} = \mathbb{E}_{t,x_0,\epsilon} \left\| v_\theta(x_t,t,z) - (x_0 - \epsilon) \right\|^2,$$

where

$$v_\theta(x_t,t,z) = \frac{f_\theta(x_t,t,z) - x_t}{1-t},$$

and $f_\theta$ is a small DiT-style U-Net. Inference proceeds by discretizing $[0,1]$ into $N$ steps (usually 10–20), applying Euler integration and projecting the endpoint to $\{-1,1\}^d$.
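
The Euler-integration sampling loop can be sketched as follows. This is an illustrative simplification: `f_theta` stands in for the DiT-style U-Net (here just any callable returning a clean-code estimate), and the transformer conditioning $z$ is folded into it:

```python
import random

def sample_binary_token(f_theta, d, n_steps=10):
    """Euler integration of the learned velocity field from t=0 (noise)
    to t=1 (data), then projection onto {-1,+1}^d.

    f_theta(x_t, t) is assumed to return an estimate of the clean code
    x_0; the velocity is v = (f_theta(x_t, t) - x_t) / (1 - t)."""
    x = [random.gauss(0.0, 1.0) for _ in range(d)]  # start from N(0, I)
    dt = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        x0_hat = f_theta(x, t)
        v = [(a - b) / (1.0 - t) for a, b in zip(x0_hat, x)]  # velocity
        x = [b + dt * vi for b, vi in zip(x, v)]              # Euler step
        t += dt
    # Project the endpoint to the hypercube vertices {-1, +1}^d.
    return [1.0 if v >= 0 else -1.0 for v in x]
```

With a perfect predictor, the trajectory lands exactly on the target vertex; in practice 10–20 steps suffice, per the results reported below.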

This binary diffusion head enables tractable, accurate prediction of highly expressive binary tokens, supplanting conventional softmax heads for high-cardinality spaces.

3. Next-Patch Diffusion Decoding

To accelerate autoregressive inference, BitDance predicts entire $p \times p$ latent patches in parallel rather than individual tokens, using a raster-scan ordering and block-causal attention within the transformer. For $M$ patches $X_1,\ldots,X_M$:

$$p(X_1,\ldots,X_M) = \prod_{m=1}^M p(X_m \mid X_{<m}),$$

Each patch $X_m \in \mathbb{R}^{p^2 \times d}$ is generated jointly via the diffusion head, with attention masking so tokens within a patch attend to each other but not to future patches. This next-patch decoding preserves the AR factorization while allowing highly parallel inference, eliminating the independence assumptions that limited prior parallel AR strategies.
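
A minimal sketch of the block-causal attention pattern described above (the function name and boolean-matrix representation are illustrative; real implementations use additive masks on attention logits):

```python
def block_causal_mask(num_patches, tokens_per_patch):
    """Boolean attention mask for next-patch decoding: token i may
    attend to token j iff j's patch index <= i's patch index, i.e.
    full attention within a patch, causal across patches."""
    n = num_patches * tokens_per_patch
    return [[(j // tokens_per_patch) <= (i // tokens_per_patch)
             for j in range(n)]
            for i in range(n)]
```

With `tokens_per_patch = 1` this reduces to an ordinary causal mask, recovering single-token AR decoding as a special case.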

The core per-patch generation loop initializes $X_{m,0} \sim \mathcal{N}(0,I)$ and, over $N$ diffusion steps, iteratively refines these latents before quantization to binary codes.

4. Architecture and Implementation

BitDance utilizes a causal transformer backbone with a variable number of layers ($L = 24, 32, 40$), hidden dimension ($D = 768, 1024, 1280$), and self-attention heads ($H = D/64$). Block-causal masking implements the raster-patch ordering.

Position encoding combines 1D rotary embeddings (RoPE) for sequence positions and 2D sinusoids for spatial coordinates. The binary diffusion head $f_\theta$ is a lightweight DiT-style U-Net (6–12 blocks), conditioned on latent representations and the transformer output.

Model scale varies:

  • “B”: 260M parameters
  • “L”: 527M parameters
  • “H”: 1B parameters (class-conditional)
  • Text-to-image systems extend up to 14B parameters.
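
The scale variants above can be summarized as a configuration table; the dictionary name and field names below are hypothetical, but the numbers follow directly from the $L$, $D$, $H = D/64$ values stated in this section:

```python
# Hypothetical config table for the class-conditional variants above.
# Head counts follow the stated rule H = D / 64.
BITDANCE_CONFIGS = {
    "B": {"layers": 24, "hidden": 768,  "heads": 768 // 64},   # ~260M params
    "L": {"layers": 32, "hidden": 1024, "heads": 1024 // 64},  # ~527M params
    "H": {"layers": 40, "hidden": 1280, "heads": 1280 // 64},  # ~1B params
}
```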

5. Experimental Results and Scaling Properties

On ImageNet $256 \times 256$ (class-conditional), BitDance-H (1B parameters, single-token AR) achieves FID = 1.24 and IS ≈ 304.4, outperforming previous AR raster-scan models. BitDance-B-4× (260M parameters, patch size $2 \times 2$) achieves FID = 1.68 at 24.2 images/s throughput, significantly exceeding the 5.17 img/s of prior 1.4B-parameter models. Next-patch diffusion yields up to an $8.7\times$ speedup versus state-of-the-art parallel AR.

On $1024 \times 1024$ text-to-image generation, training involved ~450M image–text pairs using a pre-training/continued-training/supervised fine-tuning (PT/CT/SFT) pipeline. On DPG-Bench, BitDance attains 88.28 overall; its GenEval score is 0.86, OneIG-EN is 0.532, and OneIG-ZH is 0.512. Inference latency is 12.4 s on a single H100 GPU (versus 53.2 s for GLM-Image and 402 s for NextStep-1), a speedup of over $30\times$ against direct AR baselines.

Ablations indicate that binary ($2^{256}$) tokenization substantially improves FID (from ≈3–5 to 1.79) over continuous VAE approaches. Diffusion heads greatly outperform bit-wise softmax replacements (FID ≈ 8.4 versus 1.79), and block-masked patch ordering yields additional quality gains (~0.2 FID). Near-optimal FID is obtained with 10–20 diffusion steps.

6. Practical Considerations

BitDance's binary tokens are stored as single bits for memory efficiency; activations in the diffusion head dominate memory use. In AR single-token mode, inference entails $N_\mathrm{steps} \times M$ diffusion updates, while next-patch mode executes parallel updates for $p^2$ tokens and can achieve ≈90 images/second at FID = 1.91 (with $p=4$); the distilled text-to-image model with $p=8$ processes 64 tokens per step.
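
The single-bit storage claim can be made concrete with a simple packing routine (illustrative; a real system would use vectorized bit-packing). A $d = 256$-bit token occupies 32 bytes instead of 256 floats:

```python
def pack_bits(bits):
    """Pack a list of {0, 1} bits into bytes, MSB first, 8 bits per
    byte. Assumes len(bits) is a multiple of 8, as with d = 256."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)
```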

The visual tokenizer is trained for 400K steps on DataComp-1B and domain-specific datasets. Pre-training, continuation, and supervised fine-tuning involve 256M, 99M, and 92M samples respectively, with up to 30M for distillation. Mixed-resolution training (256/512/1024 px) stabilizes convergence.

Training tricks such as group-wise LFQ, the entropy loss, a high EMA decay (0.9999), token dropout (0.1, for classifier-free guidance), and a constant learning rate further improve robustness.
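
For concreteness, the EMA weight averaging mentioned above follows the standard update rule (a generic sketch, not BitDance-specific code; the function name is illustrative):

```python
def ema_update(ema_params, params, decay=0.9999):
    """One exponential-moving-average step over model weights:
    ema <- decay * ema + (1 - decay) * current. A decay of 0.9999
    averages over roughly the last 10,000 training steps."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, params)]
```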

BitDance's primary advances (the $2^{256}$ entropy-maximized binary tokenizer, the continuous diffusion head for discrete binary code prediction, and next-patch parallel AR decoding) jointly deliver high-fidelity and highly efficient image synthesis across large-scale class-conditional and multimodal text-to-image tasks (Ai et al., 15 Feb 2026).
