- The paper introduces a hybrid discrete-continuous tokenization method that enhances reconstruction fidelity and computational efficiency for 1024x1024 image generation.
- It leverages a scalable autoregressive transformer to predict discrete tokens combined with a lightweight residual diffusion module for fine detail recovery.
- The model achieves competitive generation quality while reducing computational cost, with up to 13.4x fewer MACs and 5.9x lower latency compared to diffusion models.
Overview of HART
HART (Hybrid Autoregressive Transformer) is an autoregressive (AR) model designed for high-resolution visual generation, specifically targeting 1024x1024 image synthesis (arXiv 2410.10812). It addresses two key limitations of prior AR visual generation models: the suboptimal image reconstruction quality of purely discrete tokenizers, and the substantial computational cost of training transformers for high-resolution outputs. HART introduces a hybrid discrete-continuous representation coupled with a lightweight residual refinement module, achieving generation quality comparable to state-of-the-art diffusion models while delivering significant efficiency gains in throughput, latency, and MACs.
Hybrid Tokenization Strategy
A core component of HART is its hybrid tokenizer, which aims to overcome the reconstruction fidelity limitations observed in discrete VQ-based tokenizers such as VQGAN or those used in VAR. The HART tokenizer encodes an image x into continuous latent features z = E(x) using a CNN encoder E. These continuous latents z are then decomposed into two parts (a minimal sketch follows this list):
- Discrete Tokens (z_d): These are obtained using multi-scale vector quantization, similar to the approach in VAR, capturing the global structure and semantic content of the image. The quantization process maps patches of the continuous latent features to discrete codes from a learned codebook.
- Continuous Residual Tokens (z_r): These represent the difference between the original continuous latents and the de-quantized discrete tokens, i.e., z_r = z − D_q(z_d), where D_q is the de-quantization mapping. The residual tokens capture the fine-grained details, textures, and high-frequency information lost during quantization.
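To make the decomposition concrete, here is a minimal sketch under simplifying assumptions: a single-scale nearest-neighbor codebook stands in for HART's multi-scale VQ, and all names (`HybridTokenizerSketch`, `quantize`) are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class HybridTokenizerSketch(nn.Module):
    """Toy hybrid split: continuous latents -> discrete codes + continuous residual.

    Single-scale nearest-neighbor codebook for illustration; HART itself uses
    multi-scale vector quantization in the style of VAR.
    """
    def __init__(self, latent_dim: int = 32, codebook_size: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def quantize(self, z: torch.Tensor):
        # z: (B, N, D) continuous latents from the CNN encoder E.
        # Squared L2 distance of every latent vector to every codebook entry.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, N, K)
        z_d = dists.argmin(dim=-1)           # discrete token ids, shape (B, N)
        z_d_dequant = self.codebook(z_d)     # D_q(z_d), back in continuous space
        z_r = z - z_d_dequant                # residual: fine detail, high frequency
        return z_d, z_d_dequant, z_r

tok = HybridTokenizerSketch()
z = torch.randn(2, 64, 32)                   # stand-in for z = E(x)
z_d, z_d_dequant, z_r = tok.quantize(z)
assert torch.allclose(z_d_dequant + z_r, z)  # decomposition is exact by construction
```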
The tokenizer's decoder G is trained to reconstruct the original image x̂ = G(z_rec) from a reconstructed latent representation z_rec. Crucially, training employs an alternating strategy: in each step, with 50% probability the decoder reconstructs from the discrete tokens only (z_rec = D_q(z_d)), and with 50% probability from the full hybrid representation (z_rec = D_q(z_d) + z_r). This forces the decoder to handle both representations and keeps the discrete and continuous components closely aligned. The hybrid approach yields a substantial improvement in reconstruction quality, achieving a reconstruction FID (rFID) of 0.30 on MJHQ-30K at 1024px, compared to 2.11 for a discrete-only VAR baseline and approaching the 0.27 rFID of the continuous VAE used in SDXL.
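The alternating objective amounts to a coin flip over the decoder's input each training step; a minimal sketch (function name hypothetical):

```python
import torch

def pick_decoder_input(z_d_dequant: torch.Tensor, z_r: torch.Tensor,
                       p_hybrid: float = 0.5) -> torch.Tensor:
    """Choose the decoder input for one training step.

    With probability p_hybrid, reconstruct from the full hybrid latent
    D_q(z_d) + z_r; otherwise from the discrete part alone. Training the
    decoder G on both paths keeps the two representations aligned.
    """
    if torch.rand(()).item() < p_hybrid:
        return z_d_dequant + z_r   # hybrid path: fine details included
    return z_d_dequant             # discrete-only path

# Per-step usage inside the tokenizer training loop (schematic):
z_d_dequant, z_r = torch.randn(2, 64, 32), torch.randn(2, 64, 32)
z_rec = pick_decoder_input(z_d_dequant, z_r)
# x_hat = G(z_rec); loss = reconstruction_loss(x_hat, x) + quantizer_losses
```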
Model Architecture and Training
The generative process in HART involves two main modeling components:
- Scalable-Resolution Autoregressive Transformer: This transformer models the sequence of discrete tokens z_d autoregressively, conditioned on input text prompts. It builds upon the VAR architecture, predicting discrete tokens scale by scale, from coarse to fine. To handle varying image resolutions efficiently, it incorporates relative position embeddings (sinusoidal step embeddings and 1D/2D rotary embeddings), allowing the model to be pre-trained at a lower resolution (e.g., 512x512) and subsequently fine-tuned for higher resolutions (e.g., 1024x1024) without training from scratch. This significantly reduces the training cost of high-resolution generation.
- Residual Diffusion Module: After the AR transformer generates the discrete tokens z_d, a lightweight diffusion module predicts the continuous residual tokens z_r. This module consists primarily of MLP blocks and has only 37M parameters. It is conditioned on the final hidden states of the AR transformer and the predicted discrete tokens z_d. Because it models only the residual component, which has lower variance and complexity than the full continuous latent space, it needs very few diffusion sampling steps (e.g., 8) for effective prediction, making it computationally inexpensive at inference.
The final continuous latent representation passed to the decoder G for image synthesis is the sum of the two predictions: z_gen = D_q(z_d) + z_r, where z_d comes from the AR transformer and z_r from the diffusion module. A schematic sketch of this two-stage inference follows.
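In the sketch below, every module is a stand-in callable: the real AR transformer is multi-scale and text-conditioned, and the exact denoising update is not reproduced here.

```python
import torch

@torch.no_grad()
def generate_sketch(ar_model, diffusion_step, dequantize, decoder,
                    prompt_emb, num_steps: int = 8):
    """Schematic two-stage HART-style inference.

    1. AR transformer predicts discrete tokens z_d (global structure) and
       exposes its final hidden states h for conditioning.
    2. A few denoising steps of the lightweight diffusion module turn noise
       into the continuous residual z_r, conditioned on h and z_d.
    3. Decode the hybrid latent z_gen = D_q(z_d) + z_r into the image.
    """
    z_d, h = ar_model(prompt_emb)
    z_d_dequant = dequantize(z_d)
    z_r = torch.randn_like(z_d_dequant)          # residual starts as noise
    for t in reversed(range(num_steps)):
        z_r = diffusion_step(z_r, t, h, z_d)     # one denoising update (schematic)
    return decoder(z_d_dequant + z_r)            # x_hat = G(z_gen)

# Toy instantiation just to exercise the control flow:
B, N, D, K = 1, 16, 8, 512
emb = torch.randn(K, D)
image = generate_sketch(
    ar_model=lambda p: (torch.randint(0, K, (B, N)), torch.randn(B, N, D)),
    diffusion_step=lambda z_r, t, h, z_d: 0.9 * z_r,  # dummy denoiser
    dequantize=lambda ids: emb[ids],                  # embedding lookup as D_q
    decoder=lambda z: z,                              # identity stand-in for G
    prompt_emb=None,
)
```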
Training proceeds in three stages: first the hybrid tokenizer is trained, then the AR transformer is trained to predict the discrete tokens z_d, and finally the residual diffusion module is trained to predict z_r conditioned on the outputs of the AR transformer. A toy sketch of the third stage follows.
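As a rough illustration of stage three, the toy step below fits a small MLP denoiser on residuals; the noising schedule, conditioning, and dimensions are simplified stand-ins, not HART's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stage-3 step: fit a small MLP denoiser on residuals z_r, conditioned on
# AR hidden states h. Plain epsilon-prediction with a simple linear noising
# schedule; HART's exact parameterization and schedule may differ.
D = 8
denoiser = nn.Sequential(nn.Linear(2 * D + 1, 64), nn.SiLU(), nn.Linear(64, D))
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

z_r = torch.randn(16, D)             # residual targets from the frozen tokenizer
h = torch.randn(16, D)               # conditioning: AR transformer hidden states

t = torch.rand(16, 1)                # noise level in [0, 1]
eps = torch.randn_like(z_r)
z_noisy = (1 - t) * z_r + t * eps    # interpolate clean residual toward noise
pred = denoiser(torch.cat([z_noisy, h, t], dim=-1))
loss = F.mse_loss(pred, eps)         # learn to predict the injected noise
loss.backward()
opt.step()
```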
Performance and Efficiency
HART demonstrates strong performance across various benchmarks, challenging leading diffusion models in generation quality while offering substantial efficiency gains.
- Generation Quality (1024x1024):
  - On MJHQ-30K, HART achieves an FID of 5.38 and a CLIP score of 29.09. This FID represents a 31% improvement over the discrete-only VAR baseline (FID 7.85) and is competitive with diffusion models such as PixArt-Σ (FID 6.34, CLIP 27.62), Playground v2.5 (FID 6.84, CLIP 28.58), and SDXL (FID 8.76, CLIP 28.60).
  - On GenEval and DPG-Bench, HART (732M parameters) achieves scores (0.56 / 80.89) comparable to diffusion models with under 2B parameters.
- Class-Conditioned Generation (ImageNet 256x256):
  - HART (2.0B parameters) achieves FID 1.77 and IS 330.3, outperforming MAR-L (FID 1.78, IS 296.0).
- Efficiency (1024x1024 generation on an A100 GPU):
  - MACs: HART requires 12.5T MACs, 6.9x to 13.4x fewer than SDXL (86.2T), PixArt-Σ (168T+), and SD3-medium (239T).
  - Latency: HART generates an image in 0.75 seconds, 3.1x to 5.9x faster than SDXL (2.3s), PixArt-Σ (4.4s), and SD3-medium (4.4s).
  - Throughput (batch size 8): HART reaches 2.23 images/sec, 4.5x to 7.7x higher than the compared diffusion models (0.29-0.49 images/sec).
These efficiency gains stem from the AR formulation, which is amenable to KV caching and parallel decoding techniques, and from the very small residual diffusion module that needs only a few sampling steps, in sharp contrast to the many-step iterative denoising of standard diffusion models. The KV-cache pattern is sketched below.
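The KV-cache pattern that makes AR decoding cheap is standard practice; here is a generic single-head sketch, not HART-specific code:

```python
import torch

def cached_attention_step(q, k_new, v_new, cache):
    """One decoding step of single-head attention with a KV cache (generic).

    Keys/values for past tokens are computed once and appended, so each new
    step attends over the whole prefix without recomputing it.
    """
    cache["k"] = k_new if cache["k"] is None else torch.cat([cache["k"], k_new], dim=1)
    cache["v"] = v_new if cache["v"] is None else torch.cat([cache["v"], v_new], dim=1)
    scores = q @ cache["k"].transpose(-2, -1) / cache["k"].size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ cache["v"]

cache = {"k": None, "v": None}
for step in range(4):                        # pretend 4 AR decoding steps
    q = torch.randn(1, 1, 8)                 # query for the newest token only
    out = cached_attention_step(q, torch.randn(1, 1, 8), torch.randn(1, 1, 8), cache)
```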
Implementation Considerations
The practical implementation of HART benefits from several architectural choices and optimizations:
- The hybrid tokenizer effectively balances reconstruction quality and suitability for AR modeling.
- The use of relative position embeddings in the AR transformer enables efficient scaling to high resolutions via fine-tuning, avoiding prohibitive full-resolution training from scratch (see the rotary-embedding sketch after this list).
- The residual diffusion module's small size (37M parameters) and minimal step requirement (8 steps) contribute significantly to low inference latency and MACs.
- Inference leverages standard optimizations like KV caching for the AR transformer and potentially fused GPU kernels, further boosting speed.
- The codebase is open-sourced (https://github.com/mit-han-lab/hart), facilitating reproduction and adoption.
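To illustrate why relative (rotary) position embeddings enable resolution transfer, here is a generic 1D RoPE sketch, assuming the standard formulation rather than HART's exact variant:

```python
import torch

def rope_1d(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0):
    """Apply 1D rotary position embeddings (RoPE) to features x of shape (N, D).

    Positions enter only through relative rotations, so a model pre-trained on
    short (low-resolution) token sequences can be fine-tuned on longer
    (high-resolution) ones without re-learning absolute position tables.
    Generic RoPE; HART combines sinusoidal step embeddings with 1D/2D rotary.
    """
    d = x.size(-1)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = positions[:, None].float() * inv_freq[None, :]   # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                      # rotate each feature pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The same function serves a 16-token (low-res) and a 64-token (high-res) grid:
x_low = rope_1d(torch.randn(16, 32), torch.arange(16))
x_high = rope_1d(torch.randn(64, 32), torch.arange(64))
```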
Potential limitations include the inherently sequential nature of AR generation, although VAR-style multi-scale prediction, which HART inherits, decodes all tokens within a scale in parallel and mitigates this considerably. Additional engineering complexity lies in managing the three-stage training process (tokenizer, AR transformer, residual diffusion).
In conclusion, HART presents a viable and highly efficient alternative to diffusion models for high-resolution image generation. By introducing a hybrid discrete-continuous tokenization scheme and combining a scalable AR transformer with a lightweight residual diffusion module, it achieves competitive visual quality with significantly reduced computational requirements, marking a notable advancement in efficient generative modeling.