Native-resolution Diffusion Transformer (NiT)

Updated 30 June 2025
  • Native-resolution Diffusion Transformer (NiT) is a generative model that produces images at any resolution by processing variable-length visual token sequences natively.
  • It introduces innovations like dynamic tokenization, packed processing via FlashAttention-2, and axial 2D rotary positional embedding for effective spatial encoding.
  • NiT achieves state-of-the-art performance on benchmarks while enabling robust zero-shot generalization to unseen resolutions and aspect ratios.

The Native-resolution Diffusion Transformer (NiT) is a generative model architecture designed to directly synthesize images at arbitrary resolutions and aspect ratios within a single, unified framework. NiT overcomes the limitations of traditional fixed-resolution diffusion models by natively handling variable-length visual token sequences and explicitly modeling intrinsic visual distributions across a diverse set of image sizes. This flexibility enables high-fidelity image generation at both standard and previously unseen resolutions and shapes, providing state-of-the-art performance on standard benchmarks as well as robust zero-shot generalization to new formats (Wang et al., 3 Jun 2025).

1. Architectural Foundations and Innovations

NiT departs from conventional transformer-based diffusion models, such as DiT, by removing the constraint of fixed-size, square inputs. The core architectural innovations include:

  • Dynamic Tokenization and Packing: Images are first compressed by a learned autoencoder into latent space, then patchified into tokens, with each patch corresponding to one token. For an image of size $H_i \times W_i$, the number of tokens is proportional to the area ($H_i W_i / p^2$ for patch size $p$), so every image yields a variable-length sequence.
  • Packed Processing via FlashAttention-2: To handle a batch containing images of arbitrary sizes efficiently, a "longest-pack-first" algorithm arranges the variable-length token sequences into a single packed batch that respects a global token budget $L$ (a sketch of this packing step follows this list). NiT uses FlashAttention-2 to compute attention efficiently across the packed batch.
  • Axial 2D Rotary Positional Embedding (2D RoPE): To encode position information independently of absolute grid size, each token is assigned axial 2D position embeddings. 2D RoPE gives the model explicit spatial coordinates, supporting relational reasoning at any resolution or aspect ratio (a sketch follows the design summary below).
  • Packed Adaptive Layer Normalization: Conditioning signals (e.g., timestep, class) are normalized per instance and broadcast to all token fragments belonging to that image within the packed batch, maintaining correct conditional signals during variable-length batch processing.
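
To make the packing step concrete, here is a minimal sketch of a first-fit-decreasing arrangement in the spirit of the "longest-pack-first" idea; the function names, the default patch and downsampling factors, and the budget value are illustrative assumptions rather than the paper's reference implementation.

```python
from typing import List

def token_count(h: int, w: int, patch: int = 2, vae_down: int = 8) -> int:
    """Number of patch tokens for an image of size h x w.

    Assumes an autoencoder that downsamples by `vae_down` and a patch size
    of `patch` in latent space (illustrative defaults).
    """
    return (h // (vae_down * patch)) * (w // (vae_down * patch))

def pack_longest_first(lengths: List[int], budget: int) -> List[List[int]]:
    """Greedy longest-first packing of variable-length sequences.

    Each pack holds indices of sequences whose total token count stays
    within the global token budget `budget`.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs: List[List[int]] = []
    loads: List[int] = []
    for i in order:
        placed = False
        for p, load in enumerate(loads):
            if load + lengths[i] <= budget:       # first existing pack with room
                packs[p].append(i)
                loads[p] += lengths[i]
                placed = True
                break
        if not placed:                            # otherwise open a new pack
            packs.append([i])
            loads.append(lengths[i])
    return packs

# Example: three images of different native sizes, budget of 4096 tokens.
sizes = [(512, 512), (256, 1024), (384, 640)]
lengths = [token_count(h, w) for h, w in sizes]   # [1024, 1024, 960]
print(pack_longest_first(lengths, budget=4096))   # all three fit in one pack
```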

This design eliminates any resizing, cropping, or bucketing—NiT operates on and generates outputs at the image's original native resolution and shape.
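
Axial 2D RoPE can be sketched as applying a standard 1D rotary embedding to the row coordinate on one half of each token's channels and to the column coordinate on the other half. This axial split is a common construction and is an assumption here; the paper's exact frequency schedule and channel layout may differ.

```python
import torch

def rope_1d(pos: torch.Tensor, dim: int, base: float = 10000.0):
    """Standard 1D rotary embedding: returns (cos, sin) of shape [N, dim // 2]."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos.float()[:, None] * freqs[None, :]
    return angles.cos(), angles.sin()

def apply_rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate channel pairs of x ([N, d]) by the given angles ([N, d // 2])."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def axial_2d_rope(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor):
    """Axial 2D RoPE: half the channels encode the row index, the other
    half the column index (illustrative split)."""
    d = x.shape[-1]
    cos_r, sin_r = rope_1d(rows, d // 2)
    cos_c, sin_c = rope_1d(cols, d // 2)
    x_row, x_col = x[..., : d // 2], x[..., d // 2 :]
    return torch.cat(
        [apply_rotary(x_row, cos_r, sin_r), apply_rotary(x_col, cos_c, sin_c)], dim=-1
    )

# Tokens from a 3x5 latent grid: positions depend only on row/column indices,
# never on a fixed canvas size.
rows, cols = torch.meshgrid(torch.arange(3), torch.arange(5), indexing="ij")
q = torch.randn(15, 64)                                  # 15 tokens, head dim 64
q_rot = axial_2d_rope(q, rows.flatten(), cols.flatten())
print(q_rot.shape)                                       # torch.Size([15, 64])
```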

2. Denoising Process at Arbitrary Resolutions

NiT implements a standard diffusion process with modifications for the native-resolution setting:

  • Forward (Noising) Process: For each image, random noise $\epsilon \sim \mathcal{N}(0, I)$ is added to the clean latent tokens: $x_t = \alpha_t x + \sigma_t \epsilon$, where $\alpha_t = 1 - t$, $\sigma_t = t$, and $t$ is drawn from a logit-normal schedule for flow-matching.
  • Per-Instance Application: The noise is applied consistently per image instance—each set of tokens in the packed sequence is noised according to its own sampled timestep.
  • Denoising via Packed Self-Attention: The packed token batch is processed by the NiT transformer, with attention and adaptive normalization restricted such that each image's variable-length token sequence is correctly isolated. The transformer predicts clean latent tokens, which are subsequently decoded to the original image space at the desired native resolution.

This denoising process equips the model to reconstruct images of any native format, learning scale- and shape-agnostic visual distributions.
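
As a minimal sketch of the forward process under this parameterization, the snippet below draws one logit-normal timestep per image and interpolates $x_t = (1 - t)\,x + t\,\epsilon$ on each image's own variable-length token tensor; the shapes, channel counts, and logit-normal parameters are illustrative assumptions.

```python
import torch

def sample_logit_normal_t(n: int, mean: float = 0.0, std: float = 1.0) -> torch.Tensor:
    """Draw timesteps t in (0, 1) from a logit-normal distribution."""
    return torch.sigmoid(torch.randn(n) * std + mean)

def noise_packed_latents(latents):
    """Per-instance flow-matching noising: x_t = (1 - t) * x + t * eps.

    `latents` holds one [num_tokens_i, channels] tensor per image, so each
    image in the pack keeps its own length and its own sampled timestep.
    """
    ts = sample_logit_normal_t(len(latents))
    noised = []
    for x, t in zip(latents, ts):
        eps = torch.randn_like(x)
        noised.append((1.0 - t) * x + t * eps)   # alpha_t = 1 - t, sigma_t = t
    return noised, ts

# Three images with different native resolutions -> different token counts.
latents = [torch.randn(1024, 4), torch.randn(960, 4), torch.randn(256, 4)]
noised, ts = noise_packed_latents(latents)
print([x.shape[0] for x in noised], ts.shape)    # [1024, 960, 256] torch.Size([3])
```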

3. Performance Benchmarks and Empirical Findings

NiT establishes state-of-the-art results on standard image generation tasks using a single model:

| Model    | #Params | Train res. | Train tokens | FID (256×256) | FID (512×512) | mFID |
|----------|---------|------------|--------------|---------------|---------------|------|
| DiT-XL/2 | 675M    | 256/512    | 1428B        | 2.27          | 3.04          | 2.66 |
| SiT-XL/2 | 675M    | 256/512    | 1428B        | 2.06          | 2.62          | 2.34 |
| SiT-REPA | 675M    | 256/512    | 525B         | 1.42          | 2.08          | 1.75 |
| NiT-XL   | 675M    | Native     | 197B         | 2.03          | 1.45          | 1.74 |

  • Simultaneous Multi-Resolution Capability: Unlike prior approaches that require separate models per resolution, NiT achieves comparable or superior FID in both 256×256 and 512×512 settings with a single set of weights. On ImageNet-512×512, NiT-XL attains FID 1.45, outperforming larger models such as EDM2-XXL.
  • Data/Compute Efficiency: The model achieves these results with far fewer training tokens, as there is no need to repeat computation across separate resolution-specific models.

4. Zero-Shot Generalization to Unseen Resolutions and Aspect Ratios

A major contribution of NiT is its strong generalization to novel resolutions and aspect ratios not encountered during training.

  • High-Resolution Synthesis: NiT (trained only on up to 512×512) achieves FID as low as 4.52 at 1024×1024 and produces plausible images up to 2048×2048, surpassing fixed-resolution models that degrade severely outside their training domain.
  • Aspect Ratio Generalization: The model maintains low FID and high IS across diverse ratios, including 1:3, 16:9, 4:3, 3:1, etc.
  • Ablative Analysis: Removing native-resolution data from the training set (e.g., training only on square images) eliminates zero-shot generalization. The inclusion and diversity of native/aspect-ratio images are necessary and sufficient for this generalization behavior.

This indicates that the model learns intrinsically scale- and shape-invariant distributional priors over visual content.
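
As a rough illustration of what zero-shot sampling at an unseen size involves, the sketch below builds a latent token grid of arbitrary shape at inference time and integrates the flow with a plain Euler scheme; the `model` call signature, its velocity-style output, and the latent channel count are hypothetical stand-ins rather than the paper's sampler.

```python
import torch

@torch.no_grad()
def sample_at(model, height: int, width: int, steps: int = 50,
              patch: int = 2, vae_down: int = 8, channels: int = 4):
    """Euler integration of the flow from pure noise to clean latents on a
    latent grid whose shape is chosen freely at inference time."""
    gh, gw = height // (vae_down * patch), width // (vae_down * patch)
    x = torch.randn(gh * gw, channels)                   # one token per latent patch
    rows, cols = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing="ij")
    dt = 1.0 / steps
    for i in range(steps):                               # integrate t from 1 down to 0
        t = 1.0 - i * dt
        v = model(x, rows.flatten(), cols.flatten(), t)  # assumed velocity-style output
        x = x - dt * v
    return x.reshape(gh, gw, channels)                   # hand off to the VAE decoder

# e.g. a 1024x1536 canvas never seen during training:
# latents = sample_at(nit_model, height=1024, width=1536)
```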

5. Comparison to Traditional Fixed-Resolution and Bucketed Approaches

| Aspect                | Traditional Transformer Diffusion | NiT (Native-resolution)            |
|-----------------------|-----------------------------------|------------------------------------|
| Input handling        | Cropped/scaled to N×N             | Original resolution, any shape     |
| Sequence handling     | Fixed-length, padded, or bucketed | Variable-length, packed sequences  |
| Positional encoding   | Learned/fixed grid                | Axial 2D RoPE, per-token position  |
| Architecture per size | One per resolution                | Single model, any resolution       |
| Generalization        | Limited (artifacts out-of-domain) | Robust even at unseen resolutions  |

Traditional models are prone to truncation artifacts, loss of semantics, or poor detail at non-square or unforeseen sizes. NiT, in contrast, synthesizes semantically faithful, visually coherent images across the full spectrum of possible resolutions without architectural changes.

6. Broader Implications and Future Applications

NiT introduces an architectural approach that mirrors the variable-length sequence handling of LLMs, suggesting a pathway toward unifying vision and text generative modeling:

  • Universal Visual Foundation Model: NiT serves as a universal generative model capable of deployment across domains where native resolution and aspect ratio diversity are required (e.g., photography, remote sensing, medical imaging, document analysis).
  • Seamless Multimodal Fusion: By supporting arbitrary input/output lengths, NiT can be naturally integrated with LLMs for vision-language tasks and future multimodal generative agents.
  • Video and Multi-frame Generation: The packed, variable-length mechanism is amenable to extension to temporal domains, supporting native-resolution video and animation synthesis.

The architecture’s explicit handling of packed sequences, robust 2D positional embedding, and per-instance normalization underpin its generalization and efficiency, positioning it as an influential backbone for next-generation vision AI systems.
