Scalable Image Synthesis
- Scalable image synthesis is a generative modeling approach that produces high-fidelity, consistent images across varied resolutions using techniques such as coordinate-based rendering, tokenization, and patchwise synthesis.
- The methodology employs efficient memory and compute strategies—including latent-space operations, tiled diffusion, and transformer-based attention—to maintain detail consistency and linear scaling with output size.
- Open challenges include artifact suppression and token bottlenecks at extreme resolutions, which motivate research into sparse attention and multimodal compositional frameworks.
Scalable image synthesis refers to generative modeling approaches that produce high-fidelity, consistent images across a broad range of spatial scales and resolutions, and sometimes aspect ratios or spatial extents, from sub-megapixel to multi-gigapixel, with compute and memory requirements that grow gracefully as output dimensions increase. Recent research aims to eliminate artifacts (e.g., "texture sticking," seams, or inconsistent details), avoid fixed-size architectural bottlenecks, and provide extensible frameworks for arbitrary-resolution or open-ended multimodal visual generation.
1. Core Architectural Paradigms for Scalability
Modern scalable image synthesis methods diverge from strictly hierarchical, convolutional generator pipelines by reformulating the generative process as either coordinate-based (implicit) computation, variable-length token modeling, or patchwise synthesis, frequently leveraging specialized representations to balance expressivity, consistency, and computational tractability.
Key frameworks include:
- Coordinate-based generators: Models such as CREPS (Nguyen et al., 2023) abandon spatial convolutions and upsampling. Every pixel is synthesized as a function of its (continuous) position and a style vector, with no dependence on a resolution-specific architectural hierarchy. The thick bi-line decomposition, which factors each feature map into separate thick row and column encodings, reduces per-layer memory and guarantees scale-equivariance.
- Variable-length tokenization and native-resolution modeling: NiT (Wang et al., 3 Jun 2025) reframes synthesis as a sequence modeling task, directly processing packed latent patches corresponding to each image’s "native" resolution and aspect ratio. Axial 2D rotary positional embeddings (RoPE) allow attention to generalize across arbitrary grids.
- Patchwise and tiled synthesis: InfinityGAN (Lin et al., 2021) and Any-Size-Diffusion (Zheng et al., 2023) generate arbitrarily large images via seamless, coordinate-aware patch decoding and tiled diffusion, respectively, which keeps patch boundaries consistent and makes memory cost independent of total output size.
- Hierarchical and pyramid representations: Ultra-high resolutions are achieved by decomposing the latent space into multiple spatial scales (e.g., PDM (Yang, 2024)), with diffusion or flow networks operating on each pyramid level. This design concentrates computational resources where needed and decouples global structure from local detail.
- Efficient Transformer attention: The hourglass architecture in HDiT (Crowson et al., 2024) imposes locality in attention patterns at high resolutions while retaining global coherence through global attention at the coarsest levels, reducing complexity from $O(n^2)$ to $O(n)$ in pixel count $n$.
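The patchwise idea in the list above can be illustrated with a toy sketch. It assumes a stand-in "generator" (`pixel_value`, a hypothetical sinusoid that is not taken from any of the cited models) evaluated at global, continuous coordinates. Because each tile is computed from its position in the full image rather than from its neighbors, tiles can be produced independently and still agree exactly at their boundaries.

```python
import math

def pixel_value(u, v, style):
    """Toy stand-in for a generator: intensity at continuous position (u, v)."""
    freq, phase = style  # hypothetical 'style' parameters
    return 0.5 + 0.5 * math.sin(2 * math.pi * freq * (u + v) + phase)

def render_patch(x0, y0, w, h, full_w, full_h, style):
    """Render a w-by-h patch whose top-left pixel is (x0, y0) in the full image."""
    return [
        [pixel_value((x + 0.5) / full_w, (y + 0.5) / full_h, style)
         for x in range(x0, x0 + w)]
        for y in range(y0, y0 + h)
    ]

def render_tiled(full_w, full_h, tile, style):
    """Assemble the full image from tiles that are generated one at a time."""
    image = [[0.0] * full_w for _ in range(full_h)]
    for y0 in range(0, full_h, tile):
        for x0 in range(0, full_w, tile):
            w = min(tile, full_w - x0)   # edge tiles may be smaller
            h = min(tile, full_h - y0)
            patch = render_patch(x0, y0, w, h, full_w, full_h, style)
            for dy, row in enumerate(patch):
                image[y0 + dy][x0:x0 + w] = row
    return image

style = (3.0, 0.25)
whole = render_patch(0, 0, 20, 12, 20, 12, style)   # single-pass reference
tiled = render_tiled(20, 12, 8, style)               # 8x8 tiles, ragged edges
print(tiled == whole)  # True: tiles match the single-pass render exactly
```

In this sketch only one tile is being generated at any time, so the working set of the generator depends on the tile size rather than on the output size. The assembled `image` still lives in memory here; a real pipeline could instead stream finished tiles to disk.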
2. Conditioning, Consistency, and Multi-Scale Fidelity
Scalable synthesis architectures emphasize mechanisms for resolution- and scale-consistent detail, both within and beyond the resolutions seen in training:
- Scale-consistent positional embeddings: Arbitrary-Scale Image Synthesis (Ntavelis et al., 2022) derives positional grids whose spacings are aligned across resolutions and removes zero padding, preserving translation- and scale-equivariance throughout the generator.
- Continuous-scale training: Anyres-GAN (Chai et al., 2022) samples patches at variable scales and positions, explicitly conditioning the synthesis network on both coordinates and scale, training over diverse native-resolution data.
- Multi-view and 3D-aware generation: EscherNet (Kong et al., 2024) enables simultaneous multi-view synthesis by fusing reference and target tokens (augmented with specialized camera positional encodings) via self- and cross-attention transformers, scaling up to 100+ views.
- Pyramidal latent representations: PDM (Yang, 2024) uses a set of latent feature maps across decreasing resolutions, integrating coarse and fine information during the decoding phase for robust global structure with high-frequency details.
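To make the notion of resolution-aligned positional grids concrete, the following sketch compares two common coordinate conventions. It is a generic illustration, not the construction used in any of the cited papers: `pixel_centers` uses half-pixel centers on the unit interval, while `corner_aligned` pins the first and last samples to 0 and 1. Both function names are hypothetical.

```python
def pixel_centers(n):
    """Pixel-center coordinates of an n-sample axis on the unit interval."""
    return [(i + 0.5) / n for i in range(n)]

def corner_aligned(n):
    """Naive grid that pins the first and last samples to 0 and 1."""
    return [i / (n - 1) for i in range(n)]

def pair_means(coords):
    """Average neighbouring pairs: the centers of 2x-downsampled pixels."""
    return [(coords[i] + coords[i + 1]) / 2 for i in range(0, len(coords), 2)]

def max_drift(fine, coarse):
    """Largest mismatch between the downsampled fine grid and the coarse grid."""
    return max(abs(a - b) for a, b in zip(pair_means(fine), coarse))

# Half-pixel centers: the 8-sample grid collapses exactly onto the 4-sample grid.
drift_centered = max_drift(pixel_centers(8), pixel_centers(4))

# Corner-aligned grids: the same position gets different coordinates per resolution.
drift_corner = max_drift(corner_aligned(8), corner_aligned(4))

print(drift_centered, drift_corner)
```

With the centered convention the drift is zero, so a coordinate-conditioned network sees the same position value for the same image location at every resolution. With the corner-aligned convention the drift is about 0.07 of the image width in this example, which is one way resolution-dependent inconsistencies can enter a generator.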
Empirical evaluations consistently demonstrate that these conditioning mechanisms yield favorable FID, SSIM, and perceptual consistency metrics on both standard and novel-scale benchmarks, often matching or approaching the strongest fixed-scale models at each size.
3. Memory, Compute, and Throughput Scaling
A central requirement for true scalability is that memory and runtime cost increase at most linearly in output pixel count and avoid quadratic blowup in self-attention or other operations.
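The difference between quadratic and linear growth can be checked with simple counting. The sketch below tallies query-key pairs for full self-attention and for a fixed local window; the window size of 7 and the helper names are illustrative assumptions, and border effects are ignored.

```python
def global_pairs(tokens):
    """Query-key pairs scored by full self-attention over `tokens` tokens."""
    return tokens * tokens

def neighborhood_pairs(tokens, window):
    """Pairs when each token attends to a window-by-window neighborhood.

    Border effects are ignored, so this is an upper bound.
    """
    return tokens * window * window

WINDOW = 7  # hypothetical neighborhood size

for side in (32, 64, 128):
    n = side * side  # one token per grid cell
    print(f"{side}x{side}: tokens={n}, "
          f"global={global_pairs(n)}, local={neighborhood_pairs(n, WINDOW)}")
```

Doubling the side length quadruples the token count, which multiplies the global pair count by 16 but the local pair count by only 4. This is the gap that the methods below try to close.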
- Latent-space modeling: Methods such as LSSGen (Tang et al., 22 Jul 2025) and STARFlow (Gu et al., 6 Jun 2025) operate progressively in the autoencoder’s compressed latent space. Latent upsampling, learned at low cost, enables efficient multi-resolution diffusion or flow processing without repeated passage through the encoder/decoder.
- Hierarchical attention and downsampling: The hourglass transformer structure in HDiT (Crowson et al., 2024) reduces self-attention scope at fine resolutions through neighborhood attention, confining quadratic $O(n^2)$ global attention to the low-resolution bottleneck stages ($n$ = token count).
- Tiling and memory-efficient upsampling: Any-Size-Diffusion (ASD; Zheng et al., 2023) partitions the synthesis process into a composition-adaptive stage (ARAD) and a fast, implicitly overlapping tiled diffusion stage (FSTD). This avoids seams and drops peak memory from scaling with the full output area to scaling with the area of a single small tile.
- Packed variable-length transformers: NiT (Wang et al., 3 Jun 2025) introduces a packing strategy for variable-length latent streams within a constant total token budget, enabling high-throughput training and inference across images of disparate resolutions and aspect ratios.
- Scaling and compression in flows and autoregressive models: STARFlow (Gu et al., 6 Jun 2025) leverages TARFlow’s deep–shallow transformer design for normalizing flows, limiting both attention and flow-layer depth while retaining universality for continuous densities.
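As an illustration of packing variable-length token sequences under a fixed budget, the sketch below uses first-fit-decreasing bin packing. This is a generic heuristic chosen for clarity and is not claimed to be the packing algorithm of any cited model; the patch size of 16, the budget of 1024 tokens, and all function names are assumptions.

```python
def token_count(height, width, patch=16):
    """Tokens for an image whose sides are multiples of `patch` (assumed size)."""
    return (height // patch) * (width // patch)

def pack_sequences(lengths, budget):
    """First-fit-decreasing packing.

    Returns a list of bins; each bin is a list of indices into `lengths`
    whose total length does not exceed `budget`.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, loads = [], []
    for i in order:
        if lengths[i] > budget:
            raise ValueError(f"sequence {i} exceeds the token budget")
        for b, load in enumerate(loads):
            if load + lengths[i] <= budget:   # first bin with enough room
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:                                  # no bin had room: open a new one
            bins.append([i])
            loads.append(lengths[i])
    return bins

# Images of mixed resolution and aspect ratio, given as (height, width).
sizes = [(256, 256), (512, 256), (128, 384), (512, 512), (256, 128)]
lengths = [token_count(h, w) for h, w in sizes]
BUDGET = 1024  # hypothetical per-batch token budget

bins = pack_sequences(lengths, BUDGET)
for b in bins:
    print(b, sum(lengths[i] for i in b))
```

Each bin becomes one fixed-size training sequence. In a transformer, packed sequences would additionally need an attention mask that prevents tokens of different images from attending to each other; that part is omitted here.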
4. Comparative Performance and Empirical Benchmarks
The leading scalable synthesis models systematically report quantitative measures across multiple resolution regimes, with FID, CLIP-score, IS, SSIM/LPIPS, and user studies.
Selected results illustrating the scalability trend:
| Model (dataset, resolution) | FID (lower is better) | Compute/memory notes |
|---|---|---|
| CREPS (FFHQ 512²/1024²) | 4.43 / 4.09 | Reduced per-layer memory, 0.05 s/step (Nguyen et al., 2023) |
| HDiT (FFHQ 1024²) | 5.23 | Linear scaling in pixel count (Crowson et al., 2024) |
| PDM (CelebA-HQ 1024²/2K²) | 4.10 / 18.9 | 3-5× VRAM reduction (Yang, 2024) |
| NiT (ImageNet 512²/1536²) | 1.45 / 6.51 | Zero-shot scaling, native-resolution tokenization (Wang et al., 3 Jun 2025) |
| InfinityGAN (8192²) | 121.2 (ScaleInv FID) | Constant memory, full parallelism (Lin et al., 2021) |
| STARFlow (ImageNet 256²/512²) | 2.40 / 3.00 | Latent, end-to-end MLE (Gu et al., 6 Jun 2025) |
| ASD (MM-CelebA-HQ up to 18432²) | 85.3 | Implicit overlap, 2× speedup over baseline (Zheng et al., 2023) |
| DART (ImageNet, 16 steps) | 5.5 (DiT-16 baseline: 8.0) | Non-Markovian AR, KV-caching (Gu et al., 2024) |
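These results suggest that scalable designs can combine nearly constant or sublinear memory growth with competitive, and in some cases state-of-the-art, sample fidelity and alignment to text/image conditions at output sizes or aspect ratios that were previously out of reach. The figures come from different datasets and metric variants, so they are not directly comparable across rows.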
5. Multimodal and Compositional Scalability
Next-generation scalable frameworks, motivated by cross-modal and open-set synthesis, rely on modular condition fusion, minimal parameter expansion, and plug-in compositionality:
- Composable multimodality: DiffBlender (Kim et al., 2023) separates conditions into three orthogonal channels (image-form, spatial token, non-spatial token) and fuses each with minimal UNet changes. New modalities are supported by simply introducing their encoders and updating less than 5% of the backbone.
- Retrieval-augmented semi-parametric approaches: Semi-Parametric Neural Image Synthesis (Blattmann et al., 2022) uses an external, untrained retrieval database to offload visual memorization, allowing small core networks to synthesize diverse outputs and swap domains post hoc by exchanging retrieval sets.
- Scalable view and 3D synthesis: EscherNet (Kong et al., 2024) supports simultaneous multi-view diffusion with pose-aware attention, dramatically scaling both number of views and 3D scene complexity, and opening unified pipelines for view synthesis, 3D reconstruction, and novel view extrapolation.
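The retrieval-augmented idea above, where the external database can be exchanged after training, can be sketched with a nearest-neighbor lookup. The three-dimensional embeddings, database contents, and function names below are toy assumptions; a real system would use a learned image or text encoder and would pass the retrieved items to the generator as conditioning.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, database, k):
    """Return the names of the k database entries most similar to the query."""
    ranked = sorted(database,
                    key=lambda name: cosine(query, database[name]),
                    reverse=True)
    return ranked[:k]

# Two interchangeable retrieval sets with toy embeddings (hypothetical data).
photos = {
    "photo_cat": [0.9, 0.1, 0.0],
    "photo_dog": [0.7, 0.3, 0.1],
    "photo_car": [0.0, 0.2, 0.9],
}
paintings = {
    "painting_cat": [0.8, 0.2, 0.1],
    "painting_boat": [0.1, 0.1, 0.9],
    "painting_tree": [0.2, 0.9, 0.1],
}

query = [1.0, 0.1, 0.0]
print(retrieve(query, photos, k=2))      # neighbors from the photo set
print(retrieve(query, paintings, k=2))   # same query, different domain
```

The generator itself is untouched in this sketch: only the database argument changes, which is the sense in which a semi-parametric model can switch domains without retraining.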
6. Open Challenges, Limitations, and Future Directions
While scalable image synthesis models exhibit impressive generalization and computational scaling, open questions remain:
- Consistency and artifact suppression: Achieving perfect scale-consistency and micro-structure fidelity across arbitrary or extreme dimensions remains a challenge. CREPS, for example, observes banding and floating-blob artifacts due to per-pixel independence (Nguyen et al., 2023). Explicit "scale-consistency" loss terms, robust coordinate regularization, or structural priors are areas of active research.
- Token and memory bottlenecks at extreme resolutions: Fully transformer-based approaches such as NiT reach limits due to quadratic or superlinear memory with token count. Sparse attention, hierarchical token reduction, and hybrid latents are potential solutions (Wang et al., 3 Jun 2025).
- Extending beyond 2D images: Expansion to spatio-temporal video, full 3D, multi-modal, and interactive synthesis spaces will require further breakthroughs in scalable attention, multi-branch latent representation, and dynamic condition fusion (Kong et al., 2024).
- Data diversity and training regimes: Native-resolution and mixed-resolution training protocols leverage "wild" image collections, but fully smoothing the empirical distribution, particularly at extreme aspect ratios or resolutions, is still an open challenge. Automated curriculum or token-packing schedules are under exploration (Wang et al., 3 Jun 2025).
- Modal compositionality: Integrating audio, textual, spatial, and even sensor-driven guidance in a truly open-ended compositional architecture, without retraining the generative backbone for each modality, is a long-term objective. Modular plug-and-play adapters and latent-space fusion strategies are current directions (Kim et al., 2023).
7. Summary and Outlook
Scalable image synthesis now encompasses a spectrum of generator architectures—coordinate- and token-based, patchwise and pyramidal, latent and pixel space—enabling high-fidelity, consistent image (and view, and modality) generation at arbitrary and massive scales. These models combine scale-consistent conditioning, efficient memory and compute strategies, and extensible compositionality, producing outputs that match or exceed fixed-resolution baselines while opening the field to new tasks and application domains. Key challenges at the frontier include complete artifact suppression under aggressive extrapolation, ultra-large-token scalability, generalized compositionality, and domain-agnostic adaptability. The field is rapidly evolving toward foundation models for open-ended, resolution- and modality-agnostic image synthesis.
Key references: (Nguyen et al., 2023, Crowson et al., 2024, Wang et al., 3 Jun 2025, Yang, 2024, Zheng et al., 2023, Lin et al., 2021, Chai et al., 2022, Ntavelis et al., 2022, Tang et al., 22 Jul 2025, Gu et al., 6 Jun 2025, Gu et al., 2024, Kim et al., 2023, Blattmann et al., 2022, Kong et al., 2024, Esser et al., 2020, Zhu et al., 2021, Chen et al., 2017).