
Scalable GANs with Transformers (2509.24935v1)

Published 29 Sep 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions such as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines.

Summary

  • The paper proposes a novel transformer-based GAN (GAT) that leverages latent-space training to address early-layer underutilization and optimization instability.
  • It implements Multi-level Noise-perturbed image Guidance (MNG) for coarse-to-fine synthesis and adaptive learning rate scaling to ensure stable training across model sizes.
  • Experiments achieve a state-of-the-art FID of 2.96 on ImageNet-256 within 40 epochs, demonstrating the framework's efficiency and scalability.

Scalable GANs with Transformers: A Technical Analysis

Introduction

The paper "Scalable GANs with Transformers" (2509.24935) presents a systematic paper of Generative Adversarial Networks (GANs) scalability by leveraging transformer architectures and latent-space training. The authors introduce Generative Adversarial Transformers (GAT), a framework that combines pure transformer-based generators and discriminators operating in the latent space of a Variational Autoencoder (VAE). The work addresses two critical challenges in scaling GANs: underutilization of early generator layers and instability in optimization as model capacity increases. The proposed solutions—Multi-level Noise-perturbed image Guidance (MNG) and width-aware adaptive learning rate scaling—enable reliable training of GANs from small to extra-large scales, achieving state-of-the-art single-step generation performance on ImageNet-256. Figure 1

Figure 1: Curated examples of GAT-XL/2 on ImageNet-256, demonstrating strong generation capability (FID 2.96) within 40 epochs and effective latent interpolation.

Architectural Design and Training Paradigm

Transformer-based GANs in Latent Space

GAT employs Vision Transformer (ViT) backbones for both generator and discriminator, with architectural modifications to facilitate adversarial training. The generator omits the patchify layer, instead using an unpatchify layer (a linear decoder) to synthesize images from latent representations. Conditioning is achieved via a mapping network that produces style vectors from latent codes and class labels, modulating features through adaptive normalization and Layerscale. The discriminator similarly uses a ViT backbone with Layerscale and a dedicated [cls] token for real/fake classification.

Training is performed in the latent space of a pre-trained VAE (SD-VAE), reducing computational cost and enabling efficient scaling. All models are trained at a spatial resolution of 32 × 32 in the latent space, corresponding to 256 × 256 in pixel space.
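
To make the conditioning pathway concrete, here is a minimal PyTorch sketch (module names, dimensions, and the small Layerscale initialization value are illustrative assumptions, not the authors' code): a mapping network turns the latent code and class embedding into a style vector w, each block is modulated through an adaptive normalization scale and Layerscale, and a linear unpatchify head projects tokens back to latent-space patches. Note that nn.RMSNorm requires a recent PyTorch (≥ 2.4).

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps latent code z and class embedding c to a style vector w (illustrative sizes)."""
    def __init__(self, z_dim=64, c_dim=256, w_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + c_dim, w_dim), nn.SiLU(),
            nn.Linear(w_dim, w_dim), nn.SiLU(),
            nn.Linear(w_dim, w_dim),
        )

    def forward(self, z, c_emb):
        return self.net(torch.cat([z, c_emb], dim=-1))

class StyleModulatedBlock(nn.Module):
    """Transformer block whose normalization scale and residual Layerscale follow the style w."""
    def __init__(self, dim=512, w_dim=512, heads=8):
        super().__init__()
        self.norm = nn.RMSNorm(dim)                      # shift omitted, as with RMSNorm
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_scale = nn.Linear(w_dim, dim)            # adaptive normalization scale from w
        self.layerscale = nn.Parameter(1e-4 * torch.ones(dim))  # small init for stability

    def forward(self, x, w):
        scale = 1.0 + self.to_scale(w).unsqueeze(1)      # (B, 1, dim)
        h = self.norm(x) * scale
        h, _ = self.attn(h, h, h, need_weights=False)
        return x + self.layerscale * h

class Unpatchify(nn.Module):
    """Linear decoder: tokens -> latent-space image (e.g. 4x32x32 SD-VAE latents)."""
    def __init__(self, dim=512, patch=2, latent_ch=4, latent_res=32):
        super().__init__()
        self.patch, self.latent_ch, self.latent_res = patch, latent_ch, latent_res
        self.proj = nn.Linear(dim, patch * patch * latent_ch)

    def forward(self, tokens):  # tokens: (B, N, dim) with N = (latent_res / patch) ** 2
        b = tokens.shape[0]
        g = self.latent_res // self.patch
        x = self.proj(tokens).view(b, g, g, self.patch, self.patch, self.latent_ch)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(b, self.latent_ch, self.latent_res, self.latent_res)
```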

Multi-level Noise-perturbed image Guidance (MNG)

A key innovation is the MNG strategy, which activates early generator layers by providing intermediate supervision. The generator is divided into K stages, each producing auxiliary outputs. These outputs are perturbed with Gaussian noise of decreasing strength along the depth, forming a coarse-to-fine synthesis trajectory. The discriminator receives all perturbed outputs, guiding each generator stage to learn the appropriate level of structure. This mechanism ensures uniform layer utilization and mitigates the tendency of transformers to concentrate synthesis in late layers.

Figure 2: Visualization of intermediate generator features and their effects; MNG activates early layers, yielding a coarse-to-fine synthesis process and improved perceptual contribution.
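
A minimal sketch of the MNG objective from the generator's side (the linear noise schedule and the non-saturating loss form are assumptions for illustration; the paper uses a relativistic pairing loss): each of the K intermediate outputs is perturbed with Gaussian noise whose strength decreases with depth, and every perturbed output is scored by the discriminator.

```python
import torch
import torch.nn.functional as F

def mng_generator_term(intermediate_outputs, discriminator, class_labels,
                       sigma_max=1.0, sigma_min=0.05):
    """Illustrative MNG generator loss.

    intermediate_outputs: list of K latent-space images, ordered shallow -> deep;
    earlier stages are perturbed with stronger Gaussian noise (coarse-to-fine).
    """
    k = len(intermediate_outputs)
    sigmas = torch.linspace(sigma_max, sigma_min, steps=k)  # assumed linear schedule
    loss = 0.0
    for x_stage, sigma in zip(intermediate_outputs, sigmas):
        noisy = x_stage + sigma * torch.randn_like(x_stage)
        fake_logits = discriminator(noisy, class_labels)
        # Non-saturating objective per perturbed output (stand-in for the paper's
        # relativistic pairing loss).
        loss = loss + F.softplus(-fake_logits).mean()
    return loss / k
```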

Adaptive Learning Rate Scaling

The authors identify that scaling model width and depth in GANs leads to instability if hyperparameters are not adjusted. Unlike diffusion models, GANs require the learning rate to be scaled inversely with channel width to maintain consistent update magnitudes. The proposed rule sets the learning rate $\eta_{\text{adapt}} = \eta_{\text{base}} \cdot C_{\text{base}} / C_{\text{model}}$, where $C$ is the channel size. This principled adjustment stabilizes training across scales without manual tuning.

Figure 3: Ablation study showing that adaptive learning rate scaling ensures stable convergence across model sizes, while improper scaling leads to divergence or slow learning.
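
The rule itself is a one-liner; the sketch below uses placeholder widths and a placeholder base learning rate, since the exact values are not restated here.

```python
def width_aware_lr(base_lr: float, base_width: int, model_width: int) -> float:
    """eta_adapt = eta_base * (C_base / C_model): wider models get a proportionally
    smaller learning rate so per-step output changes stay roughly constant."""
    return base_lr * base_width / model_width

# Example with placeholder values: if a small model with C=384 trains well at 2e-4,
# a model three times wider would use roughly a third of that rate.
lr_small = width_aware_lr(2e-4, base_width=384, model_width=384)   # 2.0e-4
lr_xl    = width_aware_lr(2e-4, base_width=384, model_width=1152)  # ~6.7e-5
```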

Empirical Results and Scalability Analysis

State-of-the-Art Performance

GAT-XL/2 achieves an FID of 2.96 on ImageNet-256 in only 40 epochs, outperforming strong baselines such as GigaGAN and MeanFlow, which require significantly more training epochs. The model maintains GAN advantages, including single-step inference and semantic latent space manipulation.

Scaling Behavior

Experiments across four model capacities (S, B, L, XL) demonstrate monotonic improvement in FID as model size increases. The scaling trend persists throughout training, not just at convergence, indicating robust scalability. Patch size experiments confirm that GAT is resilient to tokenization granularity, and GFLOPs analysis reveals a strong negative correlation (−0.95) between compute and FID, substantiating the claim that higher compute yields better generative performance.

Figure 4: Scalability of GAT; larger models and higher computational power systematically improve FID, confirming effective utilization of transformer scalability.

Ablation and Further Analysis

Ablation studies validate the necessity of MNG, adaptive learning rate, and VFM alignment objectives. MNG consistently enhances performance and distributes generative responsibility across layers. The REPA objective, aligning discriminator features with Vision Foundation Models (VFMs), further improves synthesis quality, indicating that representation learning advances from diffusion models transfer effectively to GANs.
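
A REPA-style alignment term can be sketched as follows (the projection width and the cosine-similarity form are illustrative assumptions): tokens from an intermediate discriminator layer are projected by a small MLP and pulled toward features from a frozen VFM such as DINOv2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaHead(nn.Module):
    """3-layer MLP projecting discriminator tokens into the VFM feature space (sizes illustrative)."""
    def __init__(self, disc_dim=512, vfm_dim=768, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(disc_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, vfm_dim),
        )

    def forward(self, disc_tokens):          # (B, N, disc_dim)
        return self.mlp(disc_tokens)

def repa_alignment_loss(disc_tokens, vfm_tokens, head):
    """Negative cosine similarity between projected discriminator tokens and frozen
    VFM (e.g. DINOv2) tokens; vfm_tokens should come from a no-grad forward pass."""
    proj = F.normalize(head(disc_tokens), dim=-1)
    target = F.normalize(vfm_tokens.detach(), dim=-1)
    return 1.0 - (proj * target).sum(dim=-1).mean()
```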

Decoupled scaling of generator and discriminator shows that increasing discriminator capacity yields greater improvements in FID than scaling the generator, highlighting the centrality of discriminator representation learning in adversarial frameworks.

Figure 5: Decoupled scaling analysis; scaling the discriminator is more effective than scaling the generator, and feature alignment with VFMs is higher for fake data.

Implementation Details

The models are implemented using PyTorch with bfloat16 precision, gradient checkpointing, and Scaled Dot-Product Attention. The generator uses a latent code of dimension 64, with style modulation and Layerscale initialized to small values for stability. Four intermediate outputs are synthesized for MNG, spaced uniformly across transformer blocks. The discriminator employs a 3-layer MLP for VFM alignment and uses DINOv2-B as the reference model. Differentiable augmentation is applied in conjunction with noise perturbations.

For GAT-XL/2, training on ImageNet-256 for 40 epochs requires approximately 12 days on 8× NVIDIA RTX A6000 GPUs. Hyperparameters are shared across model sizes except for the learning rate, which is adapted as described.
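
Assembled from the stated details (AdamW betas, width-aware learning rate, bfloat16, EMA, gradient checkpointing, SDPA), the training setup translates roughly into the configuration below; the base learning rate and widths are placeholders rather than the paper's values.

```python
import torch

def build_optimizers(generator, discriminator, base_lr=2e-4, base_width=384, model_width=1152):
    """AdamW with (beta1, beta2) = (0.0, 0.99) as reported, and the width-aware
    learning rate; numeric defaults here are placeholders."""
    lr = base_lr * base_width / model_width
    opt_g = torch.optim.AdamW(generator.parameters(), lr=lr, betas=(0.0, 0.99))
    opt_d = torch.optim.AdamW(discriminator.parameters(), lr=lr, betas=(0.0, 0.99))
    return opt_g, opt_d

# Other reported training-loop elements, shown schematically:
#   - bfloat16 autocast:      with torch.autocast("cuda", dtype=torch.bfloat16): ...
#   - EMA of the generator with decay 0.999 (e.g. torch.optim.swa_utils.AveragedModel)
#   - gradient checkpointing on transformer blocks (torch.utils.checkpoint.checkpoint)
#   - attention via torch.nn.functional.scaled_dot_product_attention
```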

Theoretical and Practical Implications

The work demonstrates that GANs, when paired with transformer architectures and latent-space training, can be scaled reliably and efficiently, matching or surpassing the performance of diffusion and autoregressive models in single-step generation tasks. The findings challenge the prevailing notion that GANs are inherently less scalable than other generative paradigms. The proposed MNG and adaptive learning rate strategies are lightweight, easy to implement, and generalizable to other adversarial frameworks.

The strong empirical results suggest that further scaling and longer training could yield additional gains. The centrality of discriminator representation learning, as evidenced by decoupled scaling and VFM alignment, points to promising directions for future research, including more sophisticated alignment objectives and integration with foundation models.

Conclusion

"Scalable GANs with Transformers" provides a comprehensive framework for scaling GANs using transformer architectures and latent-space training. By addressing early-layer underuse and scale-coupled instability, the GAT framework achieves state-of-the-art single-step generation on ImageNet-256 with high data efficiency. The work establishes GANs as a competitive and scalable alternative to diffusion and autoregressive models, with practical implications for efficient high-fidelity image synthesis and theoretical insights into adversarial learning dynamics.


Explain it Like I'm 14

Overview

This paper is about making a type of AI image generator, called a GAN (Generative Adversarial Network), work well when it gets bigger. The authors build a new GAN that uses transformers (the same kind of model behind many modern AI systems) and trains in a compact space called a “latent space” from a VAE (Variational Autoencoder). Their goal is to keep GANs fast and high-quality as they scale up, and to show clear rules for making bigger models train stably.

What questions does the paper ask?

The paper focuses on two main questions:

  • How can we design a GAN that gets better as we make it larger, without falling apart during training?
  • What simple tricks help a big GAN use all its layers effectively and keep training stable?

How does their method work?

Think of a GAN as an artist and a critic:

  • The generator (artist) tries to create realistic images.
  • The discriminator (critic) checks if images look real or fake.

Here’s what the authors changed and added:

  • Training in VAE latent space: Instead of drawing full-size pictures pixel by pixel, the generator works in a compact “code-like” image space made by a VAE. Imagine shrinking a high-res photo into a small, smart thumbnail that still captures important details. This makes training much faster while keeping image quality.
  • Pure transformers for both generator and discriminator: Transformers are models that “pay attention” to important parts of data. They scale well with more layers and wider networks, so the team uses transformer blocks in both the artist and the critic.
  • Fixing “sleepy” early layers with multi-level noise guidance (MNG): When GANs get larger, early layers of the generator can become lazy and not contribute much. To fix this, the generator outputs several intermediate images along the way. The discriminator sees these images after adding different amounts of noise:
    • Early outputs are matched to more heavily noised targets (learn big shapes).
    • Later outputs are matched to lightly noised or clean targets (add fine details).
    • This encourages a “coarse-to-fine” process, so every layer has a job.
  • Keeping training stable with a width-aware learning rate: Bigger models can change too quickly with the same learning rate, causing instability. The authors use a simple rule: as the model gets wider (more channels), reduce the learning rate roughly in proportion. It’s like driving a faster car—you turn the steering wheel more gently to stay in control.
  • Better discriminator features using a VFM (Vision Foundation Model): They nudge the discriminator’s internal features to look similar to features from a strong, pre-trained vision model (like DINOv2). This helps the critic learn more meaningful visual understanding, which then guides the generator better.

What did they find?

The authors tested their method on ImageNet at 256×256 resolution (a large dataset of labeled images) with class-conditional generation (the model is told which category to draw, like "tiger" or "guitar"). They used a score called FID (Fréchet Inception Distance), where lower is better.

Key results:

  • Their largest model, GAT-XL/2, reaches a state-of-the-art FID of about 2.96 for single-step image generation, and does it in only 40 epochs (around 6 times fewer training rounds than strong baselines). “Single-step” means it makes an image in one pass, much faster than diffusion models that need many steps.
  • Scaling works: Larger models consistently get better scores. More compute (GFLOPs) strongly correlates with better FID.
  • The multi-level noise guidance (MNG) makes early layers actively contribute and improves image quality throughout training.
  • The adaptive learning rate rule prevents training from diverging and makes scaling simple—no heavy re-tuning for each size.
  • Aligning the discriminator’s features with a strong vision model further boosts performance.

Why is this important?

  • Speed: GANs can generate images in one step, which is much faster than methods that need hundreds of steps. This makes them great for real-time or large-scale use.
  • Quality and scalability: The paper shows that GANs can match or beat other models in quality when designed and trained carefully, and they can improve as they get bigger.
  • Simple, practical rules: The two main fixes—multi-level noise supervision and width-aware learning rate—are easy to implement and make training more reliable.

What does this mean for the future?

  • Better, faster image generators: This work suggests we can build big, fast, high-quality GANs using transformers and latent-space training, making them useful for apps like instant image creation, video, or 3D.
  • Clear scaling recipes: The simple training rules help teams avoid trial-and-error when making larger GANs.
  • Stronger critics matter: Improving the discriminator’s understanding of images (using foundation models) can significantly improve the generator. Future work may focus even more on smarter discriminators.

Overall, the paper shows a practical path to scaling GANs: use transformer backbones, train in VAE latent space for efficiency, guide early layers with multi-level noise, and adjust learning rate by width. With these tools, they achieve top-tier results quickly and reliably.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of unresolved issues and uncertainties that could guide future research directions:

  • Generalization beyond ImageNet-256: No evaluation on other datasets (e.g., FFHQ, LSUN, COCO, LAION subsets), domains (medical, satellite), or long-tail/imbalanced settings; robustness to distribution shift remains unknown.
  • Resolution scalability: Only 256×256 is reported; feasibility and performance at higher resolutions (512, 1024, 2048) are untested, especially given quadratic token growth in transformers.
  • Conditioning scope: Results are class-conditional only; applicability to unconditional and text-conditional generation (e.g., CLIP/LLM conditioning) is unexplored.
  • Tokenizer dependence: The approach relies on SD-VAE with 8× downsampling and fixed 32×32 latents; the impact of the VAE bottleneck (reconstruction artifacts, semantic loss) on adversarial learning and ultimate image quality is not analyzed.
  • Alternative latent/tokenizers: No comparison to alternative latent spaces (e.g., improved VAEs, VQ-VAEs, MAE-like continuous latents) or to jointly fine-tuning the tokenizer with G/D.
  • End-to-end training: The effects of jointly updating the VAE (encoder/decoder) with GAN training are not studied (stability, quality, compute trade-offs).
  • MNG design space: Multi-level Gaussian noise is the only corruption tested; other perturbations (blur, masking, downsample, JPEG, stochastic augmentations, diffusion-style noise schedules) and hybrid schemes are not evaluated.
  • Number and placement of intermediate outputs: Choice of K=4 and uniform stage spacing lacks sensitivity analysis; optimal K, stage placement, and adaptive schedules are unknown.
  • Discriminator interface for multi-level inputs: Architectural alternatives for ingesting and fusing multiple noised outputs (token concatenation vs. cross-attention vs. shared towers) are not compared.
  • Memory/compute overhead of MNG: Wall-clock training cost, memory footprint, and throughput impact of multiple intermediate outputs and noised passes are not quantified.
  • Stability and theory of LR scaling: The proposed width-aware learning-rate rule does not address depth, batch size, optimizer betas, EMA decay, weight decay, or gradient clipping; theoretical justification and cross-setting robustness are limited.
  • Generality of LR rule: Applicability to non-transformer GANs, different normalizations (LayerNorm vs RMSNorm), activations, or architectures with non-constant channel width is untested.
  • Discriminator–generator capacity ratio: While scaling D helps more than G, the optimal capacity allocation (G:D parameters/compute) and dynamic scheduling across training are not established.
  • Training dynamics: Detailed analyses of adversarial game stability (e.g., oscillations, mode collapse, gradient norms, spectral properties) and how MNG/LR scaling modulate them are absent.
  • Regularization interactions: Interplay of approximated gradient penalty, DiffAug, Layerscale, and MNG is not disentangled; ablations on each component’s necessity at scale are limited.
  • Metric coverage: Only FID (5K/50K) is reported; no precision/recall, density/coverage, IS, CLIP score, human preference, or diversity metrics; mode coverage and trade-offs with guidance are unknown.
  • Memorization and privacy: No nearest-neighbor, train–test overlap, or membership inference analyses; risk of memorization at XL scale is unassessed.
  • Seed variance and reliability: Training variance across random seeds and run-to-run stability are not reported; reproducibility under minimal hyperparameter tuning is unclear.
  • Compute–performance trade-offs: GFLOPs vs FID is shown for generator forward passes, but not end-to-end wall-clock, energy, memory, or throughput; fairness vs baselines in compute budgets is not established.
  • Patch size scaling: Only p=2 and p=4 tested on smaller models; impact of tokenization granularity at larger scales (L/XL) and on visual fidelity vs compute is not characterized.
  • Latent dimension: The latent size is fixed (d_z=64); effects of latent dimensionality on diversity, controllability, and stability are not explored.
  • Latent semantics: Claims of latent interpolation are qualitative; no quantitative disentanglement or semantic direction analysis is provided.
  • Inference-time guidance: Latent-space guidance is briefly used; its effects on diversity, class consistency, and failure modes (e.g., overfitting to class means) are not systematically studied.
  • Bias and feature leakage: REPA aligns D to DINOv2 (ImageNet-trained); potential leakage, domain bias, and fairness impacts on evaluation (FID) are not analyzed.
  • Domain transfer of REPA: Efficacy of VFM alignment on non-ImageNet domains, high resolutions, or with different VFMs (e.g., CLIP variants) is untested; when does REPA hurt/help?
  • MNG vs MSG and coarse-to-fine alternatives: Comparative studies with hierarchical resolution generators, explicit pyramids, and self-conditioning approaches are limited.
  • Early-layer underuse: Root causes beyond lack of supervision (e.g., attention patterns, token mixing, initialization) are not deeply probed; architectural remedies other than MNG are not evaluated.
  • Layerscale and RMSNorm choices: Criticality of Layerscale initialization, RMSNorm vs LayerNorm, qk-normalization, RoPE, and SwiGLU is not ablated for stability and scale-up.
  • Data efficiency claims: While fewer epochs are used, compute-normalized comparisons (FLOPs×epochs) against baselines are missing; benefits under limited data regimes are unknown.
  • OOD robustness: Behavior under corruptions (ImageNet-C/P), stylization, adversarial perturbations, or domain shifts has not been evaluated.
  • Safety and content control: The framework’s controllability, safety filtering, and bias mitigation (especially for scaling to web-scale data) are unaddressed.
  • Failure case taxonomy: No qualitative/quantitative analysis of common failure modes (e.g., background artifacts, class mixing, texture bias) to guide targeted improvements.
  • Open-sourcing and reproducibility: Public availability of code, pretrained models, and training logs for verification and extension is not specified.
  • Multi-modal extensions: Integration with text encoders and LLMs, and compatibility with instruction following or compositional prompts are unexplored.
  • Theoretical perspective on MNG: Formal connections to diffusion’s noise-conditioning and multi-scale learning, or guarantees about layer utilization and gradient signal routing, are not provided.

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that can leverage the paper’s findings (GAT: transformer-only GANs trained in VAE latent space with multi-level noise guidance and width-aware learning-rate scaling).

  • Low-latency image generation for creative production
    • Sectors: media/entertainment, advertising, e-commerce, design
    • Tools/workflows: single-step, class-conditional image generators for rapid A/B testing of product shots, ad creatives, thumbnails; plug-ins for Adobe/Blender/Figma to generate and iterate images in real time
    • Assumptions/dependencies: access to a pretrained SD-VAE tokenizer; domain-appropriate fine-tuning data (labels for class-conditional); content safety filters and licensing for training data; GPU for best quality/latency; quality currently validated at 256×256
  • Interactive latent-space editing and interpolation tools
    • Sectors: creative software, gaming, AR/VR prototyping
    • Tools/workflows: UI sliders for semantic style/content manipulation (interpolation between latent codes), batch style transfer for design exploration
    • Assumptions/dependencies: reliance on GAN’s semantic latent space; requires UX integration and light fine-tuning for domain semantics
  • Cost- and energy-efficient image synthesis in production systems
    • Sectors: cloud platforms, MLOps, energy-conscious AI deployments
    • Tools/workflows: replacing multi-step diffusion components with one-step GAT modules in services where speed/throughput dominates; autoscaling microservices with per-model LR scaling and small-batch training
    • Assumptions/dependencies: integration with existing inference serving; monitoring for GAN failure modes; model selection (S/B/L/XL) matched to latency and quality requirements
  • Rapid synthetic data augmentation for vision classifiers
    • Sectors: retail (product recognition), manufacturing (defect detection), agriculture, remote sensing
    • Tools/workflows: class-conditional augmentation pipelines to balance long-tail classes or stress-test models; mix real and GAT-generated samples to improve robustness
    • Assumptions/dependencies: careful validation against domain shift; bias auditing; label availability for class-conditional training; SD-VAE suitability for domain
  • Single-GPU/edge-friendly image generation with small GAT variants
    • Sectors: mobile/edge devices, on-prem solutions with constrained compute
    • Tools/workflows: deploy GAT-S/B with quantization/graph optimizations for kiosks, retail endpoints, or design workstations; on-device prototyping apps
    • Assumptions/dependencies: model compression/quantization; optimized attention kernels; power/thermal limits; resolution constraints
  • Stable scaling recipes for existing GAN-based tasks
    • Sectors: super-resolution, image-to-image translation, face synthesis
    • Tools/workflows: adopt width-aware learning-rate rule to stabilize large transformer-based GANs; add multi-level noise-perturbed guidance to reactivate early layers
    • Assumptions/dependencies: adaptation of MNG to task-specific architectures; re-tuning noise schedules; compatibility with task losses
  • Discriminator-as-encoder for downstream perception
    • Sectors: visual search, QA of generated content, anomaly detection
    • Tools/workflows: reuse REPA-aligned discriminator features (aligned with DINOv2) as embeddings for similarity search or quality scoring
    • Assumptions/dependencies: training D with REPA; licensing/compliance for VFM (e.g., DINOv2); calibration for target tasks
  • Academic baseline for scaling studies in adversarial learning
    • Sectors: academia, research labs
    • Tools/workflows: reproducible S→XL scaling curves, GFLOPs–FID analyses, ablations (MNG, LR scaling, VFM alignment) as teachable benchmarks
    • Assumptions/dependencies: availability of code and SD-VAE; standardized evaluation (FID-5k/50k); compute for XL models (or smaller scales for coursework)
  • Procurement and sustainability guidance for ML ops
    • Sectors: policy/enterprise governance, sustainability teams
    • Tools/workflows: prefer single-step generators to reduce training epochs and inference NFEs; estimate carbon savings when swapping multi-step diffusion with GAT
    • Assumptions/dependencies: life-cycle analysis data; organizational readiness to measure/track compute and emissions

Long-Term Applications

These use cases are promising but require further research, scaling, adaptation to new modalities, higher resolutions, or stronger controls.

  • High-resolution and text-to-image extensions
    • Sectors: design, advertising, synthetic media platforms
    • Tools/workflows: extend class-conditional GAT to robust text conditioning (e.g., via T5/CLIP encoders) and to 512–4k resolutions; integrate guidance and compositional controls
    • Assumptions/dependencies: conditioning architecture changes; more data and compute; improved tokenizers for high-res fidelity; evaluation beyond FID (human/rating studies)
  • Real-time content generation on consumer devices
    • Sectors: mobile apps, AR filters, creator tools
    • Tools/workflows: compressed/quantized GAT variants running on NPUs/GPUs in phones or glasses; interactive photo creation/editing without cloud
    • Assumptions/dependencies: advanced model compression, attention acceleration, memory-aware architectures; power constraints; on-device safety filters
  • Video and multi-frame generation with temporal consistency
    • Sectors: advertising, pre-viz, gaming, telepresence
    • Tools/workflows: adapt MNG’s coarse-to-fine supervision across time to enforce temporal coherence; one-step or few-step video GANs in latent space
    • Assumptions/dependencies: temporal tokenization, recurrent/attention designs, temporal discriminators; large-scale video data; new stability recipes
  • 3D/NeRF and multi-view-consistent asset generation
    • Sectors: VFX, gaming, CAD/PLM, e-commerce 3D
    • Tools/workflows: GAT as a backbone for generating multi-view-consistent images or radiance fields; latent-space controls for shape/style interpolation
    • Assumptions/dependencies: 3D-consistency losses, multi-view data, geometry-aware discriminators, integration with 3D pipelines (NeRF/mesh)
  • Domain-specific synthetic data factories for regulated sectors
    • Sectors: healthcare (medical imaging), finance (document synthesis), autonomous driving (corner cases)
    • Tools/workflows: GAT-based synthetic datasets with rigorous bias/fairness controls, privacy safeguards, and uncertainty modeling; fast class-balanced scenario generation
    • Assumptions/dependencies: regulatory approval, clinical/industry validation, domain-specific tokenizers, robust OOD detection and audit trails
  • General-purpose visual encoders from adversarial training
    • Sectors: foundation model pretraining, search/retrieval
    • Tools/workflows: train discriminators with REPA or similar alignment at scale to produce strong, generalizable vision features; pretrain–finetune pipelines
    • Assumptions/dependencies: large curated datasets; careful alignment objectives; evidence of transfer superiority vs. existing VFMs
  • Synthetic content governance and provenance ecosystems
    • Sectors: policy, platforms, cybersecurity
    • Tools/workflows: watermarking and detection adapted to single-step GANs; provenance metadata pipelines; platform policies tuned to scalable, efficient generators
    • Assumptions/dependencies: robust watermarking for GAN outputs; standards adoption (e.g., C2PA); cooperative platform enforcement and legal frameworks
  • Energy- and cost-aware generative AI infrastructure
    • Sectors: cloud/edge providers, sustainability initiatives
    • Tools/workflows: scheduling and autoscaling tuned to one-step models; SLA tiers based on NFE and compute footprints; carbon-aware routing
    • Assumptions/dependencies: reliable telemetry for model energy; pricing models that incentivize low-NFE generation; shared benchmarks
  • Programmatic controllability for enterprise workflows
    • Sectors: marketing ops, e-commerce PIM, DAM systems
    • Tools/workflows: APIs exposing class-conditional or latent controls (brand palettes, layouts) to auto-generate on-brand variants; approval loops with human-in-the-loop
    • Assumptions/dependencies: robust conditioning interfaces, brand-safe datasets, governance for IP and content integrity
  • Cross-modal extensions (image–text–layout; image–audio)
    • Sectors: publishing, education, accessibility
    • Tools/workflows: GAT variants conditioned on layouts, captions, or audio cues for multimodal content assembly
    • Assumptions/dependencies: cross-modal tokenizers, multimodal discriminators, datasets with aligned annotations

Notes on feasibility and dependencies across applications

  • Core technical dependencies: SD-VAE (or equivalent) quality and license; access to labeled data for class-conditional training; availability of VFMs for REPA (e.g., DINOv2) with compatible licenses.
  • Compute: while one-step sampling and reduced epochs cut costs, XL-scale training still requires multi-GPU clusters; deployment on edge requires compression and kernel optimization.
  • Generalization: reported state-of-the-art is on ImageNet-256; transferring to higher resolutions, domains, or modalities requires further validation.
  • Safety/ethics: content filters, watermarking, and dataset governance are essential for responsible deployment, particularly in consumer and policy-sensitive contexts.

Glossary

  • AdaLN-zero: A transformer normalization variant used in diffusion models that initializes adaptive layer normalization parameters to zero to stabilize training. (Mentioned as the "AdaLN-zero layer" in the additional related works section.)
  • AdamW: An optimizer that combines Adam with decoupled weight decay for better generalization. "The optimizer is AdamW with $(\beta_1,\beta_2)=(0.0,\,0.99)$ (following common GAN practice such as StyleGAN)."
  • Adaptive learning rate: Adjusting the learning rate based on model scale to keep update magnitudes stable. "we use identical hyperparameters for every scale of models except the learning rate, which we adaptively modify as elaborated in Sec.~\ref{sec 2.4: adaptive lr} ."
  • Autoregressive models: Generative models that produce outputs sequentially, often token by token or patch by patch. "\multicolumn{4}{l}{autoregressive/masking}"
  • bfloat16 precision: A 16-bit floating-point format that balances range and precision for efficient training. "Also, we use a batch size of $512$, bfloat16 precision, gradient checkpointing, and PyTorch Scaled Dot-Product Attention~(SDPA) implementation."
  • Class-conditional generation: Generating images conditioned on class labels to control the output category. "We conduct all experiments with class-conditional generation on ImageNet~\citep{imagenet} at a resolution of 256×256."
  • Classifier-Free Guidance (CFG): A sampling technique that improves diffusion/flow model outputs by adjusting conditional/unconditional predictions. "Diffusion/flow entries are reported under CFG, when applicable."
  • CKNNA: A feature-alignment metric used to compare model representations against foundation models. "\subcaption{CKNNA of G and D}"
  • [CLS] token: A special transformer token used for classification tasks, appended to the input sequence. "a dedicated $[\text{cls}]$ token is appended to the sequence of visual tokens before the first transformer block."
  • Differentiable augmentation: Data augmentation methods implemented as differentiable operations within the training graph. "During training, we apply differentiable augmentation~\citep{diffaug}."
  • DINOv2: A vision foundation model providing strong image representations used for alignment. "Let $\phi(\cdot)$ be a frozen vision foundation model~(e.g., DINOv2~\citep{dinov2}),"
  • Equalized learning rate: A technique that scales parameter updates to be invariant to layer/channel size for stable training. "Our rule is conceptually related to the equalized learning rate~\citep{progan} used in conventional GANs,"
  • Exponential Moving Average (EMA): Smoothing of model parameters over training steps to stabilize and improve generation. "We apply exponential moving average~(EMA) to the generator with decay $0.999$."
  • FID (Fréchet Inception Distance): A metric that quantifies the quality and diversity of generated images by comparing feature distributions. "GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance~(FID of 2.96) on ImageNet-256 in just 40 epochs,"
  • GFLOPs: Giga Floating Point Operations, a measure of computational complexity for model forward passes. "Model complexity is commonly measured by GFLOPs."
  • GAT (Generative Adversarial Transformers): A GAN framework that uses transformers in VAE latent space for scalable image synthesis. "We introduce Generative Adversarial Transformers~(GAT), a transformer-based GAN framework at the latent space of VAE, for the first time."
  • GAN (Generative Adversarial Networks): A framework with a generator and discriminator trained adversarially to synthesize realistic data. "Generative Adversarial Networks~(GAN)~\citep{GAN} is an adversarial learning framework between two networks, the generator $G(z, c)$ and discriminator $D(I, c)$."
  • Gradient checkpointing: A memory-saving technique that trades compute for reduced activation storage by recomputing during backpropagation. "Also, we use a batch size of $512$, bfloat16 precision, gradient checkpointing, and PyTorch Scaled Dot-Product Attention~(SDPA) implementation."
  • Layerscale: A per-layer learnable scaling mechanism that stabilizes deep transformer training by modulating block outputs. "with Layerscale applied to the output of each transformer block."
  • Latent interpolation: Smoothly transitioning between latent codes to demonstrate semantic continuity in the generator’s latent space. "while keeping the characteristics of GANs such as latent interpolation~(bottom two rows)."
  • Latent-space guidance: A guidance technique operating entirely within the GAN’s latent/style space to steer outputs with negligible overhead. "we employ latent-space guidance~\citep{gandance} with a strength of 1.1, applied to the first 30\% of transformer blocks."
  • LPIPS: A perceptual similarity metric based on deep features that correlates with human judgment. "LPIPS distances while ablating Transformer blocks one by one."
  • Mapping network: An MLP that transforms latent code and condition into style vectors used to modulate generator features. "we employ a mapping network, a simple MLP, that generates a style vector $w$ from $z$ and $c$."
  • MNG (Multi-level Noise-perturbed image Guidance): A training strategy that supervises generator intermediate outputs at multiple noise levels to activate early layers. "To this end, we propose the Multi-level Noise-perturbed image Guidance~(MNG) strategy for training GANs."
  • NFE (Number of Function Evaluation): The number of model evaluations required during sampling; lower NFE implies faster generation. "(Left) 1 or 2 Number of Function Evaluation~(NFE) generative models."
  • PCA (Principal Component Analysis): A dimensionality reduction technique used here to visualize intermediate features across layers. "we visualize intermediate features for each transformer block using PCA."
  • Patchify/unpatchify layer: Operations to convert images to token sequences (patchify) and back to images (unpatchify) for ViT-based models. "we remove the patchify layer and instead introduce an unpatchify layer"
  • Projection discriminator: A discriminator that incorporates class conditioning via inner products with class embeddings. "For class conditioning of discriminator, we use the projection discriminator~\citep{projection-disc}."
  • qk-normalization: A normalization technique applied to query and key tensors in attention to stabilize training. "Rotary Positional Embeddings~(RoPE)~\citep{rope}, SwiGLU-FFN~\citep{swigluffn}, and qk-normalization."
  • REPA: A representation alignment objective that aligns discriminator tokens with features from a vision foundation model. "The REPA objective substantially improves performance, indicating that advances from diffusion models can transfer effectively to GAT."
  • Relativistic pairing loss: An adversarial loss where the discriminator compares real and fake logits relativistically to enhance stability. "we deploy relativistic pairing loss~\citep{rpgan} with the approximated version of two-sided gradient penalty~\citep{seaweed-apt}, following R3GAN~\citep{r3gan}."
  • RMSNorm: A normalization layer that scales activations by their root mean square without a bias term. "Since we adopt RMSNorm, the shift parameter is omitted."
  • RoPE (Rotary Positional Embeddings): A positional encoding method that injects rotation-based position information into attention. "Rotary Positional Embeddings~(RoPE)~\citep{rope}, SwiGLU-FFN~\citep{swigluffn}, and qk-normalization."
  • SDPA (Scaled Dot-Product Attention): The attention computation used in transformers, here via PyTorch’s optimized implementation. "PyTorch Scaled Dot-Product Attention~(SDPA) implementation."
  • SD-VAE (Stable Diffusion VAE): The variational autoencoder from Stable Diffusion used as a tokenizer to map images to latent space. "we employ the pre-trained Stable Diffusion variational autoencoder~(SD-VAE)~\citep{LDM} as a tokenizer"
  • Style vector: A vector derived from latent code and condition that modulates transformer features via adaptive normalization. "This style $w$ is then used to modulate features through adaptive normalization and Layerscale"
  • SwiGLU-FFN: A feed-forward network variant using SwiGLU activation for improved transformer performance. "Rotary Positional Embeddings~(RoPE)~\citep{rope}, SwiGLU-FFN~\citep{swigluffn}, and qk-normalization."
  • Truncation trick: A sampling technique that limits latent deviations to improve sample quality at the cost of diversity. "evaluated without the truncation trick or guidance~\citep{gandance}, unless specified."
  • Two-sided gradient penalty: A regularization that penalizes discriminator sensitivity by comparing outputs on perturbed real and fake inputs. "we deploy relativistic pairing loss~\citep{rpgan} with the approximated version of two-sided gradient penalty~\citep{seaweed-apt},"
  • VAE latent space: A lower-dimensional representation learned by a variational autoencoder where generative models can be trained efficiently. "we build GAT on the latent space of VAE~\citep{LDM}, following the recent advances in generative models"
  • VFM (Vision Foundation Models): Powerful pretrained vision models whose features are used for representation alignment. "Vision Foundation Models~(VFM)"
  • ViT (Vision Transformer): A transformer architecture for images that operates on patch tokens rather than convolutional features. "Our generator adopts a standard Vision Transformer~(ViT) architecture"
  • Width-aware learning-rate adjustment: Scaling the learning rate inversely with channel width to keep output changes consistent across model sizes. "Accordingly, we provide simple and scale-friendly solutions as lightweight intermediate supervision and width-aware learning-rate adjustment."

Open Problems

We found no open problems mentioned in this paper.
