
Registers Matter for Pixel-Space Diffusion Transformers

Published 15 May 2026 in cs.CV | (2605.16147v1)

Abstract: Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by register tokens. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance. Motivated by these insights, we investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves pixel-space generation quality with negligible runtime overhead.

Summary

  • The paper demonstrates that register tokens significantly lower feature norm variability and improve FID scores by acting as norm sinks and aggregating global information.
  • The study reveals that registers are most effective in pixel-space settings when introduced in later layers and in larger numbers, with a dual-stream architecture yielding notable performance gains at minimal computational cost.
  • The work emphasizes that specialized, register-aware designs enhance intermediate feature interpretability and robustness across early and mid-stage denoising processes.

Register Tokens in Pixel-Space Diffusion Transformers: Role, Mechanisms, and Architectural Implications

Overview

"Registers Matter for Pixel-Space Diffusion Transformers" (2605.16147) analyzes the function of register tokens in Diffusion Transformers (DiTs), particularly in the pixel-space setting. Register tokens were originally introduced in Vision Transformers (ViTs) to mitigate high-norm outlier artifacts, in particular to regularize attention distributions and improve feature map interpretability, but the motivation, efficacy, and internal dynamics of register usage in DiTs had not been systematically characterized prior to this work. The paper establishes that, despite a key difference from the ViT setting (the absence of patch-token outliers in DiTs), register tokens significantly enhance both convergence and final generation quality for pixel-space DiTs. Mechanistic analysis shows that these benefits arise through phenomena distinct from the ViT case, including norm sinking and global information aggregation, which produce cleaner and more structured intermediate features, especially in the early and mid denoising stages. These insights motivate an efficient dual-stream DiT architecture with register-aware specialization, which yields consistent FID improvements at minimal computational cost.

Register Tokens in DiTs vs. ViTs

A primary empirical result is that, unlike ViTs, neither latent-space nor pixel-space DiTs intrinsically generate high-norm patch-token outliers. In ViTs, high-norm outliers (typically in low-information/background regions) motivate the deployment of registers to 'sink' global information and prevent these artifacts from contaminating the patch representations. In DiTs without registers, by contrast, attention maps and token feature distributions are near-uniform and lack the low-information artifacts found in ViT models.

Despite this, adding registers to DiTs induces the emergence of high-norm tokens, now contained within the registers themselves. Surprisingly, this addition consistently and substantially reduces FID (for example, FID improves from 7.39 to 5.30 in base pDiT models on ImageNet 256×256). The gains are concentrated in pixel-space settings, with minimal or even negative effects in some latent-space models.

Mechanistic Insights: Smoothing and Information Specialization

Analysis of intermediate representations reveals that register tokens serve two principal roles in pixel-space DiTs:

  1. Norm Sinking: A subset of register tokens absorbs magnitude (i.e., these tokens become high-norm outliers), thereby reducing the feature norms of all patch tokens. This norm reduction regularizes local variability in the spatial features, resulting in smoother and more spatially coherent intermediate feature maps.
  2. Global Information Aggregation: Other register tokens, as probed via class linear separability, encode diverse semantic and global properties of the input image, attending to distinctive semantic regions. The effect is most pronounced at high-noise diffusion steps (t ∈ [0, 0.2]), which are especially critical for flow-matched models in pixel space.
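
The norm-sinking role described above can be checked with a simple per-token norm report. The following is a minimal sketch, not the paper's analysis code: the function name `token_norm_report`, the `sink_factor` threshold, and the convention that registers occupy the last rows of the token matrix are all assumptions made for illustration, and the input is synthetic.

```python
import numpy as np

def token_norm_report(tokens, num_registers, sink_factor=3.0):
    """Summarize L2 norms of patch vs. register tokens.

    tokens: array of shape (num_patches + num_registers, dim); registers are
    assumed (hypothetically) to be the last `num_registers` rows, with
    num_registers >= 1.
    """
    norms = np.linalg.norm(tokens, axis=-1)
    patch_norms = norms[:-num_registers]
    reg_norms = norms[-num_registers:]
    # Heuristic (an assumption): a register counts as a "norm sink" when its
    # norm exceeds sink_factor times the median patch-token norm.
    sink_mask = reg_norms > sink_factor * np.median(patch_norms)
    return {
        "patch_norm_mean": float(patch_norms.mean()),
        "patch_norm_std": float(patch_norms.std()),
        "register_norms": reg_norms,
        "sink_registers": np.flatnonzero(sink_mask),
    }

# Synthetic example: 256 patch tokens, 4 registers, one register scaled up.
rng = np.random.default_rng(0)
patches = rng.normal(size=(256, 64))
registers = rng.normal(size=(4, 64))
registers[1] *= 10.0  # synthetic high-norm register
report = token_norm_report(np.concatenate([patches, registers]), num_registers=4)
print(report["sink_registers"])
```

In practice the same report would be computed on intermediate activations at several layers and timesteps; registers flagged as sinks could then be compared against registers that score well under linear probing.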

Total Variation (TV) and correlation decay metrics further quantify the smoother spatial organization induced by registers, a phenomenon not observed in self-supervised ViTs with registers. This suggests that the benefit of registers in DiTs is not limited to absorbing outliers; more fundamentally, registers regularize the propagation of high-dimensional, noisy intermediate signals, especially in pixel space, where intrinsic noise is higher.
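
To make the TV metric concrete, here is a minimal sketch of one common (anisotropic) total variation measure applied to a grid of patch-token features. The exact TV definition used in the paper is not given in this summary, so the formula, the function name `total_variation`, and the synthetic inputs are assumptions. The 16×16 grid corresponds to a 256×256 image with patch size 16.

```python
import numpy as np

def total_variation(tokens, height, width):
    """Anisotropic TV of a patch-token feature map.

    tokens: array of shape (height * width, dim), in row-major patch order.
    Returns the mean absolute difference between vertically adjacent patches
    plus the mean absolute difference between horizontally adjacent patches.
    """
    fmap = tokens.reshape(height, width, -1)
    dh = np.abs(fmap[1:, :, :] - fmap[:-1, :, :]).mean()
    dw = np.abs(fmap[:, 1:, :] - fmap[:, :-1, :]).mean()
    return float(dh + dw)

rng = np.random.default_rng(0)
# Spatially uncorrelated features: high TV.
noisy = rng.normal(size=(16 * 16, 8))
# Smooth ramp over the grid, copied into 8 channels: low TV.
ramp = np.linspace(0.0, 1.0, 16)
smooth = np.repeat(np.add.outer(ramp, ramp).reshape(-1, 1), 8, axis=1)

tv_noisy = total_variation(noisy, 16, 16)
tv_smooth = total_variation(smooth, 16, 16)
print(tv_noisy > tv_smooth)
```

Lower values indicate smoother feature maps, so comparing this quantity for models with and without registers, at matched layers and noise levels, is one way to reproduce the kind of comparison described above.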

Registers in the Broader Context of DiT Architectures

The analysis extends to contemporary architectural trends in DiTs, such as models that exploit in-context conditioning by appending auxiliary class or text tokens. The paper demonstrates that these in-context tokens functionally act as implicit registers, exhibiting similar norm sinking and global semantics behaviors. Empirically, most of the generation quality gains previously attributed to explicit class conditioning can be traced to this implicit register-like effect, rather than the mere inclusion of class information per se.

Furthermore, ablation studies clarify two critical design insights for register deployment:

  • Registers are only beneficial if introduced after several initial transformer layers (after layer 4 in pDiT-B), unlike in ViTs where early introduction is effective. Early-layer registers lack semantic structure and act as ineffective norm sinks, degrading overall performance.
  • Pixel-space DiTs require substantially more register tokens (e.g., 32) to reach optimality compared to ViTs, reflecting the higher complexity and variability in pixel-level generative modeling.
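
The two design choices above (late insertion, larger register count) can be sketched as a forward pass in which registers join the token sequence only from a chosen layer onward. This is an illustrative toy, not the paper's implementation: `toy_block` is a stand-in for a real transformer block, the names `forward_with_late_registers` and `start_layer` are hypothetical, and discarding the registers before the output is assumed here as the usual register-token convention.

```python
import numpy as np

def toy_block(x, w):
    """Stand-in for a transformer block: single-head self-attention + residual."""
    scores = (x @ w) @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return x + attn @ x

def forward_with_late_registers(patches, block_weights, registers, start_layer):
    """Run patch tokens through all blocks; append registers at `start_layer`."""
    x = patches
    num_patches = patches.shape[0]
    for i, w in enumerate(block_weights):
        if i == start_layer:
            # Layers before start_layer see patch tokens only.
            x = np.concatenate([x, registers], axis=0)
        x = toy_block(x, w)
    # Assumption: registers are dropped and only patch tokens are decoded.
    return x[:num_patches]

rng = np.random.default_rng(0)
dim, depth = 32, 12
patches = rng.normal(size=(256, dim))          # 16x16 patch grid
registers = rng.normal(size=(32, dim)) * 0.02  # 32 register tokens
weights = [rng.normal(size=(dim, dim)) * 0.05 for _ in range(depth)]

out = forward_with_late_registers(patches, weights, registers, start_layer=4)
print(out.shape)  # (256, 32)
```

With `start_layer=4` and zero-based indexing, layers 0-3 process patches alone and layers 4-11 process patches plus registers; sweeping `start_layer` and the register count is the kind of ablation the bullets describe.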

Dual-Stream Architecture: Efficient Specialization for Registers

Capitalizing on the distinct roles of registers and patch tokens, the authors develop and validate a parameter-efficient dual-stream DiT architecture. In this design, key transformer components (MLP, adaLN, RMSNorm) are decoupled for register vs. patch tokens, with parameter sharing or LoRA-based specialization employed for efficiency. Registers are only inserted in deeper blocks (as motivated empirically), and the architecture increases parameters by ~14% with no significant runtime penalty (in GFLOPs).

This dual-stream approach further improves FID across multiple model and resolution scales (e.g., FID drops from 3.71 to 3.41 on ImageNet 256×256 for base models) and remains effective when combined with recent representational alignment techniques (such as PixelREPA).
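
One way to picture the LoRA-based specialization is a linear layer whose weight is shared by both streams, with register tokens receiving an additional low-rank update. The sketch below is a hedged illustration only: the class name `DualStreamLinear`, the rank, the initialization, and the masking scheme are assumptions, and the resulting parameter overhead is a property of this toy, not a reproduction of the figure reported in the paper.

```python
import numpy as np

class DualStreamLinear:
    """Shared weight for patch tokens; register tokens add a low-rank delta."""

    def __init__(self, dim_in, dim_out, rank, rng):
        self.w = rng.normal(size=(dim_in, dim_out)) / np.sqrt(dim_in)
        self.a = rng.normal(size=(dim_in, rank)) / np.sqrt(dim_in)
        # Zero init (assumed, LoRA-style): both streams start out identical.
        self.b = np.zeros((rank, dim_out))

    def __call__(self, x, is_register):
        shared = x @ self.w
        # Computed for every token and masked, for clarity; a real
        # implementation would route only register tokens through this path.
        delta = (x @ self.a) @ self.b
        return shared + is_register[:, None] * delta

    def extra_param_fraction(self):
        return (self.a.size + self.b.size) / self.w.size

rng = np.random.default_rng(0)
layer = DualStreamLinear(dim_in=64, dim_out=64, rank=4, rng=rng)
x = rng.normal(size=(256 + 32, 64))  # 256 patch tokens followed by 32 registers
is_register = np.zeros(288)
is_register[256:] = 1.0

y_before = layer(x, is_register)
layer.b = rng.normal(size=layer.b.shape) * 0.1  # pretend training updated b
y_after = layer(x, is_register)
print(layer.extra_param_fraction())  # 0.125
```

Changing the low-rank factors alters only the register rows of the output, which is the sense in which the register stream is specialized while attention (not shown) would still mix both token types.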

Practical and Theoretical Implications

The results carry several implications:

  • For DiT Training: Registers should be included as standard practice in pixel-space and potentially high-noise or high-dimensional DiT setups, introduced only after sufficient early representation learning has occurred, and in sufficient quantity to absorb and specialize global information.
  • For DiT Architecture Design: Specialized register-aware parameterization is beneficial and scales efficiently. Dual-stream or otherwise specialized handling of heterogeneous token types (registers, text/class, etc.) may become foundational in future state-of-the-art generative models.
  • For Broader Transformer Research: The phenomenon of global information sinking and specialization of tokens appears repeatedly across modalities—vision, text, video, and multimodal generative models—inviting further unified theoretical treatment.
  • For Interpretability and Robustness: Register tokens improve not only numerical metrics (e.g., FID) but also internal feature structure and interpretability, reinforcing the utility of architectural interventions for model robustness and transparency.

Conclusion

This study rigorously establishes that register tokens play a vital, mechanistically unique role in pixel-space Diffusion Transformers, despite the absence of the outlier artifacts that originally motivated them in ViTs. Their introduction consistently improves pixel-space DiT representations by regularizing spatial features and aggregating global semantic information. The dual-stream architectures proposed to support register specialization demonstrate that further, parameter-efficient advances are feasible. These findings set a new standard for DiT architectural design and open several avenues for deeper theoretical understanding and practical innovation in transformer-based generative models.


Explain it Like I'm 14

What is this paper about?

This paper studies a special kind of “extra tokens” called register tokens inside image-generating AI models that use transformers. The authors ask: if adding register tokens helps Vision Transformers (used to understand images), will it also help Diffusion Transformers (used to create images), especially when these models draw directly in pixel space?

Key terms explained

  • Transformer: A computer model that reads data in pieces and lets each piece “pay attention” to other pieces to understand or create something.
  • Token: One small piece of the input. For images, this is usually a patch (like a tile in a mosaic).
  • Register tokens: Extra learnable tokens added to the model. Think of them as special sticky notes the model can use for big-picture info or to absorb noise.
  • Diffusion model: A model that starts with noise and learns to remove it step-by-step to produce a clean image.
  • Pixel space vs. latent space:
    • Pixel space: The model works directly on full-resolution pixels (like painting the final picture).
    • Latent space: The model works in a smaller, compressed space (like drawing a thumbnail sketch first).
  • Feature norm: A number that shows how “loud” or “strong” a token’s features are. Very high norms mean a token is shouting.
  • Attention map: A visualization of which parts (tokens) focus on which other parts.
  • FID (Fréchet Inception Distance): A score that measures how realistic generated images look. Lower is better.
  • Total Variation (TV): A measure of how smooth an image or feature map is. Lower TV means less bumpiness.

What questions did the researchers ask?

  1. Do Diffusion Transformers (DiTs) have the same “loud token” problem as Vision Transformers (ViTs), where a few patch tokens become outliers with unusually high norms?
  2. Even if DiTs don’t have that problem, could register tokens still help these models generate better images—especially when trained in pixel space?
  3. If register tokens help, how exactly do they help? What role do they play inside the model?
  4. Are existing DiT designs already using something like register tokens without calling them that?
  5. Can we design a smarter architecture that treats register tokens and patch tokens differently to improve image quality efficiently?

How did they study it?

  • They trained Diffusion Transformers on ImageNet (a large image dataset) to generate 256×256 images using a method called flow matching (a way to guide the denoising steps).
  • They compared models:
    • Without register tokens.
    • With register tokens added.
    • With “in-context” class tokens (extra tokens used for class labels), which they suspected behave like register tokens.
  • They looked at:
    • Attention maps to see if tokens latch onto unhelpful regions.
    • Token norms to check for “shouting” tokens.
    • FID scores to compare image quality.
    • TV values of intermediate features to see if features become smoother.
    • Linear probing (a simple classifier) on register tokens to see whether those tokens carry meaningful global information about the image or mostly act as “norm sinks” (sponges for magnitude).

What did they find and why is it important?

Here are the main results:

  • DiTs don’t have the patch-token outlier problem:
    • Unlike ViTs, DiTs’ patch tokens mostly have similar norms and clean attention. They don’t focus on empty backgrounds or produce noisy artifacts.
  • Register tokens still help DiTs in pixel space:
    • Even without outlier patch tokens, adding register tokens significantly improves image quality (lower FID) across different model sizes.
    • In pixel space, registers consistently reduce the norms of image patch tokens, making features calmer and smoother, especially at early, high-noise steps. This likely helps the model build the main content of the image more reliably.
  • Registers play two roles:
    • Some registers are “norm sinks” with very high norms but low classification accuracy—they mostly absorb magnitude from patch tokens.
    • Other registers carry global semantic information with high classification accuracy—they summarize what’s in the image (like “there’s a bird on a branch”).
  • Pixel space benefits the most:
    • Register tokens give the biggest gains in pixel-space models.
    • In latent-space models, gains are smaller or can even drop (depending on the setup), likely because latent spaces are already smoother and simpler.
  • Better to add registers later and add more of them:
    • Registers work best when introduced in deeper layers (not from the very first layer).
    • DiTs benefit from more registers than ViTs (e.g., dozens rather than only a few).
  • “In-context” class tokens act like registers:
    • Extra class tokens used in some recent pixel-space DiTs behave similarly to registers: some become norm sinks, others carry broad global info. Much of their improvement seems to come from this register-like behavior rather than class labels alone.
  • A smarter, dual-stream design helps:
    • Treat register tokens and patch tokens as two streams that share attention but have lightly specialized components.
    • This “dual-stream” setup improves image quality with only about a 14% increase in parameters and no significant extra compute (about the same GFLOPs).

Why it matters:

  • These insights explain why recent pixel-space transformer models for image generation work so well.
  • They show a simple, effective way—register tokens—to stabilize and improve pixel-space training, which is typically harder.
  • The dual-stream approach gives a practical blueprint to get better images efficiently.

What’s the potential impact?

  • Better image generators: Using register tokens, especially in pixel-space Diffusion Transformers, can make models generate cleaner, more realistic images faster and more reliably.
  • Smarter model designs: Treating special tokens differently from patch tokens (dual-stream) could become a standard practice, improving quality without making models much heavier.
  • Clearer training recipes: Add registers in deeper layers, use more of them than in ViTs, and recognize that “in-context” tokens often act as registers. This can help researchers and engineers build stronger image models.
  • Broader lessons: The idea of tokens that absorb noise or carry global summaries might apply to other generative models (like text-to-image or video), guiding better architectures across AI fields.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of unresolved issues and concrete directions that the paper leaves open for future research:

  • Mechanistic causality of registers’ benefits
    • Establish causal evidence that norm reduction and increased spatial smoothness at high noise directly drive FID gains (e.g., intervene on register tokens’ norms via scaling/clamping; freeze or drop subsets of registers during training/inference; perturb TV via auxiliary losses to test effect on quality).
    • Validate the hypothesis that “all patch tokens participate in the loss” prevents patch outliers by training variants where only a subset of patch tokens contributes to the objective to see if patch outliers emerge and whether registers then become redundant.
  • Why registers degrade performance in RAE space
    • Diagnose the failure mode in RAE-based models: compare per-layer norms, TV, and probing across spaces; test whether RAE encoders already compress away information that registers would absorb; try register-aware modifications of the autoencoder or loss to reconcile the mismatch.
  • Early-layer ineffectiveness of registers
    • Test alternative insertion strategies for early layers (e.g., cross-attention-only exposure, gating, or delayed residual injection) and measure whether richer low-level targets (auxiliary edge/texture losses) enable early registers to carry useful information.
    • Quantify how much semantic structure is present at each layer/timestep and whether boosting it (e.g., REPA/iREPA-like spatial losses early on) makes early registers effective.
  • Scaling laws for register count and placement
    • Determine how the optimal number of register tokens scales with model width/depth, image resolution, patch size, and sequence length; produce empirical scaling laws and heuristics for placement (start/end layers) that generalize across settings.
  • Interaction with normalization and massive activations
    • Systematically compare RMSNorm vs LayerNorm, PreNorm vs PostNorm, and value/attention gating on the emergence of register sinks, channel-wise massive activations, and downstream quality.
    • Measure channel-dimension outliers (kurtosis, heavy tails) with/without registers and evaluate whether combining register tokens with massive-activation modulation yields additive gains.
  • Objective and sampler dependence
    • Test registers under different training targets and frameworks (ε-/v-prediction, EDM/score matching vs flow matching), timesteps schedules, and samplers; assess whether register benefits are robust to one-step/few-step samplers and different guidance schemes (e.g., classifier-free guidance scales).
  • Generalization across datasets, tasks, and modalities
    • Evaluate on diverse datasets (COCO, LAION subsets, non-natural images), higher resolutions (>512), and non-class-conditional or unconditional setups to verify external validity.
    • Assess utility in conditional tasks beyond class labels (text-to-image, layout/segmentation/inpainting) with controlled experiments that disentangle semantic conditioning from register-like effects (e.g., masking or shuffling conditioning content while keeping token slots).
  • Registers vs in-context tokens: disentangling roles
    • Run controlled studies where in-context tokens are contentless (noise or constant vectors) yet occupy the same slots to quantify how much of JiT’s gain is register-like vs genuinely semantic.
    • For text-to-image, explicitly add “pure” register tokens alongside text tokens and measure complementarity, interference, and role specialization.
  • Quantitative evaluation breadth and reliability
    • Go beyond FID to include precision/recall, density/coverage, CLIP score, diversity metrics, and human studies; report confidence intervals over multiple seeds to assess statistical significance.
    • Analyze convergence dynamics (time/steps-to-target-FID) to substantiate the claim that registers improve convergence.
  • Dual-stream design space and efficiency
    • Examine additional dualization choices (e.g., selective decoupling per head, attention output-only decoupling, token-wise expert routing) and LoRA rank/placement sensitivity; report wall-clock time, memory, and activation checkpointing costs, not just GFLOPs/params.
    • Validate the compact dual-stream design at larger scales and with longer auxiliary sequences (e.g., many text tokens) to confirm scalability and stability.
  • Register training strategies and regularization
    • Explore initialization schemes (e.g., pretrained encoder features vs random), weight tying across layers, dropout/stochastic depth on registers, or auxiliary losses (entropy, information bottleneck, decorrelation) to control sink vs semantic roles.
    • Test robustness to register ablations at inference (token dropout, reordering) to probe redundancy and failure modes.
  • Theoretical understanding
    • Develop a theoretical account linking diffusion dynamics, token-wise loss participation, and the emergence of sink tokens; model how registers alter information flow and gradient routing at high noise levels.
  • Robustness and safety
    • Investigate whether registers increase memorization risks or sensitivity to adversarial/noisy inputs; evaluate OOD robustness and failure patterns when registers are perturbed.
  • Practical deployment considerations
    • Quantify memory footprint and latency impacts from adding many registers, especially at high resolutions/long sequences; identify minimal effective configurations under tight resource budgets.
  • Extensions to video and 3D
    • Test whether register behavior and dual-stream benefits transfer to spatiotemporal (video) and 3D diffusion transformers, including the role of registers in stabilizing long-horizon generation without attention sinks.

Practical Applications

Immediate Applications

The following use cases can be deployed now by teams working with diffusion transformers, especially pixel-space DiTs trained with flow matching.

  • Production image generation quality boosts at constant inference cost
    • Sectors: software, media/entertainment, e-commerce, design/advertising
    • What to do: add register tokens (introduced from deeper layers only, e.g., after layer 4; use a larger count such as ~32 for base models) to existing pixel-space DiTs; or switch to the compact dual-stream design (dualize adaLN/MLP/RMSNorm; keep Attention shared).
    • Expected outcome: better FID/visual coherence and smoother spatial structure, with negligible runtime overhead (same GFLOPs; ~14% parameter increase).
    • Assumptions/dependencies: strongest gains in pixel space; validate on your data/domain beyond ImageNet 256–512; requires fine-tuning or retraining to learn registers.
  • Drop-in improvement for in-context conditioning pipelines
    • Sectors: software, multimodal generation tools
    • What to do: treat duplicated class/text tokens as implicit registers; deploy them in deeper layers and at adequate token counts to exploit “norm sink + global info” behavior.
    • Expected outcome: recover most of the quality gain attributed to in-context tokens even without extra class content; additional conditioning still helps on top.
    • Assumptions/dependencies: architecture similar to JiT/MM-DiT-style token appending; benefits attributable to register-like dynamics, not just label/text content.
  • Training stabilization and faster convergence for pixel-space DiTs
    • Sectors: AI/ML engineering, foundation model training
    • What to do: introduce registers during training to reduce patch-token norms and improve feature smoothness at high-noise timesteps (t ∈ [0, 0.2]).
    • Expected outcome: smoother intermediate features, easier optimization in high-dimensional pixel space, improved early-epoch FID.
    • Assumptions/dependencies: effect is tied to high-noise stages in flow/diffusion objectives; verify convergence speed improvements in your setup.
  • Architecture upgrade kits for existing codebases
    • Sectors: software tooling, open-source frameworks
    • What to do: ship a “register-aware head” or “compact dual-stream” plugin (LoRA branches on adaLN and Attention for register tokens; dual MLP outputs; separate RMSNorm) for PyTorch/Transformers/Diffusers-style stacks.
    • Expected outcome: practical, low-effort quality lift for production models; straightforward ablations (number/placement of registers) as part of model config.
    • Assumptions/dependencies: minor engineering to split token streams and manage parameter routing; retraining or fine-tuning still required.
  • Quality and stability gains for synthetic data generation
    • Sectors: robotics/simulation, autonomous driving, retail/e-commerce search, medical imaging research
    • What to do: use register-enhanced pixel-space DiTs to generate higher-coherence images for data augmentation and simulation.
    • Expected outcome: improved utility of synthetic data (fewer artifacts, better spatial coherence).
    • Assumptions/dependencies: domain shift must be addressed; for regulated domains (e.g., healthcare), perform rigorous validation.
  • Training-time diagnostics and monitoring
    • Sectors: MLOps, research labs
    • What to do: track token norm distributions and total variation (TV) of intermediate feature maps across timesteps to detect unhealthy dynamics; ensure registers are acting as norm sinks/global carriers.
    • Expected outcome: earlier detection of degenerate behavior, targeted hyperparameter/placement adjustments (e.g., move register start layer deeper).
    • Assumptions/dependencies: requires minor instrumentation; norms/TV are task-agnostic proxies and should be complemented with validation metrics.
  • Parameter-efficient quality scaling for cost-sensitive deployments
    • Sectors: cloud/AI infrastructure, startups
    • What to do: adopt compact dual-stream design to extract extra quality without increasing inference compute.
    • Expected outcome: better price–performance on GPUs/accelerators; helpful when GFLOPs budgets are fixed.
    • Assumptions/dependencies: parameter growth is small (~14%); memory footprint must still fit serving constraints.
  • Better ablation protocols and baselines for academic comparisons
    • Sectors: academia
    • What to do: include “register vs no-register” and “deep-only register placement” baselines in DiT papers; report TV and token-norm profiles alongside FID/IS.
    • Expected outcome: fairer, more interpretable comparisons; clarity on whether pixel-space gains stem from register-like mechanisms.
    • Assumptions/dependencies: standardized reporting improves reproducibility; applies primarily to pixel-space settings.

Long-Term Applications

The following use cases require additional research, scaling experiments, or system integration beyond what is directly shown in the paper.

  • Register-aware control and editing handles
    • Sectors: creative tools, interactive design, VFX
    • Idea: directly steer “global information” registers (vs norm-sink registers) to control style, layout, or background/foreground emphasis.
    • Potential product: a UI exposing sliders/constraints bound to specific register tokens.
    • Dependencies: reliable mapping from registers to semantic controls; robust disentanglement under diverse prompts/data.
  • Safer and more interpretable generative systems
    • Sectors: policy, responsible AI, enterprise compliance
    • Idea: use register token probes/attention maps to audit global semantics and detect anomalous activations; throttle norm sinks to prevent instabilities.
    • Potential workflow: compliance dashboards that monitor register behaviors across datasets and releases.
    • Dependencies: standardized interpretability metrics; domain-specific acceptance criteria.
  • Generalization to text-to-image, video, and multimodal DiTs at scale
    • Sectors: media, gaming, simulation, education
    • Idea: extend dual-stream specialization to large appended sequences (text, audio, video context), where benefits may grow with token count.
    • Potential product: next-gen MM-DiT variants with register-aware parameterization for higher resolution/longer context.
    • Dependencies: large-scale training; careful evaluation of how text tokens split into semantic carriers vs sinks.
  • AutoML for register scheduling and sizing
    • Sectors: ML platforms, AutoML tooling
    • Idea: automatically search the number of registers, start/end layers, and dualized components to maximize quality under parameter budgets.
    • Potential tool: “RegisterTuner” that optimizes architecture knobs based on validation FID and TV/norm criteria.
    • Dependencies: compute for search; transferability across datasets and resolutions.
  • Curriculum learning and representation alignment synergy
    • Sectors: research, model training providers
    • Idea: jointly schedule REPA/iREPA-style alignment with register introduction to prioritize spatial structure early and global semantics later.
    • Potential workflow: staged training that first stabilizes high-noise phases, then enriches semantic registers.
    • Dependencies: robust training curricula; ablations across objectives and noise schedules.
  • Hardware/runtime co-design for token-type specialization
    • Sectors: AI hardware, compiler/runtime vendors
    • Idea: exploit small register-token subsets for memory locality and parameter-routing optimizations; prioritize low-latency paths for register streams.
    • Potential product: kernels/runtime passes that branch compute by token type without hurting attention throughput.
    • Dependencies: token-type annotations at runtime; kernel support for mixed parameterization.
  • Domain-specific pixel-space generators (medical, satellite, fashion)
    • Sectors: healthcare, geospatial, retail/fashion
    • Idea: tailor register roles to domain priors (e.g., background anatomy vs lesions; land cover vs artifacts; textures vs silhouettes).
    • Potential product: domain-tuned DiTs with register probes for quality assurance and control.
    • Dependencies: curated datasets; rigorous domain validation; regulatory approval where applicable.
  • Cross-domain extension to audio and diffusion LLMs
    • Sectors: audio synthesis, speech, NLP
    • Idea: add register/sink tokens to stabilize high-noise or autoregressive unmasking phases (building on moving-sink insights in DLMs).
    • Potential product: “sink-stabilized” diffusion LMs/audio models with better long-horizon consistency.
    • Dependencies: task-specific adaptation; evaluation beyond images (intelligibility, coherence).
  • Energy and cost reductions via convergence improvements
    • Sectors: energy, cloud cost management
    • Idea: if registers reliably speed convergence, total training energy can be reduced for a target quality level.
    • Potential workflow: compare wall-clock/energy-to-FID curves with vs without registers across scales.
    • Dependencies: confirmation that epoch-to-quality gains translate to fewer total steps in large-scale settings.
  • Standardization of diagnostics and benchmarks
    • Sectors: academia, standards bodies
    • Idea: add token-norm outlier rates and intermediate TV to benchmark suites for diffusion models.
    • Potential outcome: better comparability and early detection of regimes where registers help or hurt (e.g., RAE-space).
    • Dependencies: community adoption; reference implementations and dataset-agnostic thresholds.

Glossary

  • adaLN: An adaptive layer normalization module that conditions normalization parameters on context signals. "JiT blocks consist of RMSNorm, adaLN, Attention, and MLP layers."
  • attention sinks: Tokens that consistently attract disproportionate attention, acting as hubs in attention maps. "These questions are also related to broader studies of attention sinks and special-token behavior in transformers [29]"
  • DINOv2: A self-supervised Vision Transformer producing robust visual features used as a baseline or teacher model. "As a representative ViT-based model, we consider DINOv2 [6]."
  • Diffusion Transformers (DiTs): Transformer-based architectures used as the denoiser/backbone in diffusion generative models. "This progress brings Diffusion Transformers (DiTs) closer to ViTs"
  • dual-stream architecture: A design that processes different token types (e.g., registers vs patches) with specialized parameter paths while allowing interaction. "we investigate dual-stream designs that enable specialized processing of register tokens in pixel-space DiTs."
  • FID (Fréchet Inception Distance): A metric for image generation quality based on the Fréchet distance between feature distributions. "Generation quality is evaluated using FID [40]."
  • flow matching: A training objective that learns a continuous vector field mapping between data and noise distributions for generative modeling. "We train pDiTs using flow matching [36, 37] on ImageNet [38] at resolution 256×256 with patch size 16."
  • high-norm patch-token outliers: Patch tokens whose embedding norms are much larger than others, often causing attention artifacts. "Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality"
  • in-context conditioning: Injecting conditioning information by appending learned or duplicated condition tokens into the model’s input sequence. "JiT [24], a pixel-space DiT, employs in-context conditioning by adding duplicated class embeddings to the input sequence"
  • in-context tokens: The appended conditioning tokens processed alongside image tokens within the transformer. "We enable in-context tokens at layer 4, resulting in single-stream layers 0-3 and dual-stream layers 4-11."
  • JiT: A pixel-space Diffusion Transformer architecture that adds in-context conditioning tokens to improve generation. "We consider the JiT architecture [24], which uses in-context conditioning."
  • latent diffusion models: Diffusion models that operate in a compressed latent space produced by an autoencoder rather than directly on pixels. "training directly in pixel space [24, 25, 26], as an alternative to latent diffusion models that rely on pretrained autoencoders [27, 28]."
  • latent space: A lower-dimensional, structured representation produced by an encoder where diffusion can be performed more efficiently. "We additionally analyze latent-space architectures, SiT [21] and RAE [39], using their original backbone designs and training pipelines."
  • linear probing: Evaluating the information content of features by training a linear classifier on frozen representations. "we perform linear probing using register-token features extracted from an intermediate transformer block"
  • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that injects low-rank updates into existing weight matrices. "we use parameter-efficient LoRA adaptations [47], following [48]."
  • MM-DiT: A large-scale text-to-image Diffusion Transformer variant that integrates multimodal tokens. "We also analyze large-scale text-to-image models based on MM-DiT [42]"
  • norm sink: A token that accumulates large activation norms, effectively absorbing magnitude from other tokens to stabilize representations. "some act as norm sinks, while others encode global semantic information."
  • PCA (Principal Component Analysis): A dimensionality-reduction technique used to visualize and interpret high-dimensional feature maps. "We visualize feature maps using PCA, which qualitatively confirms this effect."
  • patch tokens: Tokenized embeddings of image patches that serve as the sequence input to vision transformers. "modeling images as sequences of patch tokens processed via self-attention [4]."
  • pixel space: The raw image domain where diffusion operates directly on pixels rather than on encoded latents. "we primarily focus on pixel-space DiTs based on the standard architecture [20]"
  • RAE (Representation Autoencoder): An autoencoder approach used to define a latent space for diffusion, distinct from standard VAEs. "Registers yield the largest improvements in pixel space, moderate gains in VAE space, and degraded performance in RAE space."
  • register tokens (registers): Extra learnable tokens appended to the token sequence that absorb outliers and carry global information to improve representations. "Registers are implemented as additional learnable tokens appended to the patch-token sequence following [12]"
  • representation alignment: Techniques that align internal features of diffusion models with those of pretrained vision encoders to aid training and convergence. "The insights from Sections 2.2 and 2.3 may relate to recent representation-alignment methods [44, 45]"
  • RMSNorm: Root Mean Square Layer Normalization, a normalization layer based on the RMS of activations. "JiT blocks consist of RMSNorm, adaLN, Attention, and MLP layers."
  • SwiGLU: A gated MLP activation using SiLU gating and linear projections to improve expressivity and stability. "In MLP, we compute a shared SwiGLU (Linear projection followed by SiLU gating) and apply separate output projections for register and patch tokens"
  • Total Variation (TV): A measure of spatial smoothness that quantifies the sum of absolute differences between neighboring values. "we consider Total Variation (TV) [41], which measures spatial smoothness by quantifying intensity differences between adjacent pixels."
  • VAE space: The latent representation space produced by a Variational Autoencoder in which diffusion can be performed. "register tokens show the largest improvements in pixel space, provide smaller gains in VAE space, and, interestingly, degrade performance in RAE-based models"
  • ViT (Vision Transformer): A transformer architecture for images that processes sequences of patch tokens via self-attention. "Vision Transformers (ViTs) [1, 2, 3] have become a dominant architecture for visual representation learning"
  • x-prediction: A diffusion model training target where the network directly predicts the clean data x from noisy inputs. "The model uses flow matching with x-prediction and the forward process $x_t = t\,x + (1 - t)\,\epsilon$"
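
Several glossary entries (flow matching, x-prediction, register tokens) can be made concrete with a small sketch. The code below assumes the forward process quoted above, with t = 1 corresponding to clean data and t = 0 to pure noise, and reads the "e" in the quote as the noise sample. The velocity conversion follows algebraically from that forward process; whether the paper's loss is computed on x or on the velocity is not stated in this section. All function names are hypothetical.

```python
def forward_process(x, eps, t):
    """Interpolate between noise and data: x_t = t * x + (1 - t) * eps."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]


def velocity_from_x_prediction(x_hat, x_t, t):
    """Velocity implied by an x-prediction.

    Differentiating x_t = t * x + (1 - t) * eps with respect to t gives
    x - eps, which equals (x - x_t) / (1 - t) once eps is eliminated.
    """
    if t >= 1.0:
        raise ValueError("t must be < 1 to avoid division by zero")
    return [(xh - xt) / (1.0 - t) for xh, xt in zip(x_hat, x_t)]


def append_registers(patch_tokens, register_tokens):
    """Append register tokens after the patch tokens.

    Returns the combined sequence and the number of patch tokens, so the
    registers can be discarded again at the output.
    """
    return list(patch_tokens) + list(register_tokens), len(patch_tokens)


def drop_registers(sequence, num_patches):
    """Keep only the patch tokens; registers do not produce image output."""
    return sequence[:num_patches]


if __name__ == "__main__":
    x, eps, t = [1.0, 2.0], [0.0, -2.0], 0.5
    x_t = forward_process(x, eps, t)
    print("x_t:", x_t)
    # With a perfect prediction x_hat == x, the implied velocity is x - eps.
    print("velocity:", velocity_from_x_prediction(x, x_t, t))
    seq, n = append_registers(["p0", "p1", "p2"], ["r0", "r1"])
    print("sequence:", seq, "-> output tokens:", drop_registers(seq, n))
```

The register helpers only show the bookkeeping: registers take part in attention alongside patch tokens inside the transformer and are dropped before the output is reshaped into an image.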
