Register Tokens in Diffusion Transformers

Updated 19 May 2026

Register tokens in Diffusion Transformers (DiTs) are dedicated, learnable vector slots appended to patch tokens to absorb high-norm outliers and maintain uniform attention distributions.
They are implemented using dual-stream and recursive strategies that optimize integration in both vision encoders and denoiser modules, enhancing convergence and generation fidelity.
Empirical evaluations demonstrate that register tokens reduce feature norm variance, accelerate training (up to 4× faster convergence), and improve final sample quality with minimal computational overhead.

Register tokens in Diffusion Transformers (DiTs) are dedicated, learnable vector slots appended to the sequence of patch tokens processed by transformer blocks in diffusion-based generative models. Originally developed to absorb high-norm feature outliers in Vision Transformers (ViTs), register tokens now play a critical role in DiT architectures: they serve as overflow reservoirs for signal magnitude, promote smoother intermediate representations, accelerate convergence, and stabilize training, even in models that do not exhibit classical ViT outlier pathologies. A range of mechanisms now exists for introducing, optimizing, and exploiting register tokens in both latent-space and pixel-space diffusion transformers.

1. Origins and Motivation

The motivation for register tokens in transformer models arises from observed pathological behaviors in ViTs, where a small number of patch tokens develop unusually large norms, becoming "attentional sinks" that absorb disproportionate attention while degrading the semantic quality of local features. In the ViT context, Darcet et al. demonstrated that appending several learnable register tokens to the patch sequence absorbs these outlier norms, restoring uniform activation magnitudes and normalizing internal attention distributions.

As transformer-based diffusion models—Diffusion Transformers (DiTs)—emerged for image and text-to-image generation, their increasing similarity to ViTs prompted investigation into whether comparable outlier phenomena and register remedies would hold. Empirical evidence shows that while latent-space DiTs and their encoders can develop high-norm tokens, pixel-space DiTs generally lack pre-existing patch-token outliers. Nevertheless, the introduction of register tokens benefits both latent and pixel-space settings by shifting or absorbing potential outliers, improving convergence, and enhancing generation fidelity (Wu et al., 6 May 2026, Starodubcev et al., 15 May 2026).

2. Formalization of Outlier Tokens and Register Tokens

Consider a sequence of $N$ patch tokens at layer $\ell$ of a transformer, $\mathbf{z}^\ell = [z^\ell_1,\dots,z^\ell_N] \in \mathbb{R}^{N\times d}$, with each $z^\ell_i\in\mathbb{R}^d$. The Euclidean norm of token $i$ is $n^\ell_i = \|z^\ell_i\|_2$. A token is an outlier if its norm exceeds a multiple of the median norm,

$$n^\ell_i > \kappa \cdot \mathrm{median}_{j}(n^\ell_j),$$

with typical $\kappa=2$, or if it exceeds a fixed threshold $\tau$, which gives the outlier set

$$\mathcal{O}^\ell = \{ i : n^\ell_i > \tau \}.$$

The outlier fraction per layer is

$$\mathrm{OutlierFraction}(\ell) = \mathbb{E}_\text{images} \left[ \frac{|\mathcal{O}^\ell|}{N} \right].$$

Register tokens are appended to the patch sequence,

$$\tilde{\mathbf{z}}^\ell = [z^\ell_1,\dots,z^\ell_N, r_1,\dots,r_M] \in \mathbb{R}^{(N+M)\times d},$$

with $M$ learnable vectors $r_1,\dots,r_M \in \mathbb{R}^d$. These participate in all self-attention and MLP computations but are excluded from the diffusion loss, serving purely internal roles.
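The outlier criteria above can be checked numerically. The following is a minimal NumPy sketch, not code from the cited papers: the function names `outlier_mask` and `outlier_fraction` are hypothetical, and the expectation over images is approximated by a mean over a small batch.

```python
import numpy as np

def outlier_mask(z, kappa=2.0, tau=None):
    """Boolean mask of outlier tokens for one layer.

    z: array of shape (N, d) holding the patch tokens of one image.
    If tau is given, apply the fixed rule n_i > tau; otherwise apply
    the relative rule n_i > kappa * median_j(n_j).
    """
    norms = np.linalg.norm(z, axis=-1)  # n_i = ||z_i||_2
    threshold = tau if tau is not None else kappa * np.median(norms)
    return norms > threshold

def outlier_fraction(batch, kappa=2.0, tau=None):
    """Mean of |O| / N over a batch of shape (B, N, d)."""
    fractions = [outlier_mask(z, kappa, tau).mean() for z in batch]
    return float(np.mean(fractions))

# Deterministic toy data: 4 images, 16 tokens, 8 channels.
batch = np.ones((4, 16, 8))
batch[:, 0, :] *= 10.0  # inflate the norm of token 0 in every image

frac = outlier_fraction(batch)
print(frac)  # 0.0625, i.e. 1 outlier out of 16 tokens
```

With the relative rule, only the inflated token exceeds twice the median norm, so the fraction is 1/16. Passing `tau` switches to the fixed-threshold definition.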

3. Architectural Integration and Dual-Stage Register Mechanisms

Vision Encoder Registers: A single register is appended to the ViT encoder stream. For pretrained ViTs used in latent- or autoencoder-based image diffusion, the register is added at test time and absorbs outlier clusters without any encoder retraining (Wu et al., 6 May 2026).
Diffusion Model Registers: In the denoiser, a fixed number of trainable register tokens is prepended starting from a chosen block, with the best insertion block determined empirically for 32-block DiTs. These are optimized jointly with the model parameters via the standard diffusion loss.
Pixel-space DiTs: Register tokens are introduced after a fixed early block (e.g., block 4 in a 12-layer model) (Starodubcev et al., 15 May 2026). Architectural specialization via a dual-stream design splits patch and register streams for later layers, enabling separate normalization (RMSNorm), MLP, and optional LoRA-adapted normalization, while sharing attention projections.
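Introducing registers partway through the network can be sketched as follows. This is an illustrative NumPy toy, not the architecture of either cited paper: `toy_block` is a hypothetical single-head attention layer standing in for a full DiT block, the registers are random rather than learned, and they are appended to the end of the sequence (the source describes both appending and prepending).

```python
import numpy as np

def toy_block(x, w):
    """Stand-in for one transformer block: single-head self-attention plus residual."""
    h = x @ w
    scores = h @ h.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return x + attn @ h

def forward_with_registers(patches, block_weights, registers, insert_at):
    """Run all blocks, appending registers just before block index `insert_at`.

    Registers take part in attention from that block onward and are
    dropped at the output, so only patch tokens reach the loss.
    """
    n_patches = patches.shape[0]
    x = patches
    for i, w in enumerate(block_weights):
        if i == insert_at:
            x = np.concatenate([x, registers], axis=0)
        x = toy_block(x, w)
    return x[:n_patches]

rng = np.random.default_rng(0)
N, M, d, depth = 16, 4, 8, 12
patches = rng.normal(size=(N, d))
registers = rng.normal(size=(M, d))  # would be learnable parameters in practice
block_weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(depth)]

# insert_at=4 (0-indexed) means registers join after the first four blocks.
out = forward_with_registers(patches, block_weights, registers, insert_at=4)
# insert_at=depth never triggers, giving a register-free baseline.
baseline = forward_with_registers(patches, block_weights, registers, insert_at=depth)
```

The output always has the patch shape `(N, d)`, whether or not registers were used, which is what allows the diffusion loss to ignore them.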

A recursive test-time register scheme is used with frozen encoders: when outliers above the threshold $\tau$ are detected, dummy registers are appended and the sequence is re-encoded; at most two iterations typically suffice.
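The control flow of this recursion can be sketched as below, under several assumptions: the function name `recursive_register_encode` is hypothetical, the dummy registers are zero vectors (the source does not say how they are initialized), and `toy_encoder` is a contrived stand-in that places a large activation on whichever token is last in the sequence.

```python
import numpy as np

def recursive_register_encode(encode, patches, tau, max_iters=2, regs_per_iter=1):
    """Re-encode with extra dummy registers until no patch norm exceeds tau.

    encode: callable mapping (N + R, d) tokens to (N + R, d) features.
    Returns the patch features and the number of registers used.
    """
    n, d = patches.shape
    num_regs = 0
    feats = encode(patches)
    for _ in range(max_iters):
        patch_norms = np.linalg.norm(feats[:n], axis=-1)
        if np.all(patch_norms <= tau):
            break  # no outliers left among the patches
        num_regs += regs_per_iter
        dummy = np.zeros((num_regs, d))  # assumption: zero-initialized registers
        feats = encode(np.concatenate([patches, dummy], axis=0))
    return feats[:n], num_regs

def toy_encoder(tokens):
    """Contrived frozen encoder: dumps excess magnitude on the last token."""
    out = tokens.copy()
    out[-1] += 10.0
    return out

patches = np.ones((16, 8))
feats, num_regs = recursive_register_encode(toy_encoder, patches, tau=5.0)
print(num_regs)  # 1
```

In this toy setting a single extra pass is enough: once a register occupies the last position, it absorbs the large activation and every patch norm falls below the threshold.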

4. Functional Roles and Mechanisms

Norm Sinks: Registers consistently acquire the highest norms and attract excess signal, acting as overflow vessels to stabilize magnitude and suppress spurious outlier activation without corrupting patch semantics or spatial detail (Starodubcev et al., 15 May 2026).
Global Semantic Carriers: Subsets of register tokens, as identified by linear probing, encode meaningful, global semantic context—recovering high classification accuracy and structuring attention on object or background context (Starodubcev et al., 15 May 2026).

This dual functionality is emergent. In single-stream processing, DiT tends to relegate norm excess to registers, while dual-stream designs explicitly enable separate optimization and dynamic usage of registers for overflow versus semantic summarization.

In practice, registers yield marked reductions in per-token norm variance, as well as in feature-map total variation (TV), especially at high-noise denoising steps, producing smoother, more coherent representations. Visualizations in both PCA space and patch-norm heatmaps corroborate more uniform attention and improved local patch semantics (Wu et al., 6 May 2026, Starodubcev et al., 15 May 2026).
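Both diagnostics are easy to compute from intermediate features. The sketch below assumes an anisotropic L1 definition of total variation over the patch grid, since the source does not specify which TV variant is used; the function names are hypothetical.

```python
import numpy as np

def norm_variance(tokens):
    """Variance of per-token L2 norms for tokens of shape (N, d)."""
    return float(np.var(np.linalg.norm(tokens, axis=-1)))

def total_variation(tokens, height, width):
    """Anisotropic L1 total variation of the feature map.

    tokens: (height * width, d) patch features in row-major order.
    Sums absolute differences between vertically and horizontally
    adjacent patches over all channels.
    """
    fmap = tokens.reshape(height, width, -1)
    vertical = np.abs(fmap[1:, :, :] - fmap[:-1, :, :]).sum()
    horizontal = np.abs(fmap[:, 1:, :] - fmap[:, :-1, :]).sum()
    return float(vertical + horizontal)

# A flat 4x4 feature map versus one with a single high-norm patch.
smooth = np.ones((16, 8))
spiky = smooth.copy()
spiky[5] *= 10.0  # patch at grid position (1, 1)

tv_smooth = total_variation(smooth, 4, 4)
tv_spiky = total_variation(spiky, 4, 4)
print(tv_smooth, tv_spiky)  # 0.0 288.0
```

A single inflated interior patch has four neighbours, each differing by 9 in all 8 channels, giving a TV of 288. Its presence also raises the norm variance, which is the signature that registers are reported to suppress.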

5. Quantitative Impacts and Empirical Results

Empirical evaluation shows that register tokens in DiTs:

Accelerate Convergence: Notably lower FID during early training epochs (e.g., for pDiT-B/16 at epoch 1) (Starodubcev et al., 15 May 2026).
Improve Final Sample Quality: Systematic FID gains at convergence on ImageNet across model scales, the largest in the table below being a reduction of 0.96 for pDiT-B/16-256:

Model	Params	FID (no registers)	FID (with registers)
pDiT-B/16-256	131M	4.13	3.17
pDiT-L/16-256	459M	2.62	2.47
pDiT-H/16-256	953M	2.80	2.47

Reduce Outlier Artifacts: Outlier fraction per layer sharply decreases in both encoder and denoiser when using register-based interventions (Wu et al., 6 May 2026).
Enhance Training Efficiency: Generation curves indicate up to 4× faster convergence with registers (Wu et al., 6 May 2026).
Minimal Overhead: Well-designed dual-stream registration adds only 14% parameters (compact dual) or negligible runtime overhead (Starodubcev et al., 15 May 2026). Compute increases of 9–14% GFLOPs are reported in latent-space DiTs with dual-stage registers (Wu et al., 6 May 2026).

In conditional diffusion, in-context class tokens and text tokens act as implicit registers and mainly drive FID improvement through this register-like function rather than through explicit label injection (Starodubcev et al., 15 May 2026).
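The FID values in the table of this section can be restated as absolute and relative gains. The snippet below only does arithmetic on the numbers already reported; the helper name `fid_gain` is hypothetical.

```python
# (FID without registers, FID with registers), copied from the table.
fid = {
    "pDiT-B/16-256": (4.13, 3.17),
    "pDiT-L/16-256": (2.62, 2.47),
    "pDiT-H/16-256": (2.80, 2.47),
}

def fid_gain(no_reg, reg):
    """Return the absolute FID reduction and the reduction relative to the baseline."""
    absolute = no_reg - reg
    return absolute, absolute / no_reg

gains = {name: fid_gain(*pair) for name, pair in fid.items()}

for name, (absolute, relative) in gains.items():
    print(f"{name}: -{absolute:.2f} FID ({100 * relative:.1f}% lower)")
# pDiT-B/16-256: -0.96 FID (23.2% lower)
# pDiT-L/16-256: -0.15 FID (5.7% lower)
# pDiT-H/16-256: -0.33 FID (11.8% lower)
```

On these numbers the smallest model benefits most in both absolute and relative terms.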

6. Implementation Strategies and Practitioner Guidelines

Key guidelines for leveraging register tokens in DiTs:

Use a fixed number of registers, introduced after an early block (e.g., after block 4 in a 12-layer DiT).
Apply a compact dual-stream architecture: share attention, but use separate normalization and output MLP heads for register streams, optionally with LoRA adapters for normalization.
For pretrained encoders, use recursive test-time register injection—append dummy registers if outlier threshold is exceeded and re-encode; terminate when all patch norms are below threshold.
The optimal insertion block for denoiser registers is determined empirically, with the reported optimum found for a 32-block model (ImageNet, RAE-DiT).
No dedicated loss or regularization for registers is required; gradients from the main diffusion loss suffice.
Monitor patch-norm statistics and feature-map TV to assess register effectiveness.
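The compact dual-stream guideline can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions rather than the published design: attention is single-head, the MLP uses ReLU, timestep conditioning and LoRA adapters are omitted, and all names (`dual_stream_block`, `init_params`, the parameter keys) are hypothetical.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm over the channel dimension with a learnable gain."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps) * gain

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # two-layer ReLU MLP

def dual_stream_block(x, n_patches, p):
    """x: (N + M, d); the first n_patches rows are patches, the rest registers."""
    # Stream-specific normalization before attention.
    h = np.concatenate([
        rms_norm(x[:n_patches], p["patch_gain1"]),
        rms_norm(x[n_patches:], p["reg_gain1"]),
    ])
    # Shared attention projections over the joint sequence.
    q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    x = x + (attn @ v) @ p["wo"]
    # Stream-specific normalization and MLP.
    patches, regs = x[:n_patches], x[n_patches:]
    patches = patches + mlp(rms_norm(patches, p["patch_gain2"]),
                            p["patch_w1"], p["patch_w2"])
    regs = regs + mlp(rms_norm(regs, p["reg_gain2"]), p["reg_w1"], p["reg_w2"])
    return np.concatenate([patches, regs])

def init_params(d, hidden, rng):
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "wq": w(d, d), "wk": w(d, d), "wv": w(d, d), "wo": w(d, d),
        "patch_gain1": np.ones(d), "reg_gain1": np.ones(d),
        "patch_gain2": np.ones(d), "reg_gain2": np.ones(d),
        "patch_w1": w(d, hidden), "patch_w2": w(hidden, d),
        "reg_w1": w(d, hidden), "reg_w2": w(hidden, d),
    }

rng = np.random.default_rng(0)
N, M, d = 16, 4, 8
params = init_params(d, hidden=16, rng=rng)
x = rng.normal(size=(N + M, d))
out = dual_stream_block(x, N, params)
```

Because only the normalization gains and MLP weights are duplicated, the register stream can specialize without a second set of attention projections. Within a single block, changing the register MLP leaves the patch outputs untouched.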

7. Limitations and Open Directions

Known constraints of current register-based schemes include:

Slight computational cost increase (roughly 10% GFLOPs).
Test-time recursion for encoder registers requires 1–2 additional forward passes.
Absence of explicit regularization or sparsity constraints on register weights. Future work may explore enforcing sparsity, orthogonality, or alternative objectives to further refine register behavior.
Register benefits are most pronounced in pixel-space diffusion; in RAE (DINOv2) latent space, registers may even degrade performance (Starodubcev et al., 15 May 2026).

A plausible implication is that the efficacy of register tokens in DiTs is determined by both architectural design (single- vs. dual-stream) and data modality, and further study into the interplay between register optimization, semantic partitioning, and inherent model bottlenecks is warranted.

References:

Taming Outlier Tokens in Diffusion Transformers (Wu et al., 6 May 2026)
Registers Matter for Pixel-Space Diffusion Transformers (Starodubcev et al., 15 May 2026)
