Rethinking Cross-Layer Information Routing in Diffusion Transformers
Abstract: Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (\textsc{DAR}), a drop-in residual replacement that performs \emph{learnable, timestep-adaptive, and non-incremental} aggregation over the history of sublayer outputs. Moreover, the proposed \textsc{DAR} is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet $256\times256$, \textsc{DAR} improves SiT-XL/2 by $2.11$ FID ($7.56$ vs.\ $9.67$) and matches the baseline's converged quality with $8.75\times$ fewer training iterations. Stacked on top of REPA, it yields a $2\times$ training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, \textsc{DAR} can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
Explain it Like I'm 14
Rethinking Cross-Layer Information Routing in Diffusion Transformers: A Simple Guide
What is this paper about?
This paper looks at how Diffusion Transformers (DiTs) pass information from one layer to the next while they turn noisy images into clean, realistic ones. The authors find that the usual way layers "add" their outputs together isn't ideal for this step-by-step denoising process. They propose a new way to pass information called Diffusion-Adaptive Routing (DAR) that helps the model learn faster and make better images.
What questions are the researchers asking?
- Do standard Transformers pass information between layers in a good way for diffusion models that work over many denoising steps?
- What goes wrong when we use the usual "residual addition" (just adding new layer output on top of the running sum) in DiTs?
- Can we design a smarter, time-aware way to combine information from previous layers that improves speed and image quality?
How did they study this?
First, they examined how information flows through a DiT as it gets deeper and as the denoising step changes from very noisy to almost clean. They measured three things in a standard DiT:
- Forward magnitude: think of this as the "volume" of the signal the model passes forward. They found it keeps getting louder layer by layer.
- Backward gradients: this is the "teaching signal" that helps earlier layers learn. They found it fades a lot in deeper layers.
- Similarity between neighboring layers: if two layers give almost the same output, that's redundancy. They found many deep layers were very similar, meaning wasted effort.
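To make two of these measurements concrete, here is a minimal Python sketch. It assumes NumPy and uses toy random arrays in place of real layer outputs, so it is an illustration of the measurements, not the authors' code. It tracks the RMS "volume" of a running residual sum and the cosine similarity between consecutive states. The function names are made up for this example, and the gradient measurement is left out because it needs an autodiff framework.

```python
import numpy as np

def rms(x):
    """Root-mean-square magnitude of an activation array."""
    return float(np.sqrt(np.mean(np.square(x))))

def cosine(a, b):
    """Cosine similarity between two flattened activation arrays."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def residual_diagnostics(layer_outputs):
    """Accumulate outputs by plain residual addition and record, per layer,
    the RMS of the running sum and its similarity to the previous state."""
    state = np.zeros_like(layer_outputs[0])
    magnitudes, similarities = [], []
    for out in layer_outputs:
        new_state = state + out              # standard residual addition
        magnitudes.append(rms(new_state))
        if np.any(state):                    # skip the all-zero initial state
            similarities.append(cosine(state, new_state))
        state = new_state
    return magnitudes, similarities

# Toy data (an assumption): 12 "layers", each emitting random (tokens, dim) output.
rng = np.random.default_rng(0)
outputs = [rng.standard_normal((16, 32)) for _ in range(12)]
mags, sims = residual_diagnostics(outputs)
print("RMS first/last layer:", mags[0], mags[-1])
print("similarity first/last pair:", sims[0], sims[-1])
```

Even with random inputs, the running sum grows and consecutive states look more and more alike, which is the flavor of the "inflation" and "redundancy" symptoms described above. Real measurements would use hidden states from a trained DiT at several timesteps.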
Then, they built DAR, which changes how a layer decides what to keep from earlier layers. Instead of always adding everything equally, each layer looks back and picks which earlier outputs matter most, like a DJ mixing only the tracks that fit the moment.
To keep memory use reasonable, they also group layers into small "chunks" and summarize each chunk, similar to writing a short summary at the end of every few chapters so you don't have to re-read the whole book.
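The chunking idea can be sketched in a few lines of Python. This is a simplified illustration under one stated assumption: each chunk is summarized by its last output. The function name and the default chunk size of 4 are choices made for this example, not a confirmed description of the paper's implementation.

```python
def chunk_summaries(sublayer_outputs, chunk_size=4):
    """Keep one summary per chunk of `chunk_size` consecutive outputs.

    Assumption for this sketch: the summary is the last output in the chunk.
    """
    summaries = []
    for start in range(0, len(sublayer_outputs), chunk_size):
        chunk = sublayer_outputs[start:start + chunk_size]
        summaries.append(chunk[-1])  # last element stands in for the whole chunk
    return summaries

# Ten placeholder outputs; in a real model these would be activation tensors.
outputs = [f"sublayer_{i}" for i in range(10)]
summaries = chunk_summaries(outputs, chunk_size=4)
print(summaries)
```

With ten outputs and a chunk size of 4, a later layer only has to look back over three summaries instead of ten full outputs, which is where the memory saving comes from.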
What is DAR and how does it work (in everyday terms)?
- Usual way: every layer adds its new output to a growing pile of past outputs. This treats all past layers as equally important, all the time.
- DAR's way: each layer asks, "Which earlier layers are most helpful right now?" It then forms a weighted mix of earlier outputs rather than blindly adding everything. The weights:
- Are learned (the model figures them out by itself),
- Change with the denoising step (early steps need big-picture info; late steps need fine details),
- Depend on the current content.
In short, DAR is a smart, time-aware "routing" system that chooses which past information to focus on at each step.
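The weighted mix can be sketched as a small softmax over earlier outputs. The Python below is a simplified illustration, assuming NumPy, one vector per earlier layer, keys obtained by RMS-normalizing those vectors, and a query that is simply handed in. In the real model the query would come from the layer's timestep-conditioned hidden state. The function names and the division by the square root of the dimension are assumptions for this example.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale each vector so its root-mean-square is about 1."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def route(history, query):
    """Mix earlier outputs with softmax weights instead of summing them.

    history: list of (d,) arrays, one per earlier source (a simplification).
    query:   (d,) array standing in for the current layer's query.
    """
    values = np.stack(history)                          # (n, d)
    keys = rms_norm(values)                             # keys from normalized values
    scores = keys @ query / np.sqrt(values.shape[-1])   # scaling is an assumption
    weights = softmax(scores)                           # (n,), sums to 1
    return weights @ values, weights

# Two toy sources and two queries standing in for different denoising steps.
sources = [np.array([2.0, 0.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0, 0.0])]
early_query = np.array([3.0, 0.0, 0.0, 0.0])
late_query = np.array([0.0, 3.0, 0.0, 0.0])
mixed, w_early = route(sources, early_query)
_, w_late = route(sources, late_query)
print("weights for early query:", w_early)
print("weights for late query:", w_late)
```

The same two sources receive different weights depending on the query, which is the sense in which the routing can change from one denoising step to the next.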
What did they find, and why is it important?
- They found three clear problems with standard residual addition in DiTs: 1) The forward signal "inflates" as layers stack up (it gets too large). 2) The gradient (learning signal) to deeper layers gets very weak. 3) Many deep layers produce highly similar features (redundancy).
- DAR fixes these issues by letting each layer selectively attend to the most helpful earlier layers, and by changing the mix as the denoising step progresses. This helps the model:
- Learn faster,
- Use layers more efficiently,
- Keep important details when the image gets cleaner.
- On ImageNet 256×256 (a standard image benchmark), DAR:
- Improved quality by a noticeable margin: for example, it reduced FID (a lower score is better) from 9.67 to 7.56 in one setting, and to 6.92 in another.
- Matched the baseline's final image quality with about 8.75× fewer training iterations (much faster training).
- Stacked well with other techniques like REPA (a training trick that aligns internal features), giving roughly a 2× speed boost early in training when used together.
- Helped keep sharp textures and edges when fine-tuning big text-to-image models using fast distillation methods.
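The headline numbers above are easy to sanity-check with a little arithmetic. The Python snippet below uses only the FID values and the 8.75× factor quoted in this summary. The baseline iteration count of 1,000,000 is a hypothetical figure chosen for illustration, not a number from the paper.

```python
baseline_fid = 9.67   # SiT-XL/2 baseline, as quoted above
dar_fid = 7.56        # same setting with DAR, as quoted above

abs_gain = baseline_fid - dar_fid      # absolute FID improvement
rel_gain = abs_gain / baseline_fid     # improvement relative to the baseline

speedup = 8.75                         # "8.75x fewer training iterations"

def iterations_needed(baseline_iterations):
    """Iterations DAR would need to match the baseline under the quoted factor."""
    return baseline_iterations / speedup

hypothetical_baseline_iters = 1_000_000  # illustrative assumption only
print(round(abs_gain, 2))
print(round(rel_gain, 3))
print(round(iterations_needed(hypothetical_baseline_iters)))
```

So the 2.11-point drop is roughly a 22% reduction in FID, and a run that would take a million iterations for the baseline would take a bit over 114,000 under the quoted factor.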
What is FID?
- FID (Fréchet Inception Distance) is a common score that measures how close the generated images are to real images. Lower is better.
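For readers who want to see what the score computes, here is a minimal Python sketch of the Fréchet distance between two Gaussians, assuming NumPy. Real FID applies this formula to features from a pretrained Inception network, which this example does not include. The random feature arrays are stand-ins, and the function names are made up for this example.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)) for two Gaussians."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues of S1 S2.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

def fid_from_features(real, fake):
    """Fit a Gaussian to each feature set (rows are samples) and compare them."""
    return frechet_distance(
        real.mean(axis=0), np.cov(real, rowvar=False),
        fake.mean(axis=0), np.cov(fake, rowvar=False),
    )

identity = np.eye(3)
same = frechet_distance(np.zeros(3), identity, np.zeros(3), identity)
shifted = frechet_distance(np.zeros(3), identity, np.array([1.0, 0.0, 0.0]), identity)
wider = frechet_distance(np.zeros(3), 4.0 * identity, np.zeros(3), identity)

rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 3))   # stand-in features, not Inception outputs
self_score = fid_from_features(feats, feats)
print(same, shifted, wider, self_score)
```

Identical distributions score zero, and the score grows when either the mean or the spread of the generated features drifts away from the real ones.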
Why does this matter?
- For creators and labs: Faster training and better images mean lower costs and quicker iteration.
- For model design: The paper shows that "how layers talk to each other" (cross-layer routing) is a powerful and overlooked lever. It's not just about bigger models or new objectives; smarter information flow matters a lot.
- For future research: DAR works alongside other improvements (like REPA) rather than replacing them, suggesting we can stack advances to get bigger gains.
- For applications: Better detail preservation and faster training benefit image generation, and the ideas may carry over to video or other generative tasks.
Takeaway
Diffusion models improve images step-by-step from noise to clarity. Because what matters changes over time (coarse structure first, fine details later), the model should also change how it combines information across layers. DAR gives the Transformer this time-aware "mixing board," leading to faster learning and better images.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of concrete gaps and unresolved questions that future work could address to solidify, generalize, or better understand the proposed Diffusion-Adaptive Routing (DAR) and its diagnostics.
- External validity across data and tasks:
- Does DAR's advantage hold on diverse datasets (e.g., COCO, LAION, FFHQ), higher resolutions (≥512px), and other modalities (video, 3D, audio)?
- How does DAR behave under long-context or high-token-count regimes typical of high-res latent grids or video frames?
- Wall-clock efficiency and resource accounting:
- The paper reports iteration-based speedups; what are the true wall-clock gains after accounting for per-iteration FLOPs, memory, activation checkpointing, and communication overhead introduced by the router?
- What is the inference-time throughput and memory impact for typical sampler budgets (e.g., 10–50 NFEs) and larger batch sizes?
- Fairness of system-level comparisons:
- Results compare models trained with different budgets/recipes (e.g., U-ViT/U-DiT, SiT-Plus); controlled ablations with matched data, compute, and hyperparameters are needed to isolate DAR's contribution.
- Robustness across samplers and guidance:
- DAR is evaluated mainly at 250 NFEs and CFG w=1.5; how do gains scale with few-step sampling (e.g., ≤20 NFEs), different ODE/SDE solvers, and a wide sweep of CFG weights?
- Compatibility with broader DiT ecosystems:
- How does DAR interact with cross-attention-heavy T2I stacks (prompt adherence, alignment, safety filters)? Are router queries better conditioned on text embeddings or cross-attn states?
- Does DAR play well with alternative objectives (EDM, RF/Rectified Flow, v/ε/velocity parameterizations), distillation methods (LCM/LCM-LoRA), or schedule choices?
- Missing baselines from residual-routing literature:
- Head-to-head comparisons with residual-strength/scaling fixes (ReZero, LayerScale, DeepNet), normalization variants (PostNorm, Pre/Post hybrids, SiameseNorm), and multi-stream designs (Hyper-Connections, DenseFormer, mHC) in DiTs are absent; these could test whether DAR's benefits exceed simpler or cheaper fixes.
- Router design space underexplored:
- Keys/queries are minimal (k_i = RMSNorm(v_i), single-head softmax). Would learned K/V projections, multi-head routing, temperature control, or alternative mixers (sparsemax/entmax, Gumbel-softmax, mixtures with residual gates) yield better sparsity, stability, or interpretability?
- Should routing be token-wise, head-wise, or feature-group-wise instead of global per-layer? What is the trade-off with compute and memory?
- Timestep-awareness mechanism needs stronger causal evidence:
- The linear probe shows t is decodable, but does ablating/tampering with timestep pathways (e.g., shuffling e(t), disabling adaLN modulation, randomized t during router computation) causally degrade DAR's gains?
- How do DAR variants perform under different noise schedules or discrete timesteps?
- Diagnostic methodology limitations:
- "Counterfactual source importance" via gradients w.r.t. inserted scalar gates is a local measure; does it predict actual performance when those gates are used to reweight sources at inference/training time?
- The depth/timestep "symptom" analysis is shown prominently at t=1.0; comprehensive heatmaps over t, training stage, and seeds with confidence intervals would strengthen claims.
- Non-incremental aggregation theory is preliminary:
- The rate-distortion model and Proposition for chunk size S lack empirical estimation of α and sensitivity analyses; do predicted optima persist across depths L, model scales, and datasets?
- Alternative chunk summaries (e.g., learned pooling over sublayers, EMA, attention-pooled summaries) might outperform "last output" summaries; this is not tested.
- Chunking and scaling risks:
- For very deep stacks, how do routing stability, memory, and softmax condition numbers scale with chunk count and S? Are there failure modes (e.g., source-collapse to a few early summaries) that harm effective depth?
- Interpretability and function of selected sources:
- Which layers are selected across t (early vs deep), and do selections correlate with spatial frequency content or semantic abstraction? Can we map router weights to coarse-to-fine feature usage?
- Regularization and stability:
- Is entropy/sparsity regularization on routing weights beneficial to avoid collapse or oscillations? What is the impact of DropPath/Stochastic Depth, dropout, or attention masking on routing stability?
- Gradient-flow claims need broader evidence:
- While the paper shows symptom "tightening," full gradient-norm statistics across depth and t, with baselines like DeepNet/LayerScale, would clarify whether DAR uniquely improves gradient propagation.
- Retrofits and fine-tuning procedures:
- For large pre-trained T2I models, what is the best way to insert DAR (e.g., initialize router to identity, freeze/unfreeze which blocks)? How much compute is needed to recover/retain quality?
- Quantitative DMD evaluation is missing:
- The claim that DAR preserves high-frequency details during DMD is visual; quantitative metrics (LPIPS, DISTS, edge/texture preservation, FID/IS at few steps) and user studies are needed.
- Sensitivity to hyperparameters:
- How sensitive is DAR to router learning rates, temperature/scale of q·k, RMSNorm ε, optimizer momentum, and weight decay? Clear tuning guidelines would aid reproducibility.
- Numerical considerations:
- Does the router require special numerical stabilization (log-sum-exp scaling, precision for q/k, gradient clipping)? How does bf16/fp8 training and inference affect routing accuracy?
- Pruning and dynamic-depth opportunities:
- Can router weights guide block pruning, early-exit, or conditional computation at inference to reduce latency while preserving quality?
- Statistical reliability:
- Results lack confidence intervals/seed variance. Reporting multiple seeds, CIs for FID/IS/precision/recall, and ablation error bars would solidify conclusions.
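The timestep linear probe mentioned in the list above can be illustrated with a closed-form ridge regressor. The Python sketch below assumes NumPy and uses synthetic features in which the timestep is linearly embedded plus noise; it is not the paper's probing setup, and all names, sizes, and the regularization strength are assumptions. The shuffled-label fit is only a sanity baseline for the probe, not the causal ablation the list calls for.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression with a bias column (bias is regularized too)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def r2_score(y, y_hat):
    """Coefficient of determination; 1.0 means a perfect fit."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic stand-in for hidden states: t is linearly encoded plus noise.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, size=400)
direction = rng.standard_normal(16)
feats = np.outer(t, direction) + 0.1 * rng.standard_normal((400, 16))

train, test = slice(0, 300), slice(300, 400)
w = fit_ridge(feats[train], t[train])
probe_r2 = r2_score(t[test], predict(w, feats[test]))

# Sanity baseline: a probe trained on shuffled labels should decode little.
w_shuffled = fit_ridge(feats[train], rng.permutation(t[train]))
shuffled_r2 = r2_score(t[test], predict(w_shuffled, feats[test]))
print(probe_r2, shuffled_r2)
```

A high held-out score shows only that t is decodable from the features, which is exactly why the open question above asks for interventions on the timestep pathway rather than more probing.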
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that can be implemented with current tooling, data, and compute. Each item lists sector(s), potential tools/products/workflows, and key assumptions or dependencies that affect feasibility.
- DAR as a drop-in residual replacement in DiT training pipelines
- Sectors: software/AI infrastructure, foundation model labs, academia
- What to do: Replace pre-norm residual addition with Diffusion-Adaptive Routing (DAR) in existing DiT backbones (e.g., SiT, DiT, MM-DiT, PixArt-style models) to improve sample quality and accelerate convergence.
- Tools/products/workflows:
- Add a "DAR module" into PyTorch/JAX Transformer blocks; expose config for static vs dynamic query and chunk size S (default S≈4).
- Provide a Trainer callback to log routing weights, hidden-state RMS, gradient RMS, and inter-block cosine similarity over timestep t.
- Create a Hugging Face Diffusers integration (config flag: routing="dar") and a PyTorch Lightning plugin for easy adoption.
- Assumptions/dependencies: Access to model internals to modify residual routing; modest engineering for activation checkpointing with chunked aggregation; compute overhead of the router is offset by fewer iterations to convergence; training data and recipes similar to SiT.
- Faster pretraining and fine-tuning of text-to-image (T2I) models with better high-frequency detail retention
- Sectors: media/entertainment/design, advertising/marketing, e-commerce/retail, academia
- What to do: Use DAR to reduce training iterations (8.75× fewer to match SiT baseline quality on ImageNet 256) and improve FID; apply DAR during Distribution Matching Distillation (DMD) to preserve fine textures, edges, and logos.
- Tools/products/workflows:
- "DAR-on-finetune" switch in T2I fine-tuning scripts (including few-step DMD pipelines).
- Asset QA workflows focusing on edge acuity and micro-texture metrics; integrate with CFG samplers.
- Assumptions/dependencies: DMD or few-step distillation code in place; inference latency essentially unchanged (routing overhead is light vs denoising loop cost); licensing and content-safety filters still required for production use.
- Training-cost and carbon-footprint reduction without sacrificing quality
- Sectors: energy/climate reporting, enterprise MLOps, public-sector AI programs
- What to do: Adopt DAR to cut the training iterations, and with them the compute, needed to reach target quality thresholds; include cross-layer diagnostics in sustainability dashboards.
- Tools/products/workflows:
- MLOps dashboards showing FID/IS vs GPU-hours; add "DAR-enabled" run tags.
- Procurement justification reports highlighting energy savings and cost per FID point.
- Assumptions/dependencies: Organization tracks compute/energy; model quality targets and stopping criteria defined; routing overhead amortized by faster convergence.
- REPA + DAR combined training for early-stage acceleration
- Sectors: software/AI infrastructure, academia
- What to do: Combine DAR with REPA representation alignment to achieve compounding speedups (≈2× additional early-stage acceleration over REPA alone).
- Tools/products/workflows:
- A joint training recipe (yaml) exposing REPA loss weights and DAR router settings.
- Early stopping and model selection based on rapid FID drops at 100k–300k steps.
- Assumptions/dependencies: Availability of a suitable pretrained visual encoder for REPA; careful loss balancing; same compute budget constraints.
- Synthetic data generation with sharper details for vision model training
- Sectors: robotics, autonomous systems, industrial inspection, AR/VR
- What to do: Use DAR-enabled DiTs to generate higher-frequency, more realistic textures for synthetic datasets to train perception models (domain randomization or data augmentation).
- Tools/products/workflows:
- Dataset generators that toggle DAR for sharper photorealism; texture-focused validation (LPIPS, DISTS).
- Downstream re-training scripts measuring sim-to-real gains.
- Assumptions/dependencies: Synthetic-to-real transfer pipelines exist; safety checks for biased or unrealistic artifacts; validation on target tasks required.
- Academic diagnostics and curricula for cross-layer, timestep-aware analysis
- Sectors: academia, research engineering
- What to do: Adopt the paper's three diagnostics (forward magnitude inflation, backward gradient decay, and block-wise redundancy) jointly over depth and timestep to study and teach diffusion model internals.
- Tools/products/workflows:
- Lightweight "Residual Diagnostics" toolkit to compute RMS magnitudes, gradient RMS, and inter-block cosine similarities vs t.
- Teaching labs demonstrating PreNorm dilution and timestep-adaptive routing.
- Assumptions/dependencies: Access to training runs and gradients; reproducible checkpoints.
- Product imagery and creative tooling with faster iteration cycles
- Sectors: e-commerce/retail, marketing, design studios
- What to do: Fine-tune or distill house T2I models with DAR to reach usable quality faster and retain crisp product edges, fabrics, or metallic textures important for conversion.
- Tools/products/workflows:
- Rapid A/B creative generation with DAR-enabled checkpoints; automated QA for moiré, edge sharpness, and text fidelity.
- Assumptions/dependencies: Existing T2I workflows and human-in-the-loop review; content moderation and watermarking retained.
- Practical chunk-size auto-tuning
- Sectors: software/AI infrastructure
- What to do: Default chunk size S≈4 (per the paper's theory and ablations) and auto-tune S with a short pilot run to balance memory, router precision, and compression.
- Tools/products/workflows:
- "ChunkSizeAutoTuner" that fits a budget-aware objective and tries S in {2,4,6}.
- Assumptions/dependencies: Monitoring utilities; stable training configuration.
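As one illustration of the chunk-size auto-tuning idea in the list above, the following Python sketch picks S from a small candidate set using a pilot metric and an optional memory budget. Everything here is hypothetical: the function name, the callback signatures, and the toy metric and memory estimates are assumptions for illustration, and no such tool is described in the paper.

```python
def pick_chunk_size(pilot_metric, candidates=(2, 4, 6),
                    memory_cost=None, memory_budget=None):
    """Return the candidate chunk size with the lowest pilot metric.

    pilot_metric(S) -> float, lower is better (e.g. loss after a short pilot run).
    memory_cost(S)  -> estimated memory; candidates over the budget are skipped.
    Both callbacks are hypothetical hooks supplied by the caller.
    """
    best = None
    for s in candidates:
        over_budget = (
            memory_cost is not None
            and memory_budget is not None
            and memory_cost(s) > memory_budget
        )
        if over_budget:
            continue
        score = pilot_metric(s)
        if best is None or score < best[1]:
            best = (s, score)
    if best is None:
        raise ValueError("no candidate chunk size fits the memory budget")
    return best[0]

# Toy stand-ins: a metric that happens to bottom out at S=4, and a memory
# estimate that shrinks as chunks get larger (fewer stored summaries).
unconstrained = pick_chunk_size(lambda s: (s - 4) ** 2)
constrained = pick_chunk_size(
    lambda s: s,                      # this toy metric would prefer S=2 ...
    memory_cost=lambda s: 56 // s,    # ... but S=2 stores 28 summaries
    memory_budget=20,
)
print(unconstrained, constrained)
```

In practice the pilot metric would come from short training runs, which are noisy, so a real tuner would likely need repeated runs or a tolerance before preferring one S over another.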
Long-Term Applications
These opportunities likely require further research, scaling studies, system co-design, or broader ecosystem changes before widespread deployment.
- Video, 3D, and multimodal diffusion with phase-aware routing
- Sectors: film/VFX, gaming, digital twins, CAD/3D content, multi-sensor robotics
- What to do: Extend DAR to temporal and spatial hierarchies in video or 3D diffusion (routing over depth and time/geometry); learn phase-aware routing policies for different denoising regimes.
- Tools/products/workflows:
- "Spatiotemporal DAR" modules with temporal chunking and geometry-aware keys.
- Assumptions/dependencies: Large-scale training at high resolution and long contexts; careful memory management; new evaluation metrics beyond 2D FID.
- Adaptive compute at inference via router-driven depth skipping
- Sectors: mobile/edge AI, interactive creative tools, cloud serving
- What to do: Use DAR's softmax weights to identify low-contribution depth regions and introduce dynamic block skipping or early-exit strategies for faster sampling.
- Tools/products/workflows:
- Inference-time "RouterPruner" that thresholds routing weights to skip computations.
- Assumptions/dependencies: Research on accuracy/speed trade-offs; stable policies across prompts and timesteps; guardrails to avoid quality collapse.
- Hardware-compiler co-design for depth-attentive Transformers
- Sectors: semiconductor, cloud platforms, systems research
- What to do: Co-optimize memory layouts and kernels for āattention over depthā (efficient storage/access of source sets and chunk summaries); fuse router computations with layer kernels.
- Tools/products/workflows:
- CUDA/Triton kernels for key-query matmuls across depth; activation tiling strategies.
- Assumptions/dependencies: Vendor support; cost/benefit validated at billion-parameter scale.
- Standards and policy for compute- and energy-efficient generative training
- Sectors: policy/government, sustainability programs, standards bodies
- What to do: Encourage reporting of energy-per-quality metrics (e.g., kWh per FID@N samples) and adoption of routing-based efficiency improvements in public RFPs and grants.
- Tools/products/workflows:
- Benchmarks including convergence-speed targets; model cards with routing diagnostics.
- Assumptions/dependencies: Community consensus on metrics; independent audits and reproducibility norms.
- Safety and governance responses to lower training barriers
- Sectors: policy, trust & safety, platform governance
- What to do: Because DAR reduces training cost/time for high-fidelity generators, expand red-teaming, watermarking, provenance (C2PA) integration, and content filters for models trained with DAR.
- Tools/products/workflows:
- Mandatory watermarking in DAR-based releases; router-weight anomaly checks during finetuning for misuse indicators.
- Assumptions/dependencies: Coordination with standards (C2PA), platform policies, and forensic tools.
- Cross-domain routing research beyond images (audio, speech, bio, geospatial)
- Sectors: speech/TTS, music generation, healthcare/bio, earth observation
- What to do: Investigate timestep- or noise-phase-aware routing in 1D/structured modalities (e.g., denoising schedules in audio; phase-aware features in genomics).
- Tools/products/workflows:
- "DAR-1D" modules; domain-specific diagnostics mirroring magnitude/gradient/redundancy analyses.
- Assumptions/dependencies: Domain-appropriate objectives and timesteps; datasets and task-aligned metrics.
- Improved distillation frameworks (few-step and one-step) that retain fine detail
- Sectors: consumer apps, on-device AI, enterprise design tools
- What to do: Pair DAR with next-gen distillation (e.g., improved LCM/consistency models) to push quality at extremely low step counts.
- Tools/products/workflows:
- Distillers that route teacher features across depth/timesteps into student updates.
- Assumptions/dependencies: Stable training for 1–4 step generators; robust CFG equivalents; extensive prompt coverage.
- Automated architecture search along the routing axis
- Sectors: AutoML/MLSys, research labs
- What to do: Treat routing (query parameterization, chunk size, source set topology) as a search space and optimize for compute-normalized quality.
- Tools/products/workflows:
- NAS pipelines that include routing primitives; meta-learning over denoising schedules.
- Assumptions/dependencies: Compute budgets for search; reliable early-stage quality predictors.
- Domain-specific adoption in healthcare imaging and scientific simulation
- Sectors: healthcare, materials science, climate modeling
- What to do: Use DAR to train synthetic or enhancement models that preserve diagnostically relevant high-frequency details while controlling hallucinations.
- Tools/products/workflows:
- Prospective studies with radiologist/physicist evaluation; uncertainty quantification overlays.
- Assumptions/dependencies: Regulatory approval, strict validation, bias and safety analysis; institutional review for synthetic data usage.
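The router-driven skipping idea listed above could start from something as simple as thresholding recorded routing weights. The Python sketch below assumes NumPy and a weight matrix with one row per consuming layer and one column per source. The function name, the peak-weight criterion, and the 0.05 threshold are all assumptions for illustration, and the effect on image quality is exactly the open research question noted in that item.

```python
import numpy as np

def low_contribution_sources(routing_weights, threshold=0.05):
    """Flag sources that no layer ever weights above `threshold`.

    routing_weights: (num_layers, num_sources) array of softmax weights,
    e.g. averaged over a batch of prompts and timesteps (an assumption).
    Returns the column indices that are candidates for skipping.
    """
    peak = routing_weights.max(axis=0)   # strongest use of each source by any layer
    return [int(i) for i in np.flatnonzero(peak < threshold)]

# Toy weights: two consuming layers, three sources.
weights = np.array([
    [0.90, 0.08, 0.02],
    [0.60, 0.39, 0.01],
])
candidates = low_contribution_sources(weights, threshold=0.05)
print(candidates)
```

A flagged source is only a candidate: weights can differ across prompts and timesteps, so any skipping policy would need validation against quality metrics before use.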
Summary of key dependencies across applications
- Technical: Ability to modify DiT blocks; memory overhead managed via chunked aggregation (S≈4 is a strong default); router adds small per-layer compute; convergence speed gains amortize overhead.
- Data/metrics: Access to suitable datasets; use of both classic (FID/IS) and task-specific metrics (edge acuity, LPIPS, precision/recall).
- Safety/compliance: Watermarking, provenance, and content filters remain necessary; domain approvals for healthcare/science.
- Ecosystem: Integration with training frameworks (PyTorch/JAX/Lightning), Diffusers-style APIs, and MLOps dashboards for diagnostics and energy reporting.
Glossary
- adaLN: Adaptive Layer Normalization used to modulate hidden states conditioned on timestep or content. "the query is computed from the current adaLN-modulated hidden state"
- adaLN-Zero: A zero-initialized variant of adaptive LayerNorm used for stable conditioning in DiTs. "through DiT's adaLN-Zero conditioning pathway"
- Attention Residuals: A routing scheme that replaces fixed residual addition with depth-wise softmax attention. "Drawing on the recently proposed Attention Residuals (AttnRes) framework~\citep{team2026attention},"
- backward gradient decay: The phenomenon where gradient magnitudes diminish with depth, hindering learning in deeper layers. "sharp backward gradient decay"
- block-wise redundancy: High similarity between consecutive blocks' outputs indicating redundant representations. "pronounced block-wise redundancy."
- classifier-free guidance (CFG): A sampling technique that steers generation using conditional vs. unconditional predictions without an auxiliary classifier. "with and without classifier-free guidance~\citep{ho2022classifier}"
- chunked aggregation: A memory-efficient routing approach that summarizes past sublayers in chunks for softmax aggregation. "The chunked aggregation in \S\ref{sec:DAR} exposes a single knob, the chunk size $S$,"
- counterfactual importance: A gradient-based measure estimating how a hypothetical router would reweight sources. "as a counterfactual importance of how a baseline-equivalent router would reweight each source if one existed"
- cross-attention: An attention mechanism that fuses tokens across modalities or sources (e.g., text-image). "retains conventional cross-attention"
- cross-layer information routing: Mechanisms that select and combine representations from different depths. "cross-layer information routing in DiTs"
- denoising timestep: The continuous time variable controlling noise level during diffusion; routing should adapt across it. "the denoising timestep --- the very dimension that distinguishes DiTs --- should play a vital role"
- DenseFormer: A Transformer variant with learned depth aggregation to enhance information flow. "learned depth aggregation in DenseFormer~\citep{pagliardini2024denseformer}"
- Diffusion-Adaptive Routing (DAR): A residual replacement that performs learnable, timestep-adaptive, non-incremental aggregation across layers. "we propose Diffusion-Adaptive Routing (DAR)"
- Diffusion Transformers (DiTs): Transformer-based denoisers for diffusion models that operate on tokenized latents. "Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation"
- Distribution Matching Distillation (DMD): A distillation method aligning student and teacher generative distributions. "during Distribution Matching Distillation."
- doubly stochastic constraints: Constraints ensuring rows and columns of a mixing matrix sum to one, stabilizing multi-stream propagation. "by imposing doubly stochastic constraints on the mixing"
- Fréchet Inception Distance (FID): A metric comparing distributions of generated and real images via Inception features. "Fréchet Inception Distance (FID;~\citep{heusel2017gans})"
- Inception Score (IS): A generative quality metric assessing confidence and diversity via the Inception classifier. "Inception Score (IS;~\citep{salimans2016improved})"
- isotropic and homogeneous Transformer stack: An architecture where layers share a uniform structure without handcrafted stage pairings. "preserving the isotropic and homogeneous Transformer stack"
- latent autoencoders: Encoders/decoders mapping images to and from compressed latent spaces for diffusion. "latent autoencoders ---"
- linear-probe diagnostic: A test using a simple linear model to assess whether a feature encodes a target variable (e.g., timestep). "We test this premise directly with a linear-probe diagnostic."
- ODE sampler: An ordinary-differential-equation-based sampler for deterministic diffusion/flow trajectories. "We use both ODE and SDE samplers"
- PreNorm dilution: A degradation where Pre-Norm residuals cause growing activations and vanishing gradients with depth. "PreNorm dilution phenomenon"
- rate-distortion model: A framework trading off compression rate and reconstruction fidelity used to analyze chunked aggregation. "Under a mild rate-distortion model,"
- rectified-flow training: A flow-based generative training approach that straightens trajectories for faster sampling. "rectified-flow training at scale"
- REPA: A training method that aligns DiT hidden states with pretrained visual representations to accelerate learning. "REPA~\citep{yu2025repa} accelerates DiT training"
- representation-alignment objective: A loss that encourages model features to match external pretrained representations. "introducing a representation-alignment objective"
- residual stream: The accumulated pathway that carries and adds each layer's outputs across the network. "The residual stream that governs how information accumulates across layers"
- ridge regressor: A linear regression with L2 regularization, used here to decode timestep from features. "fit a ridge regressor"
- RMSNorm: Root-Mean-Square Layer Normalization that normalizes activations by their RMS. ""
- sFID (spatial Fréchet Inception Distance): A spatial variant of FID that emphasizes local structure. "spatial Fréchet Inception Distance (sFID;~\citep{nash2021generating})"
- SDE sampler: A stochastic-differential-equation-based sampler for probabilistic diffusion trajectories. "We use both ODE and SDE samplers"
- SiT: Scalable Interpolant Transformer framework that unifies diffusion- and flow-based generative objectives. "SiT~\citep{ma2024sit} unifies diffusion- and flow-based objectives"
- softmax attention: A weighted aggregation mechanism using softmax over similarity scores (queries and keys). "a softmax attention over preceding sublayer outputs"
- source-mixing patterns: The distribution of weights over historical layer outputs selected by the router. "Source-mixing patterns across denoising timesteps."
- timestep embedding: A learned vector injected to condition layers on the current diffusion time. "the timestep embedding "
- U-Net-style long skip connections: Long-range connections linking shallow and deep layers to fuse multi-scale features. "U-Net-style long skip connections"
- U-ViT: A ViT-based diffusion architecture that treats noisy patches, timesteps, and conditions as tokens with skip connections. "U-ViT shows that noisy image patches, timesteps, and conditions can be treated as tokens"
- velocity-prediction MSE: The mean squared error objective for predicting the diffusion velocity field. "the velocity-prediction MSE used for SiT training"