Selective Underfitting in Diffusion Models (2510.01378v1)

Published 1 Oct 2025 in cs.LG

Abstract: Diffusion models have emerged as the principal paradigm for generative modeling across various domains. During training, they learn the score function, which in turn is used to generate samples at inference. They raise a basic yet unsolved question: which score do they actually learn? In principle, a diffusion model that matches the empirical score in the entire data space would simply reproduce the training data, failing to generate novel samples. Recent work addresses this question by arguing that diffusion models underfit the empirical score due to training-time inductive biases. In this work, we refine this perspective, introducing the notion of selective underfitting: instead of underfitting the score everywhere, better diffusion models more accurately approximate the score in certain regions of input space, while underfitting it in others. We characterize these regions and design empirical interventions to validate our perspective. Our results establish that selective underfitting is essential for understanding diffusion models, yielding new, testable insights into their generalization and generative performance.

Summary

  • The paper demonstrates that underfitting in diffusion models is selective: the learned score closely matches the empirical score inside the supervision region while deviating from it in the extrapolation region traversed during inference.
  • It introduces a theoretical framework built on denoising score matching, employing metrics such as the Bhattacharyya coefficient and r⋆ to quantify the isolation of the supervision region and how far inference trajectories stray from it.
  • The authors propose Perception-Aligned Training (PAT) as a unifying principle for shaping extrapolation behavior, and analyze architectural and regularization strategies through a decomposition of generative performance into supervision and extrapolation components.

Selective Underfitting in Diffusion Models: A Technical Analysis

Introduction

This paper introduces and formalizes the concept of selective underfitting in diffusion models, challenging the prevailing view that underfitting of the empirical score function occurs uniformly across the data space. The authors demonstrate that diffusion models are supervised only within a highly restricted region of the data space during training, and that underfitting is concentrated in regions outside this supervision region—specifically, the regions traversed during inference. This selective underfitting is shown to be essential for understanding both generalization and generative performance in diffusion models.

Theoretical Framework: Supervision and Extrapolation Regions

Diffusion models are trained via denoising score matching (DSM), which aims to learn the score function—the gradient of the log density of the data distribution convolved with noise. The DSM objective is minimized over noisy versions of the training data, which, in high-dimensional spaces, concentrate on thin spherical shells around each data point. The analytic minimizer of the DSM loss is the empirical score function, which, due to the concentration of the training distribution, reduces to a trivial form: near the shells, the score points almost directly back to the nearest training point.
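
To make this concrete, the sketch below (an illustration assuming the standard forward parameterization x_t = α_t x + σ_t ε and NumPy, not code from the paper) evaluates the closed-form empirical score of the Gaussian-mixture training distribution; near a shell, the softmax weights become nearly one-hot, so the score points back toward the corresponding training point.

```python
import numpy as np

def empirical_score(x_t, data, alpha_t, sigma_t):
    """Closed-form score of the Gaussian-mixture training distribution
    p_hat_t(x_t) = (1/N) * sum_i N(x_t; alpha_t * x_i, sigma_t^2 I).

    x_t: (d,) query point; data: (N, d) training set.
    Returns grad_x log p_hat_t(x_t), a (d,) vector.
    """
    diffs = alpha_t * data - x_t                     # (N, d): alpha_t * x_i - x_t
    sq_dists = np.sum(diffs ** 2, axis=1)            # ||x_t - alpha_t * x_i||^2
    logits = -sq_dists / (2.0 * sigma_t ** 2)
    w = np.exp(logits - logits.max())
    w /= w.sum()                                     # softmax weights over training points
    return (w[:, None] * diffs).sum(axis=0) / sigma_t ** 2

# Hypothetical usage: a noisy copy of data[0] lies on its shell, the weights w
# are nearly one-hot on index 0, and the score points back toward alpha_t * data[0].
# rng = np.random.default_rng(0)
# data = rng.standard_normal((100, 256))
# x_t = 0.8 * data[0] + 0.6 * rng.standard_normal(256)
# s = empirical_score(x_t, data, alpha_t=0.8, sigma_t=0.6)
```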

The authors rigorously prove that the training distribution is highly restricted, with the vast majority of noisy training inputs lying within these effectively non-overlapping shells (the supervision region). The Bhattacharyya coefficient is used to quantify the negligible overlap between shells for most timesteps, confirming the effective isolation of the supervision region.
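
For two mixture components with the same isotropic covariance σ_t²I, the Bhattacharyya coefficient has the closed form exp(-‖μ_i − μ_j‖² / (8σ_t²)), which makes the negligible-overlap claim easy to sanity-check. The sketch below is an illustration under that assumption, not the paper's implementation.

```python
import numpy as np

def bhattacharyya_coefficient(mu_i, mu_j, sigma_t):
    """Overlap between two isotropic Gaussians N(mu_i, sigma_t^2 I) and
    N(mu_j, sigma_t^2 I): BC = exp(-||mu_i - mu_j||^2 / (8 sigma_t^2)).
    Values near 0 mean the corresponding shells are effectively disjoint."""
    sq = float(np.sum((np.asarray(mu_i) - np.asarray(mu_j)) ** 2))
    return float(np.exp(-sq / (8.0 * sigma_t ** 2)))

# Hypothetical numbers: if two scaled training points are ~40 apart in latent
# space and sigma_t = 1.0, then BC = exp(-1600 / 8) ~ 1e-87, i.e. no overlap.
```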

Extrapolation During Inference

Empirical analysis reveals that, during inference, denoising trajectories rapidly leave the supervision region and enter the extrapolation region—areas of the data space where the model receives no direct supervision. The authors introduce a quantitative metric, r⋆, to measure the distance of inference samples from the nearest supervision shell, showing that r⋆ increases sharply during inference, indicating early and persistent extrapolation.
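
The paper's exact definition of r⋆ is not reproduced here; the sketch below is a hypothetical proxy that treats each supervision shell as having radius roughly σ_t√d around α_t x^(i), so r⋆ ≈ 0 means the sample still lies on some shell and large r⋆ indicates extrapolation.

```python
import numpy as np

def r_star(x_t, data, alpha_t, sigma_t):
    """Distance from x_t to the nearest supervision shell (illustrative proxy).
    Each training point x_i contributes a thin shell of radius ~ sigma_t*sqrt(d)
    centred at alpha_t * x_i; r_star ~ 0 means x_t still lies on some shell,
    while large r_star means the sampler is extrapolating."""
    d = x_t.shape[-1]
    shell_radius = sigma_t * np.sqrt(d)
    dists = np.linalg.norm(alpha_t * data - x_t, axis=1)  # distance to each shell centre
    return float(np.min(np.abs(dists - shell_radius)))
```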

Selective Underfitting: Empirical Evidence

The central claim is that underfitting of the empirical score function is selective: as model capacity increases, the learned score approaches the empirical score within the supervision region (no underfitting), but deviates further in the extrapolation region (increased underfitting). This is validated through experiments on ImageNet using SiT-XL and other architectures, where the score error ‖sθ − s⋆‖² is measured separately in both regions. The results show a clear dichotomy: memorization of training data in the supervision region and generalization (or creative generation) in the extrapolation region.
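
A minimal sketch of this kind of region-split evaluation is given below; the callable interface (model score, empirical score, and a membership test such as r⋆ ≤ δ) is an assumption chosen for illustration rather than the paper's actual evaluation code.

```python
import numpy as np

def region_split_score_error(s_theta, s_star, in_supervision_region, x_batch):
    """Average ||s_theta(x) - s_star(x)||^2 separately inside and outside the
    supervision region.  The three callables are illustrative assumptions:
      s_theta               : learned score,   (d,) -> (d,)
      s_star                : empirical score, (d,) -> (d,)
      in_supervision_region : (d,) -> bool, e.g. a test such as r_star(x) <= delta
    """
    sup_errors, ext_errors = [], []
    for x in x_batch:
        err = float(np.sum((s_theta(x) - s_star(x)) ** 2))
        (sup_errors if in_supervision_region(x) else ext_errors).append(err)

    def avg(errors):
        return float(np.mean(errors)) if errors else float("nan")

    return avg(sup_errors), avg(ext_errors)  # (supervision error, extrapolation error)
```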

The phenomenon persists even for simple ground-truth distributions (e.g., Gaussian), indicating that selective underfitting is not merely a consequence of data complexity or neural network smoothness, but a fundamental property of the training procedure.

Generalization Mechanism: Freedom of Extrapolation

The paper advances the freedom of extrapolation hypothesis: the ability of a diffusion model to generalize is directly tied to the size of the supervision region. Enlarging the supervision region constrains the model's freedom to extrapolate, leading to increased memorization and reduced generalization. Controlled experiments, where the support of the training distribution is varied independently of the empirical score, confirm that models with restricted supervision regions generalize better, while those with expanded regions tend to memorize.

This mechanism is structurally analogous to benign overfitting in supervised learning, but with the critical distinction that the training and inference distributions are fundamentally different in diffusion models.

Generative Performance: Decomposed Analysis and Scaling Laws

The authors propose a decomposed framework for analyzing generative performance, separating it into supervision loss (fit to the empirical score in the supervision region) and an extrapolation function (mapping supervision loss to sample quality, e.g., FID). This decomposition enables principled comparison of training recipes and architectures.
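
In practice this decomposition can be used by recording (supervision loss, FID) pairs across checkpoints for each recipe and fitting a curve from L to FID; the sketch below assumes a simple linear fit, and the numbers in the commented usage are hypothetical, included purely to illustrate the comparison workflow.

```python
import numpy as np

def fit_extrapolation_curve(supervision_losses, fids):
    """Fit FID ~ a * L + b for one training recipe from (L, FID) pairs recorded
    across checkpoints.  The linear form is an assumption; at a matched L, a
    lower predicted FID indicates better extrapolation efficiency."""
    L = np.asarray(supervision_losses, dtype=float)
    F = np.asarray(fids, dtype=float)
    a, b = np.polyfit(L, F, deg=1)
    return float(a), float(b)

# Hypothetical numbers, purely for illustration:
# base = fit_extrapolation_curve([0.42, 0.38, 0.35], [18.0, 12.5, 9.7])
# repa = fit_extrapolation_curve([0.41, 0.38, 0.34], [11.2, 8.1, 6.0])
# A recipe whose curve sits lower at the same L converts supervision into
# sample quality more efficiently.
```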

Empirical results show that architectural choices (e.g., U-Net vs. Transformer) and regularization strategies (e.g., REPA) affect extrapolation efficiency and supervision efficiency differently. For instance, convolutional architectures yield more efficient extrapolation but less efficient supervision, while transformers scale better with compute but have less favorable extrapolation properties. The framework explains the empirical success of hybrid architectures and representation alignment methods.

Perception-Aligned Training (PAT): Unified Principle

The paper introduces Perception-Aligned Training (PAT) as a unified hypothesis for designing training recipes that induce favorable extrapolation behavior. PAT encompasses three main strategies:

  • Aligning diffusion space: Training in perceptually meaningful latent spaces.
  • Aligning representation: Regularizing internal representations to match perceptual features.
  • Aligning architectural bias: Designing architectures with inductive biases that reflect perceptual invariances.

Experiments on both real and synthetic data demonstrate that PAT leads to perceptually superior samples by improving extrapolation, even when supervision is held constant.

Implementation Considerations

Training and Evaluation

  • Models are trained on large-scale datasets (e.g., ImageNet) using standard DSM objectives, with careful separation of supervision and extrapolation regions for analysis.
  • FLOPs and FID are used as primary metrics for scaling and performance evaluation.
  • Importance sampling and efficient nearest-neighbor search (e.g., Faiss) are employed for experiments involving variable supervision-region support; a minimal Faiss sketch follows this list.
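
As a concrete illustration of the nearest-neighbor component, the following sketch uses Faiss exact L2 search to find each query's closest training latent; the helper names are ours, and the flat index type is an assumption (the paper does not specify its configuration).

```python
import numpy as np
import faiss  # pip install faiss-cpu

def build_nn_index(train_latents):
    """Exact L2 index over training latents of shape (N, d)."""
    index = faiss.IndexFlatL2(train_latents.shape[1])
    index.add(np.ascontiguousarray(train_latents, dtype=np.float32))
    return index

def nearest_train_distance(index, queries):
    """Squared L2 distance from each query (e.g., a generated or denoised latent)
    to its nearest training latent, plus the neighbour's index."""
    D, I = index.search(np.ascontiguousarray(queries, dtype=np.float32), 1)
    return D[:, 0], I[:, 0]
```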

Architectural Trade-offs

  • Transformer-based models offer better supervision efficiency at scale, while convolutional models provide superior extrapolation efficiency.
  • Hybrid architectures and representation alignment (e.g., REPA) can balance these trade-offs for improved generative performance.

Practical Implications

  • Training recipes should be designed to maximize freedom of extrapolation without excessively enlarging the supervision region.
  • PAT principles can be applied to latent space design, representation regularization, and architectural choices to enhance sample quality and generalization.

Implications and Future Directions

The selective underfitting framework provides a new lens for understanding generalization and sample quality in diffusion models, moving beyond explanations based solely on global inductive bias. The decomposed analysis of generative performance and the PAT principle offer actionable guidance for model and training recipe design.

Future research should focus on:

  • Theoretical characterization of extrapolation mechanisms in high-dimensional generative models.
  • Development of training algorithms that optimally balance supervision and extrapolation efficiency.
  • Extension of selective underfitting analysis to other classes of generative models (e.g., flow-based, autoregressive).

Conclusion

Selective underfitting is established as a central principle for understanding diffusion models, with profound implications for generalization, memorization, and generative performance. The decomposed analysis framework and PAT hypothesis unify a broad range of successful training strategies, providing a foundation for future advances in generative modeling.

Explain it Like I'm 14

Overview

This paper studies how diffusion models (AI systems that generate images, videos, or audio) actually learn during training and how that affects the pictures they create later. The authors introduce a new idea called “selective underfitting.” In simple terms, they show that diffusion models learn very accurately in some parts of their input space but not in others. This helps explain why these models can create new images instead of just copying their training data.

Key Questions

The paper focuses on a few clear questions:

  • Where, exactly, do diffusion models learn during training?
  • Why do they produce new images at inference (sampling) time instead of just reproducing the training set?
  • How does the size of the “supervised” area affect creativity (generalization) versus memorization?
  • How do different training choices (like architecture or extra regularization) change the model’s ability to turn training into good generation?

Methods and Approach (explained simply)

Think of training diffusion models like teaching a robot to “clean” noisy pictures until they become clear. The robot learns a “score function,” which is basically a direction arrow that tells it how to push a noisy image toward a more realistic image.

Here’s the key twist the authors explain:

  • During training, the robot sees noisy versions of real images. In high-dimensional spaces (like big image tensors), these noisy versions cluster on thin shells around each training image: picture each training image sitting at the center of a thin, bubble-like shell of noisy points (a quick numerical check of this follows the list). The model learns arrows that point from shell points straight back to the center—the original training image.
  • During inference (when the model creates new images), the robot starts with pure noise and follows learned arrows to clean it up step by step. The authors show that very early in this journey, the robot leaves those shells and wanders into regions that weren't directly supervised during training. In other words, the model must "extrapolate"—make smart guesses—outside the area it was taught.
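
A quick way to see why the shells are thin: in d dimensions, Gaussian noise of scale σ almost always has length very close to σ√d. Here is a small NumPy check of that fact (our illustration, not from the paper):

```python
import numpy as np

# In d dimensions, Gaussian noise of scale sigma almost always has length close
# to sigma * sqrt(d), so noisy copies of an image form a thin shell around it.
rng = np.random.default_rng(0)
d, sigma = 32 * 32 * 3, 0.5                      # a small RGB image, arbitrary noise level
radii = sigma * np.linalg.norm(rng.standard_normal((2000, d)), axis=1)
print(radii.mean(), radii.std())                 # ~27.7 with a spread of only ~0.35
```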

To study this, the authors:

  • Defined a “supervision region” (the set of thin shells around each training image) and an “extrapolation region” (areas outside those shells).
  • Measured how close the learned arrows are to the perfect arrows inside the shells versus outside them.
  • Ran scaling experiments with different model sizes.
  • Tested what happens if they artificially change how big the supervision region is.
  • Analyzed generative performance (like FID scores) by splitting it into two parts: how well the model fits the supervised arrows and how well that fit translates to good images through extrapolation.
  • Checked the effects of specific training choices (like REPA regularization and different architectures).

Analogy: Learning inside the shells is like practicing driving in a quiet parking lot. Extrapolation is like driving on real roads with traffic. The model gets very good in the parking lot, but great generation depends on how well it handles the real roads.

Main Findings and Why They Matter

Here are the main results, described in everyday language:

  • Selective underfitting: The model fits the “arrow directions” very well inside the supervision region (the shells), but it underfits (differs more) outside these regions. As models get bigger and better, they improve inside the shells but underfit more outside. This selective pattern is key to how they generate new content.
  • Inference leaves the training region early: When generating an image, the model quickly moves out of the shells where it was trained. So most of generation relies on extrapolation—not on direct supervision.
  • Memorization inside the shells: If you start from a noisy version right on those shells, the model often snaps back to the original training image. This shows it learns the in-shell directions almost perfectly—which explains why some timesteps behave like “memorization,” while other timesteps lead to new images.
  • Freedom of extrapolation helps creativity: If you make the supervision region bigger (so the model is directly trained in more places), the model tends to memorize more and generalize less. Giving the model more “freedom” outside the supervised area encourages it to create new, diverse samples.
  • Explaining a real-world paradox (Classifier-Free Guidance): During training, the “conditional” and “unconditional” score arrows look almost the same inside the shells. But during inference they behave differently. The paper explains this by noting that their supervision happens over different regions, so their extrapolation differs—and that’s why guidance works at sampling time.
  • Decomposed view of generative performance: The authors propose a simple rule of thumb: FID (image quality) depends on two parts: the supervision loss (how well the model fits in-shell arrows) and an “extrapolation function” (how effectively that good fit turns into good images outside the shells). This helps compare training methods more fairly.
  • REPA helps extrapolation: Adding REPA (a representation alignment regularizer) barely changes the in-shell behavior but substantially changes out-of-shell behavior, improving image quality. So REPA’s gains come mostly from better extrapolation.
  • Architecture trade-offs: Convolution-heavy U-Nets often have better extrapolation efficiency (they turn a given level of supervision into better FID) but can be less efficient at reducing supervision loss for a fixed compute budget. Transformers tend to be better at rapidly reducing supervision loss as you scale compute, which is why they are popular at large scale. Combining strengths (good extrapolation and good supervision efficiency) can be powerful.
  • Perception-Aligned Training (PAT): The authors propose a simple guiding idea: train the model so that inputs that look similar to humans produce similar score outputs. When your training space, features, or architecture better match human perception, extrapolation tends to produce better-looking images. They outline three ways to align with perception:
    • Align the training space (use perceptually meaningful latent spaces).
    • Align internal representations (like REPA).
    • Align the network’s bias (e.g., using convolutions to respect image locality and translation).

Conclusion and Impact

This paper changes how we think about diffusion models. Instead of imagining they learn everywhere equally, it shows they learn very well inside a narrow supervised region and rely heavily on extrapolation outside it. That insight:

  • Explains how these models can both memorize (inside the shells) and generate creatively (outside the shells).
  • Suggests that making the supervised zone too large can hurt creativity.
  • Provides a practical way to analyze and improve training: split performance into supervision and extrapolation, and design methods (like PAT) that align the model with human perception to get better extrapolation.
  • Helps researchers and engineers choose architectures and training tricks more wisely, aiming for the best balance between learning the supervised arrows and turning that learning into great images.

In short, selective underfitting and the supervision-versus-extrapolation view give a clearer roadmap for building more powerful, creative, and efficient diffusion models.

Knowledge Gaps

Below is a single, concrete list of knowledge gaps, limitations, and open questions left unresolved by the paper. Each item is intended to be specific and actionable for future research.

  • Formal analytic characterization of the extrapolation region: precisely define and quantify the set of inputs where inference queries the model outside the supervision shells, including its dependence on timestep, noise schedule, ambient and manifold dimensions, and dataset geometry.
  • Theory for selective underfitting: derive conditions (on architecture class, optimizer, regularization, data distribution, and dimensionality) under which DSM-trained networks provably fit the empirical score inside the supervision region while underfitting outside; provide bounds on the deviation as a function of model capacity and training compute.
  • Mechanism shaping the extrapolated score: identify which inductive biases (e.g., smoothness, convolutional equivariance, attention locality), optimization dynamics, or representation constraints determine the learned score in the extrapolation region; design controlled interventions to isolate their causal effects.
  • CFG paradox formalization: develop a predictive model linking the geometry of conditional vs unconditional supervision regions to divergence of scores at inference; characterize how guidance weight, class distribution, and region overlap quantitatively affect CFG’s effectiveness.
  • Robust estimation of supervision-region overlap on real data: move beyond illustrative Bhattacharyya coefficients by providing statistically sound, scalable overlap measures across datasets, resolutions, and latent spaces; analyze sensitivity to dataset size N, ambient dimension d, and encoder choice.
  • Dataset scale effects: establish how increasing N (and diversity) alters shell overlap, empirical score collapse, and the timing/extent of extrapolation during inference; propose scaling laws with respect to N and data heterogeneity.
  • Generality across domains and training paradigms: validate selective underfitting beyond ImageNet and SiT (e.g., text-to-image, audio, video, 3D), and across diffusion vs flow-matching training, v-parameterization, and alternative forward processes.
  • Sampler and guidance dependence: systematically test whether leaving the supervision region early is inherent or contingent on sampler choice (DDPM vs DDIM vs higher-order samplers), step counts, noise schedules, and CFG strength; quantify the effect sizes.
  • Statistical rigor and reproducibility: report confidence intervals, multiple seeds, and hypothesis tests for key findings (r⋆ distributions, memorization ratios, FID–loss regressions); document hyperparameters, δ choices, approximation details, and code to enable replication.
  • Precise operational metrics for “memorization” and “generalization”: standardize definitions beyond regression to the nearest training image (e.g., novelty scores, precision/recall for generative models, coverage, nearest-neighbor distances), and validate them across datasets.
  • Scalability of empirical score computation: clarify how the analytic score s⋆ is computed at ImageNet scale (full dataset vs subsampling/approximation), quantify induced bias/variance, and develop efficient estimators with provable error bounds.
  • Selective underfitting under Gaussian ground truth: provide a formal explanation (or stronger empirical evidence) for why selective underfitting persists when the true score is linear; characterize when global fitting fails despite linearity.
  • Functional form of the extrapolation function f_extrapolation: move beyond empirical linear fits to derive or test theoretical forms mapping supervision loss L to FID; establish invariance (or lack thereof) across recipes, datasets, and scales, and identify confounders.
  • Causal disentanglement of supervision vs extrapolation: design interventions that change extrapolation behavior without altering supervision loss (and vice versa) to identify causal pathways from training recipe to generative performance.
  • Formalization of PAT: define a perceptual metric d_perc, state alignment criteria for outputs sθ(z,t), and develop quantitative tests that isolate perception alignment effects from other regularizers; provide theoretical guarantees or counterexamples.
  • Negative controls and failure modes for PAT: identify scenarios where perception alignment does not improve extrapolation or harms supervision (e.g., misaligned encoders, domain mismatch), and characterize the trade-offs and boundary conditions.
  • Algorithms to balance supervision and extrapolation: propose and evaluate training methods (e.g., adaptive region sizing, loss reweighting over timesteps/inputs, curriculum schedules) that explicitly target a desired generalization–memorization trade-off.
  • Privacy and safety implications: quantify how selective underfitting and region sizing affect memorization risk (membership inference, training data leakage), and develop mitigations that preserve generative performance.
  • Robustness to distribution shift: test whether extrapolated scores generalize under out-of-distribution prompts/styles and across domains; measure stability and failure cases of extrapolation under shift.
  • Role of data augmentations: analyze how augmentations (cropping, translation, color jitter) reshape the supervision region and extrapolation freedom; provide guidelines for augmentation policies that enhance generalization without increasing memorization.
  • Dissecting representation alignment (REPA and beyond): identify which feature properties (semantic abstraction vs low-level invariances) most improve f_extrapolation; compare alternative encoders and alignment objectives; measure their effect sizes.
  • Architecture-level study: systematically map the continuum between U-Net and transformer designs under fixed compute and matched supervision loss; decouple supervision efficiency (FLOPs → L) from extrapolation efficiency (L → FID) to isolate architectural contributions.
  • Formal link between overlap coefficient C(t) and phase transitions: derive the relationship between shell overlap, empirical score collapse, and the probability of regressing to training samples across timesteps; explain observed “phase transitions” quantitatively.
  • Alternative training objectives: test whether SNR-weighted losses, timestep reweighting, or alternative score parameterizations alter selective underfitting, extrapolation behavior, and the FID–L relationship; identify principled recipes.
  • Conditional generation geometry: extend selective underfitting analysis to text-conditioning, modeling how text embeddings shape supervision regions and extrapolation; study implications for prompt adherence vs sample diversity.

Practical Applications

Immediate Applications

Below are deployable applications that translate the paper’s selective underfitting perspective into concrete tools, workflows, and decisions across sectors.

  • Supervision–Extrapolation Diagnostics Toolkit
    • Sector: software/MLOps, academia, model evaluation
    • Tools/Workflow: implement r⋆ distance tracking to the supervision region, overlap (Bhattacharyya) coefficient across timesteps, per-timestep memorization ratio, and contrastive scaling (‖sθ − s⋆‖ in supervision vs extrapolation); fit the supervision–performance decomposition FID ≈ f_extrapolation(L) to compare training recipes; integrate in training dashboards and CI
    • Assumptions/Dependencies: access to training-data statistics or proxies; additional compute for logging; ability to approximate s⋆ (exactly available only when training data are known or for synthetic tasks)
  • Recipe selection with supervision–extrapolation decomposition
    • Sector: generative AI products (images/video/audio), research labs
    • Tools/Workflow: monitor supervision loss L and the induced f_extrapolation curve for each recipe (baseline vs REPA vs architecture variants); choose training recipes that minimize FID for a fixed L (better extrapolation efficiency) or minimize L per FLOP (better supervision efficiency); adopt REPA-like representation alignment when extrapolation efficiency is the bottleneck
    • Assumptions/Dependencies: consistent evaluation protocol; availability of pre-trained perceptual encoders when using representation alignment
  • Practical “creativity vs faithfulness” knob in products
    • Sector: creative tools, consumer apps, enterprise content generation
    • Tools/Workflow: expose a control that modulates extrapolation freedom at inference (e.g., altered guidance, early escape from supervision region, or sampling schedules); enable users to choose between novel outputs vs faithful reconstructions/style adherence
    • Assumptions/Dependencies: safe ranges must be tuned to avoid artifacts; user experience design to make the trade-off intuitive
  • Memorization risk audits and compliance reports
    • Sector: policy/legal, creative industries, finance, advertising
    • Tools/Workflow: report memorization ratio, r⋆ distribution during inference, and overlap metrics; document how recipe choices (subset selection, augmentation, noise schedule) alter supervision region size and regurgitation risk
    • Assumptions/Dependencies: legal definitions of “memorization” vary; requires ground-truth or near-duplicates search; access to (or hashes of) training sets increases audit fidelity
  • Data curation for originality
    • Sector: media, marketing, entertainment, foundation model training
    • Tools/Workflow: separate “score” vs “region” subsets (as in the paper’s experiment) or use strong augmentations/noise schedules to shrink effective supervision region; target higher freedom of extrapolation to increase novelty
    • Assumptions/Dependencies: exact s⋆ requires knowledge of training data; in practice, use proxies (e.g., curated subsets, augmentation regimes) to alter supervision region without explicit s⋆
  • Architecture guidance under compute constraints
    • Sector: model development, platform providers
    • Tools/Workflow: leverage trade-offs shown in the decomposition—transformers tend to have better supervision efficiency (FLOPs → L), while U-Nets often have better extrapolation efficiency (L → FID); select/hybridize based on target and budget; track efficiency curves rather than only headline FID
    • Assumptions/Dependencies: results generalize best to image diffusion; actual trade-offs vary by dataset/scale
  • Domain-specific perception-aligned training (PAT) now
    • Sector: healthcare imaging, industrial vision, audio/music, 3D/graphics
    • Tools/Workflow: plug in domain encoders to REPA-like objectives; train in perceptually aligned spaces (e.g., domain-specific VAEs) to improve extrapolation quality without large supervision changes; for audio, use psychoacoustic embeddings; for medical imaging, clinician-aligned features
    • Assumptions/Dependencies: availability and licensing of high-quality domain encoders; careful validation to avoid clinically unsafe hallucinations
  • CFG-aware conditioning and guidance redesign
    • Sector: diffusion-based content generation
    • Tools/Workflow: recognize that conditional vs unconditional differences stem from extrapolation, not training-time score differences; adjust guidance schedules or conditional data composition so that supervision regions are better matched to the target deployment; experiment with guidance that explicitly shapes extrapolation region early in sampling
    • Assumptions/Dependencies: requires instrumentation to observe score differences along trajectories; product-specific tuning
  • MLOps metrics beyond FID
    • Sector: ML platforms, internal evaluation
    • Tools/Workflow: routinely track supervision loss, extrapolation distance statistics, and f_extrapolation fits per model/version; use them to triage regressions where FID is unchanged but safety/novelty risks shift
    • Assumptions/Dependencies: metric standardization and buy-in from teams

Long-Term Applications

The items below need further research, scaling, or standardization before broad deployment.

  • Standards for memorization and extrapolation audits
    • Sector: policy/regulation, industry consortia
    • Tools/Workflow: define audit protocols (memorization ratio thresholds, r⋆ trajectory limits, reporting of region-size controls); require disclosures on recipe choices that affect extrapolation freedom
    • Assumptions/Dependencies: community consensus on metrics and safe thresholds; regulatory processes and third-party audit infrastructure
  • Privacy-preserving diffusion training via region control
    • Sector: healthcare, finance, enterprise AI
    • Tools/Workflow: incorporate supervision-region management with differential privacy or regularization to reduce PII leakage while maintaining utility; certify reduced regurgitation risk using the paper’s diagnostics
    • Assumptions/Dependencies: formal guarantees for selective underfitting dynamics; domain-specific utility–privacy trade-off studies
  • Safety-critical extrapolation bounds
    • Sector: autonomous systems simulation, scientific imaging, legal evidence generation
    • Tools/Workflow: develop curricula/noise schedules and controllers that constrain inference trajectories to validated regions; add monitors that flag when sampling exits approved regions
    • Assumptions/Dependencies: rigorous characterization of “approved” regions; acceptance criteria for domain risks
  • New training curricula and timesteps/schedules
    • Sector: foundation model training
    • Tools/Workflow: design noise/variance schedules that optimize freedom of extrapolation without destabilizing training; teacher–student or curriculum strategies that progressively shape extrapolation regions
    • Assumptions/Dependencies: robust scaling laws for different data regimes; compute to iterate on curricula
  • Hybrid backbones combining supervision and extrapolation efficiency
    • Sector: model architecture R&D
    • Tools/Workflow: develop architectures that inherit transformers’ supervision efficiency and U-Nets’ extrapolation efficiency (e.g., hybrid conv-attention blocks, flow-dilated convs); validate via f_extrapolation comparisons
    • Assumptions/Dependencies: engineering complexity; large-scale ablations to confirm gains across domains
  • Generalization theory and benchmarks for selective underfitting
    • Sector: academia and applied research
    • Tools/Workflow: formalize the analogy to benign overfitting; create domain-spanning benchmarks that score supervision-region size vs generative novelty/quality
    • Assumptions/Dependencies: community adoption; reproducible access to training data or high-fidelity proxies
  • Content provenance and IP risk tooling
    • Sector: media, legal/compliance
    • Tools/Workflow: build model-level attribution tools that flag outputs likely produced from inside narrow supervision regions; integrate into review pipelines for broadcast/publishing
    • Assumptions/Dependencies: robust near-duplicate detection; access to hashes or fingerprints of training corpora
  • Domain-specific PAT libraries
    • Sector: developer ecosystem, open-source
    • Tools/Workflow: packaged modules for perception-aligned spaces, representation alignment losses, and architecture presets per domain; end-to-end recipes that expose the supervision–extrapolation decomposition
    • Assumptions/Dependencies: maintenance of domain encoders; licensing and community support
  • Adaptive, evaluator-in-the-loop creativity control
    • Sector: consumer and enterprise content generation
    • Tools/Workflow: real-time modulation of sampling paths and extrapolation freedom based on feedback from human preference models or downstream task evaluators
    • Assumptions/Dependencies: reliable evaluators; latency budgets; safety guardrails
  • Cross-modal extensions (text, code, tabular)
    • Sector: software development, documentation, data synthesis
    • Tools/Workflow: adapt selective underfitting diagnostics and PAT to diffusion variants for code/audio/text; target reduced memorization of licensed code and improved novelty of templates
    • Assumptions/Dependencies: modality-appropriate perceptual encoders and evaluation metrics; alignment with license and compliance policies

Notes on global feasibility:

  • Many diagnostics assume access to training distributions or strong proxies; black-box or proprietary data limit precision.
  • FID is not universally reliable; domain-appropriate quality metrics are needed for the decomposition.
  • “Perception” is domain-specific; PAT’s benefits hinge on high-quality encoders and robust latent spaces.
  • The findings are grounded in diffusion/flow models; portability to other generative paradigms (e.g., autoregressive transformers) requires adaptation.

Glossary

  • Analytic score function: A closed-form expression for the score of the empirical training distribution; in this paper, it is synonymous with the empirical score function computed from the training set. "Since this is the score function of the empirical training data, we refer to $s_\star$ as the empirical score function, often also called the analytic score function."
  • Benign overfitting: The phenomenon where models interpolate training data yet still generalize due to implicit biases of training procedures. "Benign overfitting \citep{bartlett2020benign,zou2021benign,frei2023benign} in classical supervised learning posits that implicit training biases (e.g., from the optimizer or regularization) guide the learner toward a “favorable” interpolating solution—one that attains zero training loss that generalizes to test data."
  • Bhattacharyya coefficient: A measure of overlap between probability distributions, used here to quantify the overlap of Gaussian components across timesteps. "We quantify this separation using the Bhattacharyya coefficient~(\cite{bhattacharyya1943divergence}, \Cref{app:theory_overlap})—a standard measure of distributional overlap—computed on the ImageNet dataset~\citep{deng2009imagenet} and in the latent space of the SD-VAE~\citep{rombach2022high}."
  • Classifier-Free Guidance (CFG): A sampling technique that combines conditional and unconditional score estimates with a guidance weight to improve sample quality. "CFG works because the difference between conditional and unconditional scores is recognizable; if these scores are identical, CFG reduces to standard conditional sampling."
  • Denoising score matching (DSM): A training objective where the model learns the score (gradient of log-density) by predicting the noise added to data, enabling denoising-based generation. "The DSM objective at timestep $t$ is:"
  • Diffusion models: Generative models that learn to reverse a noise-adding process by estimating score functions over time to synthesize samples. "Diffusion models have emerged as the principal paradigm for generative modeling across various domains."
  • Diffusion transformers: Transformer-based architectures adapted for diffusion modeling. "This framework not only clarifies the benefits of recent advances such as REPA~\citep{yu2024representation} and diffusion transformers~\citep{peebles2023scalable,ma2024sit}, but also suggests a general principle for designing better training recipes."
  • Empirical score function: The score (gradient of log-density) of the empirical training distribution, often computable in closed form for the DSM objective. "Since this is the score function of the empirical training data, we refer to $s_\star$ as the empirical score function, often also called the analytic score function."
  • Extrapolation function: A mapping that characterizes how well improvements in supervised score fitting translate into generative quality (e.g., FID). "In words, generative performance is decomposed into two components: (i) supervision loss $L$, which measures how well the empirical score is fitted by the model, and (ii) extrapolation function $f_\text{extrapolation}$, which characterizes the extrapolation behavior of the training recipe."
  • Extrapolation region: Parts of input space where the model is queried at inference without direct training supervision. "This observation prompts us to distinguish two distinct regions in the data space: the supervision region, where the model is supervised to approximate the empirical score during training, and the extrapolation region, where the model receives no supervision but is queried at inference."
  • FID (Fréchet Inception Distance): A metric for evaluating generative model quality by comparing feature distributions between generated and real data (lower is better). "Throughout, we use FID (lower is better, \cite{heusel2017gans}) as the primary metric for measuring generative performance."
  • Flow matching: A training paradigm equivalent to diffusion in this context, which learns vector fields to transform distributions. "Throughout this work, we use “diffusion” as an umbrella term for both diffusion and flow matching~\citep{lipman2022flow,liu2022flow}, as they are equivalent~\citep{gao2025diffusion}."
  • Freedom of Extrapolation: The hypothesis that more “room” to extrapolate outside supervised regions improves generative generalization. "We refer to this intuition as Freedom of Extrapolation: the more freedom the model has to extrapolate beyond the supervision region, the better its generalization tends to be."
  • Mixture of Gaussians: A probabilistic model representing a distribution as a weighted sum of Gaussian components; here, the noisy training distribution at each timestep. "a model is trained on noisy versions of the training data, which can be written as a mixture of Gaussians: $\hat{p}_t(x_t) = \frac{1}{N}\sum_{i=1}^{N}\mathcal{N}(x_t;\alpha_t x^{(i)}, \sigma_t^2 I)$."
  • Perception-Aligned Training (PAT): A proposed principle that encourages models to align outputs for perceptually similar inputs to improve extrapolation. "we propose a unified hypothesis: Perception-Aligned Training (PAT)."
  • REPA: A method that adds representation-alignment regularization to DSM training, improving generative quality largely via better extrapolation behavior. "REPA augments the standard DSM loss with an additional regularization term that aligns the model's intermediate representations with features from a pretrained encoder."
  • Sampling trajectories: The sequence of intermediate states produced during iterative denoising at inference. "At inference, sampling trajectories go beyond this region, where predictions are not directly supervised and must be extrapolated."
  • Scaling law: An empirical relationship linking model/training scale (or supervision loss) to performance metrics such as FID. "we introduce a quantitative scaling law framework for analyzing generative performance (\Cref{sec:scaling-analysis})."
  • Score function: The gradient of the log-density of a distribution; learning this function enables denoising and sample generation. "During training, they learn a score function--the gradient of the log density of the data distribution convolved with noise--by reconstructing data that has been corrupted with Gaussian noise, a process called denoising score matching"
  • Selective underfitting: The paper’s central phenomenon: models fit the empirical score well in supervised regions but underfit it elsewhere, which is crucial for generalization. "we refine this perspective, introducing the notion of selective underfitting: instead of underfitting the score everywhere, better diffusion models more accurately approximate the score in certain regions of input space, while underfitting it in others."
  • Softmax: A function converting scores into a probability distribution; here, it weights training points in the analytic score expression and becomes highly peaked. "making the softmax weights in \Cref{eq;empirical_score} extremely imbalanced: nearly $1$ on $x^{(i)}$ and nearly $0$ on all others."
  • Stochastic forward process: The noise-adding process that progressively transforms data into noise during training of diffusion models. "Diffusion models define a stochastic forward process that transforms a data distribution into a Gaussian distribution over timesteps $t \in [0, 1]$"
  • Supervision loss: The part of the objective that measures how well the model matches the empirical score on the supervised region. "This supervision loss typically correlates with generative performance: as model size increases, both supervision loss and FID decrease."
  • Supervision region: The subset of data space (thin shells around training examples) where the model receives direct supervision during training. "We call $\mathcal{T}_t(\delta)$ the supervision region: the union of thin spherical shells around the data points that, taken together, contain $x_t \sim \hat{p}_t$ with high probability."
  • U-Net: A convolutional neural network architecture with encoder-decoder skip connections; compared here to transformers for extrapolation and supervision efficiency. "Transformer vs. U-Net. We next examine the impact of architecture--an important axis of the training recipe--by comparing transformer (attention), U-Net (convolution), and U-Net (attention+convolution) models."