Adaptive Matching Distillation (AMD)

Updated 4 July 2026

The paper introduces AMD, a few-step generative distillation method that detects and corrects Forbidden Zones in teacher guidance.
It utilizes dynamic signal decomposition and weighted fake-teacher repulsion to adaptively balance semantic conditioning and corrective distribution matching.
Empirical results on image and video benchmarks demonstrate enhanced fidelity, reduced collapse, and improved reward scores compared to DMD baselines.

Adaptive Matching Distillation (AMD) is a few-step generative distillation method introduced for image and video generation as a reward-aware, self-correcting extension of Distribution Matching Distillation (DMD). It is designed for the regime in which a fast student generator is distilled from a pre-trained diffusion “real teacher” while a concurrently trained “fake teacher” regularizes the student’s evolving distribution. The central claim is that conventional DMD becomes unstable in “Forbidden Zones,” regions where the real teacher provides unreliable guidance and the fake teacher supplies insufficient repulsive force; AMD addresses this by explicitly detecting such regions with reward proxies, decomposing guidance into structurally distinct components, and sharpening the fake teacher’s repulsive landscape (Bai et al., 7 Feb 2026).

1. Problem setting and DMD formalism

Few-step generative models seek to compress a many-step diffusion sampling process into a small number of steps while preserving fidelity and diversity. In the formulation used by AMD, the student generator is denoted $G_\theta$, with latent input $z \sim \mathcal{N}(0, I)$ and sample $x = G_\theta(z)$. A diffusion time is sampled as $t \sim U[0,1]$, and the forward re-noising operator is

$F_t(x) = t \cdot x + (1 - t) \cdot \epsilon,\qquad \epsilon \sim \mathcal{N}(0, I).$
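The re-noising operator can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's code: the function name `renoise`, the batch shape, and the random array standing in for student samples are all assumptions made for the example.

```python
import numpy as np

def renoise(x, t, rng):
    """Apply F_t(x) = t * x + (1 - t) * eps with eps ~ N(0, I), as defined above."""
    eps = rng.standard_normal(x.shape)
    return t * x + (1.0 - t) * eps, eps

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # stand-in for a batch of student samples G_theta(z)
t = rng.uniform(0.0, 1.0)         # diffusion time t ~ U[0, 1]
x_t, eps = renoise(x, t, rng)
```

Under the convention written above, $t = 1$ returns the clean sample and $t = 0$ returns pure noise.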

DMD optimizes a contrastive objective defined on noisy states $F_t(x)$ under a fixed real-teacher distribution and a learned fake-teacher distribution:

$L_{\mathrm{DMD}} = - \mathbb{E}_{z,t}\left[\log p_{\mathrm{real}}(F_t(x)) - \log p_{\mathrm{fake}}(F_t(x))\right].$

Its score-matching gradient is

$\nabla_\theta L_{\mathrm{DMD}} = - \mathbb{E}_{z,t} \left[ \left( s_{\mathrm{real}}(F_t(x)) - s_{\mathrm{fake}}(F_t(x)) \right) \frac{\partial G_\theta(z)}{\partial \theta} \right].$
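To see the direction this gradient induces, consider a toy sketch in which unit-covariance Gaussians stand in for the real and fake distributions. The Gaussian choice, the means, and the step size are illustrative assumptions, and the sketch acts directly on a sample, skipping both the re-noising step and the generator Jacobian.

```python
import numpy as np

def gaussian_score(x, mean):
    """Score of N(mean, I): grad_x log p(x) = mean - x."""
    return mean - x

mu_real = np.array([2.0, 0.0])   # toy "real teacher" mode (assumption)
mu_fake = np.array([0.0, 0.0])   # toy "fake teacher" mode (assumption)
x = np.array([0.5, 1.0])         # current student sample

# Descending L_DMD moves the sample along s_real - s_fake.
direction = gaussian_score(x, mu_real) - gaussian_score(x, mu_fake)
x_new = x + 0.1 * direction
```

In this toy case the score difference reduces to `mu_real - mu_fake`, so the sample is pulled toward the real mode and pushed away from the fake one.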

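The real teacher provides attractive guidance that pulls student samples toward the target data distribution, while the fake teacher provides repulsive guidance that pushes the student away from its own modes and thereby mitigates collapse. Under a first-order approximation, this interaction can be expressed in sample space using denoised targets $\hat{x}_{0,\mathrm{real}}$ and $\hat{x}_{0,\mathrm{fake}}$, together with latent displacements, which in a form consistent with the definitions above read

$\Delta_{\mathrm{real}} = \hat{x}_{0,\mathrm{real}} - x,\qquad \Delta_{\mathrm{fake}} = \hat{x}_{0,\mathrm{fake}} - x.$

The corresponding contrastive potential, with both denoised targets held fixed, is

$\Phi(x) = \tfrac{1}{2}\left\|\Delta_{\mathrm{real}}\right\|^2 - \tfrac{1}{2}\left\|\Delta_{\mathrm{fake}}\right\|^2,$

with gradient

$\nabla_x \Phi(x) = -\left(\Delta_{\mathrm{real}} - \Delta_{\mathrm{fake}}\right) = \hat{x}_{0,\mathrm{fake}} - \hat{x}_{0,\mathrm{real}},$

and effective update

$x \leftarrow x + \eta\left(\Delta_{\mathrm{real}} - \Delta_{\mathrm{fake}}\right),$

where $\eta$ is a step size.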

This formalism is the baseline from which AMD departs. The method’s contribution is not to replace the DMD setting, but to modify how its push–pull signals are trusted and combined when the student drifts into pathological regions (Bai et al., 7 Feb 2026).

2. Forbidden Zones and the failure mode AMD targets

AMD defines the core failure region through the real-teacher energy $E_{\mathrm{real}}(x) = -\log p_{\mathrm{real}}(x)$. A Forbidden Zone is

$\mathcal{Z}_{F} = \left\{\, x : E_{\mathrm{real}}(x) > \tau \,\right\},$

where $\tau$ is a competence threshold beyond the real teacher’s empirical support. In these regions, the real teacher’s energy surface is described as fractured and ill-posed, so the real-teacher score $s_{\mathrm{real}}$ can become incoherent or hallucinated. At the same time, the fake teacher’s energy is flat in extreme tails, so repulsive gradients vanish. The combined force

$s_{\mathrm{real}}(F_t(x)) - s_{\mathrm{fake}}(F_t(x))$

therefore becomes unreliable or near-zero, which stalls optimization and can induce collapse (Bai et al., 7 Feb 2026).

The paper presents a generalized operator to unify DMD-like methods. The operator maps the real-teacher and fake-teacher signals to a student update and is parameterized by the real-teacher state, the fake-teacher parameters, and an auxiliary force such as a regression tether, adversarial boundary, or reward steering. Standard DMD is recovered by the linear choice, in which the update is the difference between the real-teacher and fake-teacher signals.

Within this view, prior methods are reinterpreted as implicit Forbidden-Zone avoidance strategies. DMD2 adds an adversarial auxiliary force; DMDR alternates DMD with RL steps driven by a reward term; D-DMD is read as expanding teacher support; MagicDistillation adapts the teacher on student samples, temporarily shrinking the problematic region. AMD differs in that it is explicitly organized around detecting the Forbidden Zone and reweighting the native distillation signals to escape it, rather than relying on indirect avoidance (Bai et al., 7 Feb 2026).

A common misconception is that AMD is simply reward steering added to DMD. The method is presented instead as a unified optimization framework in which reward proxies serve as diagnostics for teacher competence and sample distortion; the primary control remains a reweighted decomposition of the original DMD forces, coupled to strengthened fake-teacher repulsion (Bai et al., 7 Feb 2026).

3. Structural signal decomposition and adaptive prioritization

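AMD’s first major component is Dynamic Score Adaptation through structural signal decomposition. Writing $\Delta = \hat{x}_{0} - x$ for the displacement of a denoised target from the current sample, and using classifier-free guidance (CFG) with scale $w$, the real-teacher displacement is decomposed in the standard CFG form as

$\Delta_{\mathrm{real}}^{\mathrm{cfg}} = \Delta_{\mathrm{real}}^{c} + (w - 1)\left(\Delta_{\mathrm{real}}^{c} - \Delta_{\mathrm{real}}^{\varnothing}\right),$

where the superscripts $c$ and $\varnothing$ denote the conditional and unconditional real-teacher predictions. This yields the standard composition

$\Delta_{\mathrm{real}}^{\mathrm{cfg}} - \Delta_{\mathrm{fake}} = \left(\Delta_{\mathrm{real}}^{c} - \Delta_{\mathrm{fake}}\right) + (w - 1)\left(\Delta_{\mathrm{real}}^{c} - \Delta_{\mathrm{real}}^{\varnothing}\right).$

AMD names the first component the Distribution Matching term,

$\Delta_{\mathrm{DM}} = \Delta_{\mathrm{real}}^{c} - \Delta_{\mathrm{fake}},$

which anchors $x$ to the valid data manifold, and the second component the Conditional Alignment term,

$\Delta_{\mathrm{CA}} = (w - 1)\left(\Delta_{\mathrm{real}}^{c} - \Delta_{\mathrm{real}}^{\varnothing}\right),$

which enforces semantic conditioning. The AMD operator is then

$\mathcal{H}_{\mathrm{AMD}} = \alpha\,\Delta_{\mathrm{DM}} + \beta\,\Delta_{\mathrm{CA}},$

with dynamic coefficients $\alpha$ and $\beta$ determined by reward-aware diagnostics (Bai et al., 7 Feb 2026).

The principle is asymmetric. In low-reward samples, which are treated as indicative of Forbidden Zones, AMD increases $\alpha$ to prioritize corrective distribution matching and decreases $\beta$ to suppress potentially noisy conditional alignment. Outside such regions, the balance shifts back toward semantic refinement. The intended effect is to prevent destructive interference between a semantically strong but unreliable teacher signal and a weak fake-teacher repulsion (Bai et al., 7 Feb 2026).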
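The reweighting idea can be illustrated with a short NumPy sketch. It assumes, without confirmation from the paper, that the Distribution Matching term is the conditional real-teacher target minus the fake-teacher target and that the Conditional Alignment term is the usual CFG gap scaled by $w - 1$. The function name `amd_direction` and all array shapes are hypothetical.

```python
import numpy as np

def amd_direction(x0_cond, x0_uncond, x0_fake, w, alpha, beta):
    """Combine two guidance components with per-sample coefficients.

    Assumed split (standard CFG algebra, not taken from the paper):
      dm = conditional real target - fake target
      ca = (w - 1) * (conditional real target - unconditional real target)
    """
    dm = x0_cond - x0_fake
    ca = (w - 1.0) * (x0_cond - x0_uncond)
    return alpha[:, None] * dm + beta[:, None] * ca

rng = np.random.default_rng(0)
x0_cond, x0_uncond, x0_fake = rng.standard_normal((3, 4, 8))  # toy teacher outputs
w = 5.0
ones = np.ones(4)

# With both coefficients equal to 1, the usual CFG-guided DMD direction is recovered.
baseline = amd_direction(x0_cond, x0_uncond, x0_fake, w, ones, ones)
x0_cfg = x0_uncond + w * (x0_cond - x0_uncond)

# A low-reward regime: emphasize distribution matching, switch off alignment.
corrective = amd_direction(x0_cond, x0_uncond, x0_fake, w, 1.5 * ones, 0.0 * ones)
```

Setting both coefficients to 1 reproduces the unmodified direction, which makes the adaptive version a strict generalization in this sketch.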

Reward-awareness is formalized through a fixed reward model $R$ and a preference–competence alignment assumption. The assumption ties a high-reward set, obtained by thresholding $R$, to the region where the real teacher is competent, so that low reward serves as a proxy for Forbidden-Zone membership. To remove prompt-scale variance, AMD computes group-relative advantages within a prompt-specific group of $K$ samples $\{x_i\}_{i=1}^{K}$. With group statistics

$\mu = \frac{1}{K}\sum_{i=1}^{K} R(x_i),\qquad \sigma = \sqrt{\frac{1}{K}\sum_{i=1}^{K}\left(R(x_i) - \mu\right)^2},$

the normalized advantage is

$A_i = \frac{R(x_i) - \mu}{\sigma + \delta},$

with a small constant $\delta > 0$ for numerical stability. The adaptive coefficients are linearly modulated, in a form consistent with the asymmetric principle, as

$\alpha_i = 1 - s\,A_i,\qquad \beta_i = 1 + s\,A_i,$

with sensitivity $s > 0$, where $\alpha_i$ weights the distribution-matching term and $\beta_i$ the conditional-alignment term. The detection rule is direct: $A_i < 0$ indicates a low-reward sample relative to its group and triggers prioritization of the corrective DM term; an optional stronger gate uses $A_i < -\kappa$ for aggressive correction, with $\kappa > 0$ (Bai et al., 7 Feb 2026).
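The reward-to-coefficient pipeline can be sketched as follows. The mean and standard-deviation normalization, the linear form `1 - s * A` and `1 + s * A`, the clipping bound, and the default sensitivity are assumptions chosen to match the qualitative description, not values taken from the paper.

```python
import numpy as np

def group_advantages(rewards, delta=1e-6):
    """Normalize rewards within one prompt group (zero mean, unit scale)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + delta)

def adaptive_coefficients(adv, s=0.5, clip=2.0):
    """Hypothetical linear modulation: low advantage raises alpha, lowers beta."""
    a = np.clip(adv, -clip, clip)          # bound the modulation
    alpha = np.maximum(1.0 - s * a, 0.0)   # distribution-matching coefficient
    beta = np.maximum(1.0 + s * a, 0.0)    # conditional-alignment coefficient
    return alpha, beta

rewards = [1.0, 2.0, 4.0, 5.0]             # toy rewards for one prompt group
adv = group_advantages(rewards)
alpha, beta = adaptive_coefficients(adv)
flagged = adv < 0                          # proxy for Forbidden-Zone membership
```

With `s * clip <= 1`, as in the defaults here, both coefficients stay non-negative without the explicit floor, which is kept only as a safeguard.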

4. Repulsive Landscape Sharpening and the self-correcting mechanism

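AMD’s second major component is Repulsive Landscape Sharpening, which acts on the fake teacher. The aim is to ensure that low-reward failure cases do not remain regions of weak repulsion. The fake teacher, with parameters $\phi$, is trained with an advantage-weighted denoising loss, which in a form consistent with the description is

$L_{\mathrm{fake}}(\phi) = \mathbb{E}_{z,t,\epsilon}\left[\, w(A)\,\left\| \hat{x}_{0,\mathrm{fake}}(F_t(x), t; \phi) - x \right\|^2 \right],$

where $A$ is the group-relative advantage of the sample $x$ and $w(\cdot)$ is a weight that decreases monotonically with the advantage; the paper gives a concrete choice of $w$.

In low-reward regions, where $A < 0$, this gives $w(A) > 1$ and increases training pressure on the fake teacher precisely at those states. The paper interprets this as locally steepening the fake teacher’s energy landscape and thereby increasing the magnitude of its repulsive gradient, creating steep repulsive barriers against failure mode collapse (Bai et al., 7 Feb 2026).

The fake-teacher gradient is written as

$\nabla_\phi L_{\mathrm{fake}} = \mathbb{E}_{z,t,\epsilon}\left[\, w(A)\,\nabla_\phi \left\| \hat{x}_{0,\mathrm{fake}}(F_t(x), t; \phi) - x \right\|^2 \right].$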
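An advantage-weighted denoising loss of this kind might look as follows in NumPy. The exponential weight `exp(-gamma * A)` is a hypothetical stand-in for the paper's concrete choice, picked only because it is monotone decreasing and equals 1 at zero advantage; the function names are likewise illustrative.

```python
import numpy as np

def fake_teacher_weight(adv, gamma=1.0):
    """Hypothetical weight w(A) = exp(-gamma * A): decreasing in A, w(0) = 1."""
    return np.exp(-gamma * np.asarray(adv, dtype=float))

def weighted_denoising_loss(x0_pred, x0_target, adv):
    """Mean over the batch of weight * per-sample squared denoising error."""
    per_sample = np.mean((x0_pred - x0_target) ** 2, axis=1)
    weights = fake_teacher_weight(adv)
    return float(np.mean(weights * per_sample)), weights

adv = np.array([-1.0, 0.0, 1.0])   # low-, average-, and high-reward samples
x0_target = np.zeros((3, 4))
x0_pred = np.ones((3, 4))          # every sample has squared error 1
loss, weights = weighted_denoising_loss(x0_pred, x0_target, adv)
```

Because the weights do not depend on the fake-teacher parameters, they simply rescale each sample's contribution to the gradient, concentrating updates on low-reward states.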

The method’s training loop combines this weighted fake-teacher update with the adaptive student update. For each prompt $c$, AMD generates a group of $K$ samples, computes rewards and normalized advantages $A_i$, samples diffusion times and noises, obtains real and fake teacher denoised targets, forms the per-sample coefficients $\alpha_i$ (distribution matching) and $\beta_i$ (conditional alignment), and then uses the per-sample student gradient, in which the reweighted terms $\Delta_{\mathrm{DM},i}$ and $\Delta_{\mathrm{CA},i}$ replace the DMD score difference:

$g_i = -\left(\alpha_i\,\Delta_{\mathrm{DM},i} + \beta_i\,\Delta_{\mathrm{CA},i}\right)\frac{\partial G_\theta(z_i)}{\partial \theta}.$

The fake teacher receives per-sample weight

$w_i = w(A_i),$

with loss term

$\ell_i = w_i\,\left\| \hat{x}_{0,\mathrm{fake}}(F_{t_i}(x_i), t_i; \phi) - x_i \right\|^2.$

The update equations are

$\theta \leftarrow \theta - \eta_\theta\,\frac{1}{K}\sum_{i=1}^{K} g_i,\qquad \phi \leftarrow \phi - \eta_\phi\,\frac{1}{K}\sum_{i=1}^{K} \nabla_\phi \ell_i,$

where $\eta_\theta$ and $\eta_\phi$ are the student and fake-teacher step sizes.

The student is therefore updated via the AMD gradient operator rather than an explicit standalone scalar objective, while the fake teacher is optimized by the advantage-weighted denoising loss. Optional regularizations such as weight decay and gradient clipping are used per backbone (Bai et al., 7 Feb 2026).
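The sequencing of one iteration for a single prompt group can be sketched with stubbed teachers. Everything specific here is an assumption made for illustration: the teacher stubs, the toy reward, the linear coefficient form, the exponential fake-teacher weight, and the fact that the update is applied directly in sample space rather than back-propagated through a generator.

```python
import numpy as np

def amd_group_step(x, rewards, real_cond, real_uncond, fake,
                   w_cfg=3.0, s=0.5, lr=0.1):
    """One toy, sample-space step for a prompt group (illustrative only)."""
    # 1. Normalized group-relative advantages, clipped to bound the modulation.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = np.clip(adv, -2.0, 2.0)
    # 2. Denoised targets from the stubbed real and fake teachers.
    x0_c, x0_u, x0_f = real_cond(x), real_uncond(x), fake(x)
    # 3. Per-sample coefficients: low reward favors distribution matching.
    alpha = 1.0 - s * adv
    beta = 1.0 + s * adv
    # 4. Reweighted direction (assumed DM / CA split under CFG).
    dm = x0_c - x0_f
    ca = (w_cfg - 1.0) * (x0_c - x0_u)
    direction = alpha[:, None] * dm + beta[:, None] * ca
    # 5. Per-sample weights for the fake teacher's denoising loss.
    fake_weights = np.exp(-adv)
    return x + lr * direction, fake_weights, adv

target = np.array([1.0, 1.0])
real_cond = lambda x: np.broadcast_to(target, x.shape)          # stub teacher
real_uncond = lambda x: np.broadcast_to(0.5 * target, x.shape)  # stub teacher
fake = lambda x: x                                              # stub: predicts the sample itself

x = np.array([[0.0, 0.0], [3.0, 3.0], [1.0, 1.0], [2.0, 0.0]])
rewards = -np.linalg.norm(x - target, axis=1)   # toy reward: closeness to target
x_new, fake_weights, adv = amd_group_step(x, rewards, real_cond, real_uncond, fake)
```

In this toy run the sample farthest from the target has the lowest reward, so it receives both the strongest corrective pull and the largest fake-teacher weight.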

The paper reports practical ranges rather than a universal schedule. The sensitivity should keep both the distribution-matching and conditional-alignment coefficients non-negative for stability, and practical ranges are chosen so that this constraint is met for most samples. The paper’s concrete weighting choice for the fake teacher is reported to work robustly, while alternative monotone decreasing functions are also permitted (Bai et al., 7 Feb 2026).

5. Implementations, benchmarks, and empirical results

AMD is evaluated on image and video generation. For images, the reported datasets and evaluations include MS-COCO 2014 (COCO-10k, COCO-30k), ImageNet (50k generation), DrawBench, HPDv2, and GenEval. For video, the reported evaluations include VBench, VBench++, VideoGen-Eval, TA-Hard, and an internal I2V benchmark with 419 images. The backbones named in the paper are SDXL (2.6B) for text-to-image, SiT-XL/2 for class-conditional ImageNet, and Wan2.1 (1.3B, 14B) for text-to-video. Reward models are HPSv2 for SDXL, DINOv2 for SiT, and VideoAlign for video generation (Bai et al., 7 Feb 2026).

For SDXL, AMD follows the DMD2 protocol, uses group generation per prompt, and is reported on 8× H800 GPUs. For SiT-XL/2, the training is two-stage: initial Pure-DMD for 3,000 iterations, followed by 1,000–2,000 iterations of fine-tuning with batch size 512 and DINOv2 reward. For Wan2.1-1.3B streaming, the reported setup is 700 iterations, batch size 8, local attention size 12, frame sink size 3, VideoAlign reward, and outputs of 81 frames at 16 FPS for 5 s. For Wan2.1-14B bidirectional, the paper reports 800 steps and batch size 8. A learning rate is also specified for each of the SiT-XL/2 and Wan2.1 setups (Bai et al., 7 Feb 2026).

The principal image result highlighted in the abstract is that AMD improves the HPSv2 score on SDXL from 30.64 to 31.25, outperforming state-of-the-art baselines (Bai et al., 7 Feb 2026). On SDXL COCO-10k, ImageReward improves from 71.01 for DMD2 to 88.37 for AMD, and HPSv2 improves from 30.64 to 31.25. On GenEval overall at 4 NFEs, AMD reaches 0.57 versus 0.51 for DMD2 and is described as best among distilled models, with gains in counting and attribute dimensions. On HPDv2 across styles, AMD improves averaged PickScore, HPSv2, and ImageReward, with average HPSv2 increasing from 31.64 for DMD2 to 31.97 for AMD (Bai et al., 7 Feb 2026).

On ImageNet with SiT-XL/2 and 50k generation, AMD reduces FID from 3.5573 for DMD to 3.4690 and reduces sFID from 5.8499 to 5.7464. The reported IS is 316.02 for AMD versus 391.79 for DMDR, but the paper states that AMD avoids reward hacking and maintains good FID and sFID (Bai et al., 7 Feb 2026). This suggests that the method is positioned as a quality-preserving reward-aware correction mechanism rather than a procedure optimized to maximize reward proxies in isolation.

The video results are similarly framed as robustness gains under few-step generation. For Wan2.1-1.3B streaming, AMD improves VBench motion quality from 35.51 for LongLive to 59.26 (a gain of 23.75 points, roughly 67% relative), VBench total from 173.59 to 197.45, VideoGen-Eval total from 80.96 to 87.84, and TA-Hard total from 39.39 to 43.52. The paper also notes a slight TA trade-off consistent with the VideoAlign reward emphasizing motion aesthetics. For Wan2.1-14B, AMD improves VBench-I2V total from 126.36 for DMD2 to 130.72 and internal total from 118.61 to 122.15, with MQ also improving. Human preference winning rate is reported to favor AMD over DMD2 across DrawBench and HPDv2 categories (Bai et al., 7 Feb 2026).

Ablation studies report that Dynamic Adaptation and Repulsive Sharpening work synergistically and that both are necessary to improve FID, IS, and reward scores. Training curves show synchronized improvements in IS and reward, which the paper interprets as evidence of effective reward-aware correction. Visualizations on toy 2D multimodal data are said to show that AMD can follow reward to escape teacher-supported but low-reward regions, either selectively modeling high-reward modes or recovering the full distribution depending on the guidance regime (Bai et al., 7 Feb 2026).

6. Trade-offs, limitations, and acronym ambiguity

AMD introduces additional compute relative to DMD and DMD2 because it requires group generation with multiple samples per prompt, plus reward inference. The paper states that reward models such as HPSv2, DINOv2, and VideoAlign are relatively lightweight compared with teacher passes, so the overhead is modest. Since fake-teacher training is already part of DMD, AMD’s weighting is said to add negligible cost. In return, the method is reported to stabilize trajectories by avoiding Forbidden-Zone revisits, reducing collapse events and improving robustness (Bai et al., 7 Feb 2026).

The method’s trade-offs are explicit. In the reported experiments, AMD preserves higher fidelity at 4 NFEs compared with baselines. Some video tasks show text-alignment trade-offs when the reward emphasizes motion aesthetics, and the paper recommends adjusting the adaptation hyperparameters and reward-dimension emphasis accordingly. Practical guidance includes using group-relative rewards, clipping the reward-driven modulation, monitoring reward and quality metrics together, logging the fraction of samples flagged by the low-reward proxy condition, increasing the corrective emphasis or strengthening the curvature of the fake-teacher weighting function if collapse persists, and raising the conditional-alignment coefficient slightly if semantics degrade (Bai et al., 7 Feb 2026).

The limitations section centers on reward quality and operator design. If the reward model is noisy or misaligned, Forbidden-Zone detection weakens. Future work named in the paper includes robust or ensemble proxies, unsupervised Forbidden-Zone detectors, more advanced adaptive operators using momentum, orthogonal projections, or second-order information, formal guarantees on escape times and barrier heights under sharpened landscapes, and multidimensional coefficient scheduling for rewards such as VQ, MQ, and TA (Bai et al., 7 Feb 2026).

The acronym “AMD” is not unique in the literature. In vision pre-training, “AMD” can also denote “Asymmetric Masked Distillation,” a masked-autoencoding framework for pre-training relatively small ViTs by combining a lower-masking-ratio frozen teacher with a high-masking-ratio student and serial direct-plus-generation feature alignment (Zhao et al., 2023). In explainable person re-identification, “AMD” denotes “Attribute-guided Metric Distillation,” a post-hoc method that decomposes a frozen target model’s pairwise distance into attribute-wise contributions and attention maps (Chen et al., 2021). These methods are unrelated in objective, architecture, and application domain. In current few-step generative modeling, however, Adaptive Matching Distillation specifically refers to the Forbidden-Zone-aware distillation framework introduced for DMD-based image and video generation (Bai et al., 7 Feb 2026).