DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
Abstract: Diffusion LLMs intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked diffusion LLMs with three components: (1) a continuous latent space with semantic capabilities, learned by an auto-encoder fine-tuned from an existing masked diffusion LLM; (2) a latent diffusion model learning the prior over the encoder distribution; (3) a consistency model distilling the learned prior into a few-step latent generative model. We show that, even without distillation, our latent-guided diffusion model outperforms the masked diffusion baseline while significantly accelerating inference. Consistency distillation further lowers the computational overhead of continuous diffusion, such that the latent is generated in negligible time compared to discrete decoding.
Explain it Like I'm 14
What is this paper about?
This paper introduces DiLaDiff, a new way to make AI models write text faster and better. It mixes two ideas:
- a “big-picture” continuous summary of a sentence (called a latent), and
- a word-by-word (token) decoder that fills in the exact words.
By combining them, the model can generate several words at once without getting confused, keeping both speed and quality high.
What were the main questions?
The researchers focused on three simple questions:
- How can we make diffusion-based text models write faster without the text getting worse?
- Can we give the model a continuous “idea sketch” of a sentence that helps it keep words consistent with each other?
- Can we compress the slow parts into just a few steps so it’s almost as fast as regular decoding?
How did they do it?
Think of writing a sentence like building a house:
- The “tokens” (words) are the bricks.
- The “latent” is the blueprint—the big-picture plan of how things should fit.
- “Diffusion” is like taking a noisy, blurry plan and gradually cleaning it up into a clear one.
Here’s the approach in plain terms:
- Build a meaningful “blueprint” (latent) for sentences using an auto-encoder
- Encoder: turns a sentence into a continuous vector (the blueprint).
- Decoder: turns that blueprint back into words.
- They start from a strong existing word-filler model (a masked diffusion LLM) so the decoder already knows how to write words well.
- They train the encoder+decoder carefully so the blueprint carries actual meaning (like topic and style) and is smooth (small changes in the latent cause small, sensible changes in the sentence).
- Learn to generate blueprints with continuous diffusion
- They train a diffusion model that starts from noise and produces good blueprints.
- This is “continuous” (numbers), not “discrete” (word IDs), which lets them use powerful math tools from image diffusion.
- Speed it up with distillation (teacher → student)
- The full diffusion process to create latents can be slow (lots of steps).
- They train a faster “student” to imitate the slow “teacher,” using a method called MeanFlow.
- Result: the student produces the blueprint in just a few steps, but still keeps quality high.
- Decode multiple words in parallel, guided by the blueprint
- Once the blueprint is ready, the token decoder fills in the words.
- Because the blueprint already captures how words should relate, the decoder can write several words at once without losing coherence.
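The steps above can be sketched end to end with toy stand-ins. This is a minimal illustration, not the paper's implementation: `toy_avg_velocity` and `toy_token_logits` are hypothetical placeholders for the distilled latent model and the latent-conditioned decoder, the latent follows a simple straight-line path from noise to data rather than the paper's noise schedule, and `MASK`, `sample_latent`, and `decode_parallel` are names invented here.

```python
import numpy as np

MASK = -1  # hypothetical id for a masked position

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sample_latent(avg_velocity, dim, steps, rng):
    """Few-step latent sampling from noise (t=1) to a clean latent (t=0)."""
    z = rng.standard_normal(dim)
    times = np.linspace(1.0, 0.0, steps + 1)
    for t, r in zip(times[:-1], times[1:]):
        # The average velocity over [r, t] gives the whole jump in one call.
        z = z - (t - r) * avg_velocity(z, r, t)
    return z

def decode_parallel(token_logits, z, length, steps, rng):
    """Fill in several masked positions per step, conditioned on the latent z."""
    tokens = np.full(length, MASK)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # Spread the remaining masked positions evenly over the remaining steps.
        n_reveal = int(np.ceil(masked.size / (steps - step)))
        chosen = rng.choice(masked, size=n_reveal, replace=False)
        probs = softmax(token_logits(tokens, z)[chosen])
        for pos, p in zip(chosen, probs):
            tokens[pos] = rng.choice(p.size, p=p)
    return tokens

# Toy stand-ins (assumptions): a single target "blueprint" and a decoder whose
# preferred token at each position depends on the sign of the latent.
TARGET = np.array([1.0, -2.0, 0.5])
VOCAB = 4

def toy_avg_velocity(z, r, t):
    return (z - TARGET) / t  # exact for a straight path toward one target

def toy_token_logits(tokens, z):
    shift = int(z[0] > 0)
    positions = np.arange(tokens.size)
    logits = np.zeros((tokens.size, VOCAB))
    logits[positions, (positions + shift) % VOCAB] = 50.0
    return logits

rng = np.random.default_rng(0)
z = sample_latent(toy_avg_velocity, dim=3, steps=5, rng=rng)
tokens = decode_parallel(toy_token_logits, z, length=8, steps=3, rng=rng)
```

In this toy setup the latent is produced in 5 calls and the 8 tokens in 3 decoding steps, so several positions are filled per step while all of them read the same blueprint.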
Key terms, simply:
- Token: a word or piece of a word.
- Latent space: a space of continuous numbers summarizing the meaning of a sentence (the blueprint).
- Diffusion: a process that gradually turns noise into a meaningful object.
- Distillation: teaching a small/fast model to behave like a big/slow model.
What did they find, and why does it matter?
Main results:
- Better text with fewer decoding steps: Their hybrid model without distillation (LaDiff) improves quality compared to a standard masked diffusion LLM, even when decoding many words per step.
- Much faster generation: At a realistic batch size (32 prompts at once), LaDiff can be about 7× faster while still improving text quality metrics.
- Tiny overhead for the latent step after distillation: With the distilled version (DiLaDiff), generating the blueprint takes only a few steps (e.g., 5) and adds about 5% extra time compared to word decoding—so it’s nearly “free” in practice.
- Strong quality metrics:
- Lower generative perplexity (GenPPL), which means a separate strong LLM finds the text more predictable and fluent (lower is better).
- Higher MAUVE, a score that checks how human-like and coherent the text is (higher is better).
- Stable diversity with temperature changes: When they lower the sampling temperature (making the model pick safer words), their model keeps good variety because the blueprint already encodes diversity. Baselines tend to become repetitive when you lower temperature.
Why this matters:
- Earlier diffusion LLMs struggled to write many tokens in parallel without garbling the text because they treated tokens too independently.
- The blueprint (latent) captures global meaning and word-to-word relationships, so the decoder doesn’t get lost when it fills several blanks at once.
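A two-phrase toy case makes this concrete. The sketch below computes, exactly rather than by sampling, how much probability lands on valid phrases when both tokens are drawn independently from their marginals, with and without a latent that selects the phrase. The phrases, probabilities, and the two-valued latent are illustrative assumptions, not data from the paper.

```python
from itertools import product

# Toy data distribution over two-token phrases (an illustrative assumption).
joint = {("new", "york"): 0.5, ("los", "angeles"): 0.5}

def marginals(dist):
    """Per-position marginal distributions of a two-token distribution."""
    first, second = {}, {}
    for (a, b), p in dist.items():
        first[a] = first.get(a, 0.0) + p
        second[b] = second.get(b, 0.0) + p
    return first, second

def independent_decode(dist):
    """Distribution obtained when both tokens are sampled from their marginals."""
    first, second = marginals(dist)
    return {(a, b): pa * pb
            for (a, pa), (b, pb) in product(first.items(), second.items())}

def valid_mass(dist, reference):
    """Probability mass placed on phrases that exist in the reference data."""
    return sum(p for pair, p in dist.items() if pair in reference)

# Without a latent: "new angeles" and "los york" receive half of the mass.
no_latent = independent_decode(joint)

# With a latent that picks the phrase, each conditional is a point mass, so
# sampling the two tokens independently given z is exact.
latent_prior = {"z_ny": 0.5, "z_la": 0.5}
conditional = {
    "z_ny": {("new", "york"): 1.0},
    "z_la": {("los", "angeles"): 1.0},
}
with_latent = {}
for z, pz in latent_prior.items():
    for pair, p in independent_decode(conditional[z]).items():
        with_latent[pair] = with_latent.get(pair, 0.0) + pz * p

print(valid_mass(no_latent, joint), valid_mass(with_latent, joint))
```

Real latents are continuous and only approximately remove the dependence between tokens, so this is the idealized limit of the argument.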
What’s the impact?
- Faster, high-quality text generation: Useful for chatbots, content tools, and any system that needs quick, coherent text.
- Better control: Because the blueprint carries meaning, you can imagine guiding the model toward certain topics or styles by steering the latent.
- Building block for future work: This approach can be combined with other speed-up methods (like distilling the token decoder itself) to go even faster.
Limitations and next steps:
- The distilled version (DiLaDiff) is close to, but not exactly equal to, its teacher in quality—there’s room to improve distillation.
- They did not distill the discrete decoder in this work; combining both latent and decoder distillation could push performance further.
- Some training tricks (like how to regularize the latent) are heuristic and could be made more principled.
In short: DiLaDiff gives the model a useful “idea sketch” before writing and then teaches it to make that sketch very quickly. That lets it write more words at once without mixing things up—so it’s both faster and better.
Knowledge Gaps
Below is a single, consolidated list of the key gaps, limitations, and open questions left unresolved by the paper. Each item is phrased to be concrete and actionable for follow-up work.
- Conditional generation coverage is absent: the paper focuses on unconditional generation. Evaluate DiLaDiff on prompt-conditioned tasks (instruction following, paraphrase, summarization) and design architectures that integrate prompt signals into the latent and discrete decoder (e.g., cross-attention to prompts in both channels).
- Human evaluation and richer automatic metrics are missing: current evaluation relies on GenPPL (GPT-2 Large), MAUVE, and entropy. Add human studies (fluency, coherence, non-redundancy), modern LLM-as-judge metrics, repetition/degeneracy detectors, factuality, and toxicity/bias metrics.
- Limited datasets and domains: results are only on OpenWebText with the BERT-base tokenizer. Test on diverse corpora (news, books, code, dialogue), multilingual data, morphologically rich languages, and different tokenizers; assess robustness to domain shift and multilingual generalization.
- Long-context behavior is untested: sequences are capped at L=1024. Evaluate scaling to longer contexts (e.g., 4k–32k), measure paragraph-level and document-level coherence, and profile throughput/memory at long lengths.
- Baseline scope is narrow: comparisons are mainly to MDLM and DUO. Include strong autoregressive LLMs (and decoding variants) and recent continuous/discrete diffusion baselines to contextualize the speed–quality frontier.
- Discrete decoder is not distilled: few-step performance is limited primarily by discrete decoding. Integrate decoder distillation (e.g., SDTT, DCD) into DiLaDiff for joint latent+decoder distillation and benchmark in very low-step regimes (cont ≤ 5, disc ≤ 16).
- Distillation methods and samplers underexplored: MeanFlow with Euler is used; ablate and compare Consistency Models, Terminal Velocity Matching, flow-map models, higher-order ODE solvers, stochastic samplers, schedule-aware distillation, and gamma-sampling settings; investigate true one-step latents.
- Theoretical sufficiency of the latent for conditional independence is not empirically verified: the claim that the posterior factorizes given z implies z is a sufficient statistic for token dependencies. Quantify dependency reduction via mutual information, copula-based measures, or token correlation matrices over time t, and test sufficiency at high mask ratios.
- Latent regularization is heuristic: masking/noising/dropout strategies and tanh-logSNR schedules are tuned empirically. Develop principled objectives (e.g., VAE-style KL, information bottlenecks, MMD-programmable priors, contrastive regularization), and automated schedule tuning (e.g., bilevel optimization).
- Controllability is claimed but not demonstrated: beyond semantic proximity tests, show controlled generation via latent manipulation (attribute vectors, classifier-free guidance in z, conditional constraints), quantify control accuracy, disentanglement, and editability (local vs global edits).
- Robustness and failure modes are under-characterized: analyze sensitivity to sampling temperature/nucleus thresholds, mode collapse/repetition, and syntactic errors; add calibration curves for entropy vs quality and error taxonomy with qualitative examples.
- Parallel token acceptance is not quantified: provide metrics on how many tokens can be accurately updated per step (acceptance ratio, joint error rate) and how coherence degrades as disc decreases; characterize the trade-off vs latent guidance strength.
- Resource and scaling analysis is limited: overheads are reported on a single GPU. Report training/inference compute, memory, energy, multi-GPU scaling, latency breakdown (latent vs discrete), and hardware variability; assess throughput at different batch sizes and sequence lengths.
- Alternative latent priors are unexplored: compare diffusion priors to normalizing flows, autoregressive priors over z, score-based SDEs, or hybrid priors; study their impact on learnability, sample efficiency, and controllability.
- Encoder choice is fixed to BERT features: evaluate modern encoders (e.g., recent bidirectional transformers), end-to-end learned encodings, or task-specific representations; measure effects on latent semantics, decodability, and diffusion learnability.
- Long-range dependencies and global discourse structure are not assessed: measure cross-sentence coherence and discourse consistency; use tasks like story generation or narrative planning to test whether z captures paragraph-level semantics.
- Latent dimensionality vs decoder capacity trade-offs lack systematic study: run broader sweeps over compression ratios and decoder sizes, and develop guidelines (or adaptive compression) to balance representation capacity, smoothness, and decodability.
- Discrete sampling strategies are ad hoc: temperature/nucleus choices are tuned empirically. Investigate principled sampling under latent guidance (e.g., posterior correction for categorical sampling, remasking strategies, inference-time scaling) and their impact on quality/diversity.
- ELBO weighting modification (replacing the ELBO weight with −1) lacks theoretical grounding: analyze its effect on objective variance, likelihood tightness, and convergence with and without latent conditioning; explore alternatives (variance reduction, reweighting schemes).
- Self-conditioning doubles NFEs in DiLaDiff: explore architectural or procedural changes to reduce or eliminate this overhead (shared states, recurrent latent updates, teacher-free self-conditioning), and quantify the trade-off vs quality.
- Distillation curricula and schedule mismatches are not explored: experiment with progressive distillation (teacher steps/schedules), knowledge transfer under mismatched noise schedules, and multi-teacher ensembles; study stability and sample quality across curricula.
- Integration with conditional tasks and controls is open: design cross-attention pathways that jointly condition on prompts and latent z, and evaluate steering with content/planning controls; test robustness to prompt length and complexity.
- Safety, bias, and toxicity impacts are not studied: add standardized safety evaluations, controllability for safe outputs via latent constraints, and investigate whether latent modeling amplifies or mitigates harmful attributes.
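One of the measurements suggested above, the dependency between token positions, can be estimated from samples. The sketch below is one possible estimator, assuming generated sequences are available as an integer array; `pairwise_mutual_information` is a name introduced here, and the plug-in estimate is biased upward when the vocabulary is large relative to the number of samples.

```python
import numpy as np

def pairwise_mutual_information(samples, i, j):
    """Empirical mutual information (in nats) between token positions i and j.

    samples: integer array of shape (num_samples, seq_len).
    """
    _, a_idx = np.unique(samples[:, i], return_inverse=True)
    _, b_idx = np.unique(samples[:, j], return_inverse=True)
    counts = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(counts, (a_idx, b_idx), 1)      # joint histogram of token pairs
    joint = counts / counts.sum()
    pa = joint.sum(axis=1, keepdims=True)     # marginal of position i
    pb = joint.sum(axis=0, keepdims=True)     # marginal of position j
    nz = joint > 0                            # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))

# Deterministic toy data: position 1 copies position 0, position 2 is unrelated.
col0 = np.tile([0, 0, 1, 1], 250)
col2 = np.tile([0, 1, 0, 1], 250)
samples = np.stack([col0, col0, col2], axis=1)

mi_dependent = pairwise_mutual_information(samples, 0, 1)
mi_independent = pairwise_mutual_information(samples, 0, 2)
```

A sufficiency test along these lines would compare the value for unconditional samples with the value for samples decoded from a fixed latent; a large drop would suggest that the latent absorbs the dependency.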
Practical Applications
Immediate Applications
The following applications can be deployed with modest engineering effort by adapting the paper’s training/inference recipe (auto-encoder + latent diffusion + MeanFlow distillation) to target corpora and integrating with existing masked diffusion LLM (MDLM) infrastructure.
- Latent-guided high-throughput text generation for content platforms
- Sectors: software, media/creative, marketing, e-commerce
- What: Replace pure discrete diffusion LMs with DiLaDiff to generate articles, ad copy, product descriptions, and social posts with higher coherence under parallel decoding and lower latency.
- Tools/workflows:
- A “DiLaDiff Inference Engine” (e.g., Triton/TensorRT backend) that generates latents in ~5 steps and decodes in 32–64 steps, exploiting batch-friendly latent diffusion.
- A temperature/nucleus “quality dial” tuned per domain; latent guidance enables lower temperature without collapsing diversity.
- Assumptions/dependencies:
- Requires a pre-trained MDLM decoder and domain-specific fine-tuning of the auto-encoder and latent prior.
- Gains depend on batch sizes and hardware (paper reports strong gains at BS=32; metrics on OpenWebText).
- Faster, cheaper server-side text assistants
- Sectors: software/SaaS, customer support, productivity
- What: Use DiLaDiff to reduce inference cost and carbon footprint for chatbots, email drafting, and summarization at scale by decreasing discrete decoding steps while preserving quality.
- Tools/workflows:
- A “throughput optimizer” that selects (cont, disc) per request (e.g., cont=5, disc=64 for speed-sensitive; cont=200, disc=1024 for quality).
- Batch scheduler that exploits latent diffusion’s better scaling with BS to consolidate traffic.
- Assumptions/dependencies:
- Latent diffusion overhead is small after distillation (~5% reported) but discrete decoding remains a cost driver.
- Not a drop-in for AR LLMs; requires MDLM-based stack adoption.
- Parallel text infilling/editing for productivity apps
- Sectors: productivity software, documentation, IDE writing assistants
- What: Mask-based editing workflows (paragraph infill, rewrite, paraphrase) benefit from DiLaDiff’s ability to recover many tokens per step with better joint consistency.
- Tools/workflows:
- “Smart Infill” feature: select spans, generate multiple coherent options in parallel via latent-guided decoding.
- “Semantic variability slider” that perturbs latents to explore controlled alternatives without destabilizing syntax.
- Assumptions/dependencies:
- Requires integrating masked decoding UX and ensuring on-the-fly latent sampling latency targets are met.
- Domain-adapted generators with semantic compression for caching
- Sectors: finance, healthcare (non-diagnostic workflow), legal ops
- What: Store/reuse semantically meaningful latents (≈2× compression sweet spot in the paper) for fast re-generation or minor edits to recurring documents (templates, boilerplate).
- Tools/workflows:
- “Latent cache”: store encoder latents for common templates; regenerate final text via fast discrete decoding; allow latent perturbations for variation.
- Assumptions/dependencies:
- Latents are not a generic doc embedding; they are decoder-facing and trained for a given MDLM.
- Energy/Cost optimization for data centers
- Sectors: energy, cloud computing, MLOps
- What: Replace many-step discrete diffusion with hybrid DiLaDiff to reduce NFEs and wall-time, cutting power and cost per token.
- Tools/workflows:
- MLOps dashboards that track wall-time allocation between continuous and discrete paths, selecting distilled latent steps to minimize energy per output.
- Assumptions/dependencies:
- Real-world energy savings vary with hardware utilization, batching, and the use of mixed-precision kernels.
- Robust paraphrasing and style transfer in education tools
- Sectors: education, writing aids
- What: Use latent perturbations to generate controlled paraphrases preserving semantics (as shown by BERTScore experiments) while modulating lexical variation.
- Tools/workflows:
- “Style slider” that injects calibrated latent noise for gentle-to-strong rewrites; batch generate alternates for teachers/students.
- Assumptions/dependencies:
- Requires careful UX to prevent plagiarism/over-similarity; ensure age-appropriate guardrails (not addressed in the paper).
- Compliance-friendly generation with lower sampling temperature
- Sectors: regulated industries (finance, healthcare documentation, government)
- What: Latent guidance lets decoders run at lower temperature while maintaining diversity, reducing out-of-policy (non-compliant) outputs and simplifying human review.
- Tools/workflows:
- “Conservative mode” knob for safe text; combine with nucleus thresholds for auditability.
- Assumptions/dependencies:
- Safety/guardrails still needed; study shows entropy preservation under lower temperature but not full safety.
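The "throughput optimizer" idea above amounts to choosing a (cont, disc) pair per request and accounting for the resulting network calls. A minimal sketch follows, using the two example pairs quoted in the list; `StepBudget`, `pick_budget`, and `estimated_cost` are hypothetical names, and the per-call costs are inputs to be measured on real hardware, not values from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepBudget:
    cont: int                  # latent (continuous) sampling steps
    disc: int                  # discrete decoding steps
    nfes_per_latent_step: int  # latent denoiser calls per latent step

    @property
    def latent_nfes(self) -> int:
        return self.cont * self.nfes_per_latent_step

# Presets built from the (cont, disc) examples in the text. Two denoiser calls
# per distilled step follows the self-conditioning note in the glossary; one
# call per step for the quality preset is an assumption.
SPEED = StepBudget(cont=5, disc=64, nfes_per_latent_step=2)
QUALITY = StepBudget(cont=200, disc=1024, nfes_per_latent_step=1)

def pick_budget(latency_sensitive: bool) -> StepBudget:
    """Choose a preset from a single request-level flag."""
    return SPEED if latency_sensitive else QUALITY

def estimated_cost(budget: StepBudget, latent_call_cost: float,
                   disc_call_cost: float) -> float:
    """Relative cost of one generation, given measured per-call costs."""
    return budget.latent_nfes * latent_call_cost + budget.disc * disc_call_cost

budget = pick_budget(latency_sensitive=True)
print(budget.latent_nfes, budget.disc)
```

A production scheduler would likely also take batch size into account, since the text notes that the latent path scales better with batch size than discrete decoding.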
Long-Term Applications
These scenarios need additional research, scaling, or system integration (e.g., discrete decoder distillation, task-specific fine-tuning, multimodal fusion, safety).
- Distilled few-step end-to-end hybrids rivaling SOTA few-step methods
- Sectors: software, mobile/on-device assistants
- What: Combine DiLaDiff’s latent distillation with discrete decoder distillation (e.g., SDTT/DCD-style) to achieve low-step inference end-to-end for real-time applications and edge devices.
- Tools/products:
- “Few-step hybrid text generator” delivering near-teacher quality with <10 total steps.
- Assumptions/dependencies:
- Paper notes discrete decoder is not yet distilled; closing this gap is future work.
- Latent-controlled, constraint-following generation (semantic controls)
- Sectors: media, legal/compliance, enterprise content authoring
- What: Shape latent trajectories to meet high-level constraints (style, tone, reading level), enabling controllable generation beyond token-level logits hacks.
- Tools/products:
- “Latent controller” library for adding constraint vectors or learned control modules (akin to ControlNet but for text latents).
- Assumptions/dependencies:
- Requires research on disentanglement and robust control in the latent space; risk of content drift without guarantees.
- Structured sequence generation where token dependency matters (code, contracts)
- Sectors: software engineering, legal tech
- What: Use latent-guided parallel decoding to better maintain long-range dependencies in code or legal clauses while accelerating generation.
- Tools/products:
- “Hybrid code assistant” that trains an auto-encoder/latent prior on code corpora; integrates with IDEs for fast skeleton generation and infill.
- Assumptions/dependencies:
- Needs code/data-specific training; strong syntax/semantic constraints may still favor AR models or require additional validators.
- Privacy-first, on-premise document drafting with semantic compression
- Sectors: healthcare (clinical note drafting), finance (reporting), government
- What: Run DiLaDiff pipelines on private clusters; cache latents for frequent templates; perturb latents for variants while keeping PHI inside perimeter.
- Tools/products:
- “Private latent cache” and “template expander” appliances deployed on secure infrastructure.
- Assumptions/dependencies:
- Clinical/financial correctness not guaranteed by generative quality metrics; requires task-aligned training and compliance review.
- Retrieval and memory via latent storage (semantic reconstruction)
- Sectors: knowledge management, search
- What: Store compressed, decodable latents for documents/emails to reduce storage and enable reconstructive summarization/editing.
- Tools/products:
- “Reconstructive RAG”: retrieve latents + decode with constraints to produce tailored summaries or updates.
- Assumptions/dependencies:
- Latent space is trained for generation, not robust retrieval; needs alignment with retrieval objectives and stability guarantees.
- Multimodal planning and instruction following in robotics
- Sectors: robotics, human-robot interaction
- What: Use latent-guided text to maintain coherent plans/instructions under parallel decoding; integrate with planners that consume text outlines.
- Tools/products:
- “Latent planner interface” that maps goal constraints to latent controls for concise plan texts.
- Assumptions/dependencies:
- Requires task rewards and grounding; text quality alone does not ensure plan executability.
- Energy-aware, adaptive inference schedulers
- Sectors: cloud, green AI
- What: Use learned policies to pick (cont, disc) and sampler settings per request to optimize energy/latency/quality trade-offs in real time.
- Tools/products:
- “Green scheduler” that leverages battery/grid signals and SLAs to allocate latent vs. discrete steps.
- Assumptions/dependencies:
- Needs telemetry and policy learning; interactions with batch queueing and QoS must be engineered.
- Safety-aligned generation via latent-space training signals
- Sectors: all regulated domains
- What: Inject safety constraints in the latent prior (e.g., penalty on trajectories leading to unsafe completions), aiming for earlier, global control than token-level filtering.
- Tools/products:
- “Safety-shaping” routines that fine-tune the latent diffusion with preference or rule-based signals.
- Assumptions/dependencies:
- Active research area; requires reliable safety signal definitions and evaluation beyond perplexity/MAUVE.
- Curriculum and assessment tools with controlled semantic variation
- Sectors: education
- What: Generate families of questions/explanations with calibrated semantic drift for practice/assessment, using latent noise schedules to space difficulty.
- Tools/products:
- “Difficulty dial” that maps latent perturbation magnitudes to content variation levels.
- Assumptions/dependencies:
- Needs alignment between latent distance and pedagogical difficulty; evaluation with educators.
- Compressive caching for serving pipelines
- Sectors: MLOps, inference platforms
- What: Cache intermediate latents during multi-turn interactions to resume, branch, or audit generations without re-running full pipelines.
- Tools/products:
- “Session latent cache” with versioning and audit trails; supports branch-and-compare via deterministic decoding seeds.
- Assumptions/dependencies:
- Requires stable latent-to-text mapping across model updates or version pinning.
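The latent cache ideas in this list share one requirement: a stored latent is only meaningful for the model version that produced it. The sketch below shows one way to enforce version pinning and to produce perturbed variants; `LatentCache` and its methods are hypothetical, and the Gaussian perturbation is an assumption about how variation might be injected.

```python
import numpy as np

class LatentCache:
    """Stores latents keyed by (model_version, key) so stale entries are never reused."""

    def __init__(self, model_version):
        self.model_version = model_version
        self._store = {}

    def put(self, key, latent):
        # Copy on write so later changes to the caller's array do not leak in.
        self._store[(self.model_version, key)] = np.array(latent, dtype=float, copy=True)

    def get(self, key):
        """Return a copy of the latent for the current model version, or None."""
        latent = self._store.get((self.model_version, key))
        return None if latent is None else latent.copy()

    def perturbed(self, key, scale, rng):
        """Return the cached latent plus Gaussian noise of the given scale."""
        latent = self.get(key)
        if latent is None:
            raise KeyError(key)
        return latent + scale * rng.standard_normal(latent.shape)

cache = LatentCache(model_version="v1")
cache.put("invoice-template", [0.0, 1.0, -1.0])
variant = cache.perturbed("invoice-template", scale=0.1, rng=np.random.default_rng(0))

# After a model update, entries written under the old version are not returned.
cache.model_version = "v2"
stale = cache.get("invoice-template")
```

Decoding a cached or perturbed latent back to text would still go through the discrete decoder, which is outside the scope of this sketch.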
Cross-cutting dependencies and caveats
- Model dependence: Gains demonstrated on OpenWebText with BERT features and MDLM backbones; transferring to other tokenizers, vocabularies, or domains requires re-training and hyperparameter re-tuning (regularization strength, compression ratio).
- Distillation scope: The current paper distills only the latent ODE; few-step end-to-end performance is expected to improve further once the discrete decoder is also distilled.
- Evaluation: GenPPL/MAUVE improvements correlate with quality but are not task guarantees (e.g., factuality, safety, legal correctness). Human/task-based evaluations are needed per deployment.
- Infrastructure: Benefits are most pronounced at moderate-to-large batch sizes and GPU inference; on-device or CPU-only scenarios may need further engineering (kernel fusion, quantization).
- Safety/compliance: The method improves efficiency and coherence but does not itself implement guardrails; combine with content filters, policy constraints, and auditing.
Glossary
- Ancestral sampling: A stepwise generative procedure that samples from successive conditional distributions in the reverse process. "We then perform ancestral sampling in the token space"
- Argmax operator: The function that selects the index of the maximum value, used to decode discrete tokens from continuous representations. "or argmax operator in the case of one-hot embeddings."
- Auto-encoder: An encoder–decoder model that learns a compressed latent representation of data for reconstruction or downstream generation. "We propose a recipe for training a text auto-encoder, where the decoder is initialized from a pre-trained discrete diffusion baseline."
- Average velocity: In MeanFlow, the time-averaged displacement (vector field) along an ODE trajectory between two times, used for few-step sampling. "MeanFlow tasks our student DiLaDiff (Distilled Latent-augmented Diffusion LLM) with learning the average velocity [...]"
- BERTScore-F1: A semantic similarity metric for text generation based on contextual embeddings. "i.e., decreasing BERTScore-F1 (Zhang et al., 2020) between the decoded sentences"
- Categorical prior: A prior over discrete categories (e.g., a mask token) used in the forward corruption process of discrete diffusion. "a categorical prior generating a fully masked sequence at [...]."
- Categorical sampling: Drawing discrete tokens from a categorical distribution during decoding. "due to latent space compression and the absence of logits projection / categorical sampling"
- Consistency distillation: Training that distills continuous trajectories into a few-step generator via consistency-based objectives. "Consistency distillation further lowers the computational overhead of continuous diffusion,"
- Consistency model: A model trained to produce self-consistent outputs across different noise or time levels, enabling few-step sampling. "a consistency model distilling the learned prior into a few-step latent generative model."
- Cross-attention: An attention mechanism that conditions one sequence (or latent) on another by attending across them. "In the encoder, the latent variable is learned via cross-attention to the hidden state [...]."
- DDPM-style parameterization: The standard denoising diffusion probabilistic modeling approach for parameterizing reverse transitions. "Using a DDPM-style parameterization (Ho et al., 2020), MDLMs approximate the conditional denoiser"
- Denoiser: The neural network that predicts the clean signal (or noise) from a corrupted input at a given time in diffusion. "passed to a diffusion denoiser trained with the following ELBO:"
- Denoising Transformer: A Transformer used as the denoiser in diffusion to reconstruct clean tokens from corrupted inputs. "by a denoising Transformer with parameters [...]."
- Discrete diffusion models: Diffusion methods defined over categorical variables (tokens) rather than continuous signals. "discrete diffusion models cannot decode many tokens in parallel accurately,"
- Evidence lower-bound (ELBO): A variational objective that lower-bounds the log-likelihood and is used to train diffusion or auto-encoding components. "are trained using an evidence lower-bound (ELBO) on the model likelihood:"
- Flow-matching loss: An objective that trains a vector field to match target ODE flows, enabling few-step generative modeling. "with a learning rate of 5e-5 and 25% pure flow-matching loss ([...])."
- Forward masking kernel: The corruption process in masked diffusion that replaces tokens with [MASK] according to a schedule. "The forward masking kernel interpolates between the clean distribution at [...] and a categorical prior"
- γ-sampling: A sampling heuristic for few-step consistency/flow models controlled by parameter γ. "For few-step sampling with DiLaDiff, we use γ-sampling (Kim et al., 2024) with [...]"
- Generative perplexity (GenPPL): A metric computed by an external LM to assess the quality of generated text. "We report generative perplexity (GenPPL) using GPT2-Large"
- Instantaneous velocity: The time derivative of the trajectory at a specific time in the probability-flow ODE. "the average velocity converges to the instantaneous velocity [...]"
- Latent denoiser NFEs: The number of function evaluations of the latent denoiser required during ODE solving. "Because of the modified self-conditioning mechanism in DiLaDiff, one call of DiLaDiff requires two latent denoiser NFEs."
- Latent diffusion model: A diffusion model operating in a continuous latent space rather than directly in token space. "a latent diffusion model learning the prior over the encoder distribution;"
- Latent-guided diffusion model: A model where the latent variable provides global guidance to the token-space diffusion process. "our latent-guided diffusion model outperforms the masked diffusion baseline"
- Latent prior: The learned distribution over latent variables used at inference to generate before decoding to tokens. "We learn the latent prior with a continuous diffusion model,"
- Linear noise schedule: A schedule where the corruption (e.g., masking) level increases linearly over time. "The proportion of masked tokens at each time step is given by the commonly used linear noise schedule [...]."
- MAUVE: A divergence-frontier-based metric that measures the gap between generated and human text distributions. "we also report MAUVE (Pillutla et al., 2021),"
- MeanFlow: A few-step generative modeling method that learns average velocities along ODE trajectories. "In this work, we use MeanFlow (Geng et al., 2025) as it naturally allows multi-step generation"
- Nucleus filtering: Truncating a probability distribution to the smallest set of tokens whose cumulative mass exceeds p (top-p). "and nucleus filtering with [...]."
- Nucleus sampling: Sampling from the truncated (top-p) distribution to control diversity and quality. "and nucleus sampling (with a probability threshold of 0.9) is required to obtain reasonable sample quality"
- Perceiver-inspired encoder: An encoder architecture inspired by Perceiver that uses iterative attention for compression. "a Perceiver-inspired encoder (Jaegle et al., 2021)"
- Probability-flow ODE: The deterministic ODE whose solution shares marginals with the diffusion SDE, enabling ODE-based sampling. "the ability to distill the resulting probability-flow ODE trajectory"
- Reverse diffusion posterior: The conditional distribution used to denoise from a later to an earlier time in diffusion. "and progressively sampling the reverse diffusion posterior"
- Self-conditioning mechanism: Feeding previous model predictions back into the model to improve stability or efficiency. "Because of the modified self-conditioning mechanism in DiLaDiff, one call of DiLaDiff requires two latent denoiser NFEs."
- Self-distilled: A model distilled from its own teacher variant, reducing steps while retaining performance. "DiLaDiff: hybrid continuous-discrete diffusion with self-distilled latent."
- Tanh-logSNR variance-preserving noise schedule: A specific continuous diffusion schedule parameterized via a tanh transformation of log-SNR that preserves variance. "We use the parametric tanh-logSNR variance-preserving noise schedule"
- Temperature logits scaling: Adjusting logits by a temperature to control randomness and diversity at sampling time. "we observe the influence of temperature logits scaling and nucleus sampling"
- Token-wise marginals: Per-token marginal distributions assumed independent in certain diffusion approximations. "as a product of token-wise marginals,"
- Zero-initialized pointwise convolutional layers: Conditioning layers initialized at zero so that conditioning is learned stably from a neutral start. "wrapped in zero-initialized pointwise convolutional layers, to enable the decoder's hidden state to capture the latent information."
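Two of the entries above, temperature logits scaling and nucleus filtering, are simple enough to show directly. The sketch below is a generic NumPy version of both operations applied to a single logit vector; `nucleus_filter` is a name chosen here, and the example probabilities are made up for illustration.

```python
import numpy as np

def nucleus_filter(logits, temperature=1.0, top_p=0.9):
    """Return token probabilities after temperature scaling and top-p truncation."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(-probs)               # most likely tokens first
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    keep = min(int(np.searchsorted(cumulative, top_p)) + 1, probs.size)
    filtered = np.zeros_like(probs)
    filtered[order[:keep]] = probs[order[:keep]]
    return filtered / filtered.sum()          # renormalize the surviving tokens

# Example distribution over four tokens (illustrative values).
logits = np.log([0.5, 0.3, 0.15, 0.05])
default = nucleus_filter(logits, temperature=1.0, top_p=0.9)
cooler = nucleus_filter(logits, temperature=0.5, top_p=0.9)
print(np.count_nonzero(default), np.count_nonzero(cooler))
```

Lowering the temperature sharpens the distribution before truncation, so fewer tokens survive the filter; this is the mechanism behind the loss of diversity that the summary attributes to baselines at low temperature.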