Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion
Abstract: Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches over token embeddings have narrowed this gap, suggesting continuous state spaces are highly effective for language. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. Our approach represents semantic tokens as analog bit sequences and utilizes a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, automatically concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On the One Billion Word Benchmark (LM1B), our 130M-parameter bitstream model reaches a generative perplexity ($\mathrm{GenPPL}$) of $59.76$ at matched real-data entropy ($4.31$) using 256 neural function evaluations (NFEs), decisively outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our stochastic sampler establishes a new continuous-DLM Pareto frontier, achieving $\mathrm{GenPPL}=27.06$ at an entropy of $5.26$ using $4\times$ fewer steps than previous 1024-NFE baselines. As an additional architectural benefit, bitstream diffusion removes the $\mathcal{O}(V)$ vocabulary scaling bottleneck shared by standard DLMs. By predicting $\mathcal{O}(\log V)$ bitwise logits via semantic bit-patching, our model yields a reduced memory footprint and higher throughput, demonstrating a scalable paradigm for language generation as vocabulary sizes grow.
Explain it Like I'm 14
What is this paper about?
This paper shows a new way to make diffusion-based LLMs generate text that’s as good as, or even better than, text from the usual autoregressive models, while keeping some big advantages of diffusion. The key idea is to represent text as streams of bits (0s and 1s) and clean up noisy versions of those bits in parallel, using smart rules that add randomness only where it helps.
What questions were the authors trying to answer?
- Can a diffusion LLM (which edits all words at once) reach the text quality of autoregressive models (which write one word after another)?
- If we switch from predicting full words to predicting bits, can we make learning and generation faster and more memory‑friendly?
- Can we design a sampling method that adds randomness only where the model is unsure, so outputs are both high‑quality and diverse?
How did they do it? (Everyday explanations)
Here are the main parts of their approach, using simple analogies:
- Turning words into bit “barcodes”:
- Instead of predicting the next word from a huge list, the model turns each word into a fixed number of bits, like a barcode. For example, if the vocabulary has V words, each word becomes about log2(V) bits.
- The model starts from pure noise and gradually “unblurs” all bits for all positions in parallel. This is like sharpening a blurry image, except it’s for the bit barcodes that represent text.
- Predicting bits instead of whole words:
- Most LLMs must output one score for every word in the vocabulary (O(V)), which gets very heavy when V is large. This method only predicts O(log V) bit scores per word, which is much lighter.
- Think of it as deciding each yes/no switch in a word’s barcode rather than picking one item from a massive menu.
- Matched‑filter residual: split the problem into “easy” and “hard” parts:
- For a single bit that’s been noised, math can already tell you a very good baseline guess of that bit’s value. The model then focuses only on the extra “context” needed—how surrounding bits and words change that bit—rather than relearning the obvious.
- Analogy: a noise‑canceling headphone removes static automatically, and your brain fills in the sentence using context. The headphone = analytic baseline; your brain = the model’s learned “residual” correction.
- Smarter sampling with “entropy‑gated” randomness:
- Perplexity measures how surprised a good reader model would be by your text (lower is better). Entropy measures variety/diversity in word choice (higher means more diverse). Good text should be both accurate and not dull.
- A purely deterministic sampler tended to “play it safe,” giving lower variety than real text. So they add a small, controlled amount of randomness like gentle nudges.
- Crucially, they add this randomness only where the model is most uncertain (high‑information or high‑entropy parts), and keep things nearly deterministic where it’s confident. Analogy: when walking a maze, you explore more at confusing crossroads and walk straight when the path is obvious.
- They do this by building a step schedule (an “entropy‑rate grid”) that spends more effort and randomness at the tricky parts of the generation process.
- A few terms in everyday language:
- Diffusion model: starts from noise and repeatedly cleans it to produce text.
- NFE (Neural Function Evaluations): roughly, how many “clean‑up steps” the model takes. Fewer NFEs = faster sampling.
- Perplexity: how well another strong model can predict your text; lower means your text looks more realistic.
- Entropy: how varied your token choices are; matching real data’s entropy means you’re not just repeating the same common words.
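The barcode and matched-filter ideas above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes bits are mapped to +1/-1, that noise is added as x = b + sigma * eps (Gaussian corruption), and that each isolated bit has a uniform prior. All function names are hypothetical.

```python
import math
import random


def token_to_bits(token_id: int, width: int) -> list[int]:
    """Fixed-width code for a token ID, most significant bit first, as +1/-1 values."""
    if not 0 <= token_id < 2 ** width:
        raise ValueError("token_id does not fit in `width` bits")
    return [1 if (token_id >> k) & 1 else -1 for k in reversed(range(width))]


def bits_to_token(analog_bits) -> int:
    """Threshold each analog bit at zero and reassemble the integer token ID."""
    token_id = 0
    for x in analog_bits:
        token_id = (token_id << 1) | (1 if x > 0 else 0)
    return token_id


def matched_filter_logit(x: float, sigma: float) -> float:
    """Analytic log-odds that the clean bit is +1, given x = b + sigma * eps.

    Assumes a uniform prior on b in {-1, +1}; the Gaussian likelihood ratio
    exp(-(x - 1)^2 / (2 sigma^2)) / exp(-(x + 1)^2 / (2 sigma^2)) reduces to
    exp(2 x / sigma^2).
    """
    return 2.0 * x / sigma ** 2


def bit_posterior(x: float, sigma: float) -> float:
    """P(bit = +1 | x); the tanh form of the sigmoid avoids overflow for large logits."""
    return 0.5 * (1.0 + math.tanh(0.5 * matched_filter_logit(x, sigma)))


# Usage: noise one token's barcode, then read off the analytic baseline per bit.
rng = random.Random(0)
sigma = 0.8
clean = token_to_bits(50256, width=16)
noisy = [b + sigma * rng.gauss(0.0, 1.0) for b in clean]
baseline = [bit_posterior(x, sigma) for x in noisy]
print([round(p, 2) for p in baseline])
print(bits_to_token(clean))  # 50256, the noiseless round trip
```

In this picture the analytic baseline is the "headphone", and a trained network would only need to supply the contextual correction that gets added to `matched_filter_logit`.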
What did they find?
The authors test on two standard datasets and compare both text quality (perplexity) and variety (entropy):
- One Billion Word Benchmark (LM1B), 130M‑parameter model, 256 steps:
- Deterministic sampling: Generative Perplexity ≈ 82.9 at entropy ≈ 4.30.
- With entropy‑gated randomness: Generative Perplexity ≈ 59.8 at entropy ≈ 4.31 (very close to real data’s 4.31).
- This matches or beats the autoregressive baseline (≈ 66.7) and clearly beats prior diffusion models.
- OpenWebText (OWT), 256 steps:
- Deterministic sampling: Generative Perplexity ≈ 46.3 at entropy ≈ 5.13.
- With entropy‑gated randomness: Generative Perplexity ≈ 27.1 at entropy ≈ 5.26.
- This outperforms a strong recent continuous diffusion model (LangFlow), which used 4× more steps (1024) and still reached a higher (worse) perplexity.
- Big speed and memory wins from bits:
- Because they predict bits instead of full vocabulary scores, the model uses much less memory and runs faster:
- Training: up to about 2–3× faster and ~1.6–2.5× less memory, depending on dataset size.
- Generation: up to ~2–3× faster and as much as ~19× less memory at large scales.
- These gains grow as vocabularies and context lengths get bigger.
Why is this important? It shows diffusion LLMs don’t have to be worse than autoregressive models in quality, and they can be much more efficient in how they handle big vocabularies.
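To get a feel for the O(V) versus O(log V) difference, the short sketch below counts output scores per sequence for a token-level head and a bit-level head. The vocabulary size 50257 (the GPT-2 vocabulary) and sequence length 1024 are assumptions chosen for illustration, and the helper names are hypothetical. The count covers only the output layer, so it does not reproduce the end-to-end memory and speed figures reported above.

```python
def bit_width(vocab_size: int) -> int:
    """Smallest number of bits that can index every token ID in [0, vocab_size)."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    return max(1, (vocab_size - 1).bit_length())


def output_scores_per_sequence(vocab_size: int, seq_len: int) -> tuple[int, int]:
    """Return (token-level score count, bit-level score count) for one sequence."""
    return seq_len * vocab_size, seq_len * bit_width(vocab_size)


token_scores, bit_scores = output_scores_per_sequence(vocab_size=50257, seq_len=1024)
print(bit_width(50257))          # 16
print(token_scores, bit_scores)  # 51463168 16384
```

Doubling the vocabulary doubles the first number but adds only one bit per position to the second, which is why the gap widens as V grows.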
What does it all mean?
- Better quality without giving up diffusion’s strengths: This method helps diffusion models reach autoregressive‑level quality while keeping advantages like parallel editing of all positions and easy infilling.
- Scales to huge vocabularies and mixed data: Predicting O(log V) bits per token instead of O(V) word scores makes the approach more memory‑efficient and potentially ideal for very large, multilingual, or multimodal vocabularies.
- Smarter randomness is key: Adding randomness only where the model is uncertain improves both realism and diversity. This principle could help other generative models too.
- Practical note: Diffusion still takes many denoising steps, each a full pass over the whole sequence, so generation can be slower than autoregressive decoding. Future work on faster samplers or distillation could reduce this.
In short, the paper introduces a bit‑based, carefully sampled diffusion method that pushes diffusion LLMs much closer to, and sometimes beyond, the quality of standard autoregressive models, while also making them more efficient and scalable.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The paper advances continuous bitstream diffusion for language modeling, but it leaves several important questions unanswered. The following concrete gaps can guide future research:
- Scaling and generalization
- Does the reported Pareto frontier (GenPPL–entropy) persist at larger model sizes (e.g., >1B parameters), longer contexts (e.g., 8k–32k), and on broader corpora (e.g., The Pile, multilingual datasets), beyond LM1B and OWT?
- How robust are the gains across different tokenizers and domains (e.g., code, biomedical, low-resource languages), where token distributions and entropy profiles differ substantially?
- What is the run-to-run training variance (beyond sampling-seed variance) in performance and entropy when training from scratch?
- Evaluation methodology
- Reliance on “generative perplexity” measured by an external evaluator (gpt2-large) makes the results evaluator-dependent; how do they change with stronger or different evaluators, or with human evaluation (fluency, coherence, factuality)?
- The paper does not report likelihood or NLL bounds for this model; can one derive and compute a tractable likelihood estimator or tight bounds for fair comparison with AR and discrete DLMs?
- Are the conclusions stable under alternative diversity/quality metrics (e.g., MAUVE, distinct-n, self-BLEU, calibration of token probabilities), and in conditional tasks (prompted continuation, infilling, editing) that the method claims to support but does not evaluate?
- Bitstream representation and coding
- The matched-filter posterior assumes an isolated bit with a uniform prior; real bit marginals induced by token codes are not uniform. Would using non-uniform priors, learned priors, or code-aware matched filters improve accuracy and calibration?
- Impact of the code assignment: how sensitive are results to the specific fixed-width codebook (e.g., gpt2id_bpe16, raw BERT IDs), bit ordering within a token, or to alternative encodings (Gray codes, error-correcting codes, learned semantic codebooks)?
- Are there benefits to learning the bit mapping jointly with the model (end-to-end code learning) to better align bit geometry with semantic structure and reduce inter-bit dependencies?
- What is the error profile of the final thresholding step (bit → token) and are there decoding strategies (e.g., constrained decoding, error-correcting decoders, uncertainty-aware bit sampling) that reduce token errors without sacrificing diversity?
- Sampler design and theory
- The effectiveness of entropy-gated stochastic churn is shown empirically, but what are the theoretical guarantees under finite-step integration (bias/variance trade-offs, stability, convergence), and how do they compare to explicit reverse-SDE solvers?
- Sensitivity analysis for churn hyperparameters (S_noise, total churn budget, windowing) and grid construction is limited; can we devise principled or adaptive controllers that target a desired entropy/quality level automatically?
- The entropy-rate grid is estimated online via unweighted denoising errors as a proxy for conditional entropy. How accurate is this proxy across datasets and training stages, and can tighter estimators (e.g., I–MMSE-consistent or learned predictors) yield better schedules?
- Can the benefits of entropy-gated stochasticity be retained with far fewer NFEs via distilled samplers (e.g., consistency models, DPM distillation, 1–8 step samplers), and what is the best distillation target in bitstream space?
- Objective and parameterization
- Binary score matching outperforms cross-entropy in this setup, but is this result robust to different noise models (VE vs VP), alternative weightings, or hybrid objectives that directly optimize both bitwise scores and token-level likelihoods?
- The matched-filter residual parameterization is effective, yet its assumptions (Gaussian corruption, uniform-prior bit baseline, clipping with C=30) are heuristic. How sensitive are results to these choices, and can we design principled generalizations for non-Gaussian noise or non-uniform priors?
- Does the residual parameterization encourage the model to underuse global context (over-rely on the skip/local bit path)? Probing analyses are needed to quantify how much contextual information the trunk actually contributes.
- Comparative baselines and fairness
- The discrete bit-level baseline (SEDD-style) is weak; could stronger discrete methods (e.g., improved transition kernels, ratio estimators, or masking strategies tailored for bits) close the gap, isolating whether gains stem from continuity or from representation/architecture?
- Are comparisons to AR fully controlled for sampling temperature and entropy? Matching entropy with AR baselines (and reporting AR Pareto curves) would clarify relative trade-offs.
- How does the bitstream approach compare to token-space continuous DLMs that reduce O(V) via low-rank or MoE output heads, both in quality and in end-to-end systems metrics?
- Inference cost and practicality
- Despite O(log V) output scaling, inference remains iterative; what is the latency/throughput at application-relevant settings (e.g., near-AR latency)? How well do caching, partial denoising, or reuse of intermediate states work in the bitstream setting?
- What is the cost–quality curve under aggressive NFE reduction (e.g., ≤32 steps) and can the method maintain its GenPPL–entropy frontier in low-latency regimes?
- Capabilities beyond unconditional generation
- Although diffusion enables parallel editing/infilling, the paper does not demonstrate infilling, masked editing, or constrained generation; what adaptations (conditioning interfaces, boundary handling) are needed and how does performance compare to AR and discrete DLMs?
- How does the model handle long-form coherence, discourse structure, and factual consistency over long contexts relative to AR baselines?
- Robustness, safety, and calibration
- Are bitwise probabilities and resulting token distributions well-calibrated across noise levels and positions? Calibration diagnostics (e.g., Brier score, ECE) are not reported.
- Robustness to domain shift, adversarial or noisy inputs, and stability under self-conditioning carry mode are not evaluated; does carry-mode amplify errors or collapse diversity in edge cases?
- Systems and scaling behavior
- The systems gains are measured on synthetic batches; do they hold in real training pipelines with dataloading, tokenization, decoding, and multi-GPU/distributed setups, and under mixed precision or quantization?
- As T and V grow further (e.g., unified multilingual/multimodal vocabularies), where does the next bottleneck emerge (trunk compute, attention memory), and how does the bitstream approach interact with sparse or linear attention variants?
These open questions suggest concrete next steps: scaling studies; alternative code designs and learned encodings; principled, adaptive stochastic samplers; likelihood estimation; infilling and conditional evaluations; calibration and robustness analysis; and integration with accelerated few-step or consistency-based generation.
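Several of the sampler questions above concern how the entropy-rate grid is built. As a rough sketch of the general idea, the code below places noise levels uniformly in the CDF of a tabulated density over log-noise, using a trapezoidal integral and piecewise-linear inversion. The function name, the tabulated `rates` profile, and the interpolation scheme are all assumptions for illustration; the paper's estimator and grid construction may differ in detail.

```python
import math
from bisect import bisect_left


def entropy_rate_grid(log_sigmas, rates, num_steps):
    """Place num_steps + 1 noise levels uniformly in the CDF of a tabulated density.

    log_sigmas: increasing log-noise values where the density was measured.
    rates: positive density values (e.g. an entropy-rate proxy) at those points.
    Returns sigmas ordered from high noise to low noise.
    """
    # Trapezoidal cumulative integral of the density over log-sigma.
    cdf = [0.0]
    for i in range(1, len(log_sigmas)):
        width = log_sigmas[i] - log_sigmas[i - 1]
        cdf.append(cdf[-1] + 0.5 * (rates[i] + rates[i - 1]) * width)
    total = cdf[-1]
    if total <= 0.0:
        raise ValueError("density must have positive total mass")
    cdf = [c / total for c in cdf]

    grid = []
    for k in range(num_steps + 1):
        q = k / num_steps  # target quantile
        # Find the table interval containing q, then interpolate linearly inside it.
        j = min(max(bisect_left(cdf, q), 1), len(cdf) - 1)
        lo, hi = cdf[j - 1], cdf[j]
        t = 0.0 if hi == lo else (q - lo) / (hi - lo)
        u = log_sigmas[j - 1] + t * (log_sigmas[j] - log_sigmas[j - 1])
        grid.append(math.exp(u))
    return grid[::-1]


# Hypothetical profile peaked near sigma = 1: more grid points land there.
log_sigmas = [math.log(s) for s in (0.01, 0.1, 1.0, 10.0, 100.0)]
rates = [0.05, 0.4, 1.0, 0.4, 0.05]
grid = entropy_rate_grid(log_sigmas, rates, num_steps=8)
print([round(s, 3) for s in grid])
```

A flat `rates` profile recovers a grid that is uniform in log-sigma, so any gain from this construction comes entirely from how well the tabulated density tracks where information is actually resolved.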
Practical Applications
Below is a concise mapping from the paper’s methods and empirical findings to real-world applications. Each item notes plausible sectors, potential tools/workflows, and key assumptions or dependencies.
Immediate Applications
These can be piloted or deployed now with moderate engineering effort, leveraging the paper’s released design patterns and measured systems gains.
- Bitstream output head to remove the O(V) vocabulary bottleneck
- Sectors: software infrastructure, model training/inference providers, academia
- What: Replace token-level logits with O(log V) bitwise logits via semantic bit-patching, yielding large memory and throughput improvements (e.g., up to ~19× lower generation VRAM and ≥2× throughput in tested settings).
- Tools/workflows:
- Drop-in “bitstream head” module for sequence diffusion transformers
- Reversible codecs (e.g., gpt2id_bpe16) to map tokenizer IDs to fixed-width codes
- Training recipe with matched-filter residual parameterization and self-conditioning
- Assumptions/dependencies:
- Requires a reversible, fixed-width code mapping for the target tokenizer(s)
- Integration effort to patch bits per token and modify the loss/output stack
- Maintains iterative denoising at inference (latency depends on NFE)
- Entropy-gated stochastic sampling as a test-time upgrade
- Sectors: model serving, research labs, applied ML teams
- What: Improved quality–diversity frontier by applying full-band EDM churn on an entropy-rate grid, without retraining or increasing NFE.
- Tools/workflows:
- Add churn to existing continuous DLM samplers (PyTorch/JAX samplers)
- Online or offline estimation of the entropy-rate schedule for the sigma grid
- Expose “creativity” control via churn strength as an end-user knob
- Assumptions/dependencies:
- Works for continuous DLMs with accessible scores/posteriors
- Gains depend on using an entropy-aligned sigma grid; naive churn is less effective
- More memory- and cost-efficient long-context generation
- Sectors: enterprise AI (document processing), education, research compute
- What: Serve larger batches or longer contexts (e.g., T=1024) on the same hardware, reducing cost per token and improving utilization.
- Tools/workflows:
- Batch schedulers tuned for diffusion generation
- Consolidated serving stacks for high-throughput, batched non-AR generation
- Assumptions/dependencies:
- Users accept iterative denoising latency (e.g., ~256 NFEs); suitable for batch/async workloads
- Parallel refinement for editing and infilling
- Sectors: productivity software, IDEs/code tooling, collaborative editors
- What: Non-autoregressive parallel updates across positions enable arbitrary infilling, multi-span edits, and order-agnostic refinement workflows.
- Tools/workflows:
- UI for parallel edits with iterative refinement cycles
- Server-side samplers with carry-mode self-conditioning
- Assumptions/dependencies:
- Requires conditioning mechanisms for masked/known segments
- Latency tolerance for multi-step refinement (best in non-real-time settings)
- Research democratization via smaller hardware footprints
- Sectors: academia, startups, education
- What: Train and experiment with continuous DLMs on commodity GPUs thanks to the reduced output boundary and throughput improvements.
- Tools/workflows:
- Open-source reference implementations of bitstream diffusion (head + schedule + sampler)
- Evaluation pipelines tracking GenPPL jointly with token entropy
- Assumptions/dependencies:
- Results demonstrated at ~130M parameters; behavior at larger scales is an open question
- Better evaluation practice to discourage entropy collapse
- Sectors: research methodology, benchmarking orgs, policy/standards
- What: Jointly report generative perplexity and token-frequency entropy to detect “safe token” collapse and compare models on a Pareto frontier.
- Tools/workflows:
- Benchmark scripts computing external GenPPL (e.g., with GPT-2 large) and per-sample unigram entropy
- Assumptions/dependencies:
- Requires agreement on evaluation protocols and reference tokenizers
- Bitstream formulation for multilingual prototypes
- Sectors: multilingual NLP research, low-resource language tech
- What: Build prototypes with large or composite vocabularies (multiple languages) without the O(V) penalty.
- Tools/workflows:
- Large-vocabulary reversible codecs spanning multiple token sets
- Bit-patched SDT trunk with shared head
- Assumptions/dependencies:
- Codec design must preserve fidelity across languages and special tokens
- Quality–diversity “knob” for creative applications
- Sectors: media/marketing, content tools, game narrative
- What: Surface churn strength as a user-facing creativity control, balancing novelty and coherence without retraining.
- Tools/workflows:
- UX slider mapped to S_churn with presets aligned to target entropy bands
- Assumptions/dependencies:
- Requires monitoring to avoid excessive randomness degrading quality
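The churn-based items above rely on the stochastic "churn" step from EDM (Karras et al., 2022): briefly raise the noise level, inject matching fresh noise, then take a deterministic step down. The sketch below shows that generic step with a plain Euler update, applied at every noise level. It is not the paper's exact sampler: the toy denoiser, the short sigma schedule, and the function names are assumptions for illustration, while `s_churn` and `s_noise` correspond to S_churn and S_noise.

```python
import math
import random


def churn_euler_step(x, sigma, sigma_next, denoise, s_churn, num_steps,
                     s_noise=1.0, rng=random):
    """One EDM-style stochastic step from noise level sigma down to sigma_next.

    x: list of floats (the current noisy state).
    denoise(x, sigma): returns an estimate of the clean signal.
    """
    # Temporarily raise the noise level by a factor (1 + gamma).
    gamma = min(s_churn / num_steps, math.sqrt(2.0) - 1.0)
    sigma_hat = sigma * (1.0 + gamma)
    # With s_noise = 1 this adds exactly enough noise to move from sigma to sigma_hat.
    extra_std = s_noise * math.sqrt(sigma_hat ** 2 - sigma ** 2)
    x_hat = [xi + extra_std * rng.gauss(0.0, 1.0) for xi in x]
    # Deterministic Euler step along the probability-flow direction.
    x0 = denoise(x_hat, sigma_hat)
    slope = [(xh - c) / sigma_hat for xh, c in zip(x_hat, x0)]
    return [xh + (sigma_next - sigma_hat) * s for xh, s in zip(x_hat, slope)]


def toy_denoiser(x, sigma):
    """Posterior mean of an isolated +/-1 bit with a uniform prior (stand-in for a network)."""
    return [math.tanh(xi / sigma ** 2) for xi in x]


# Usage: a short hypothetical schedule; larger s_churn means more injected randomness.
rng = random.Random(0)
sigmas = [4.0, 2.0, 1.0, 0.5, 0.25, 0.0]
x = [sigmas[0] * rng.gauss(0.0, 1.0) for _ in range(8)]
for s, s_next in zip(sigmas[:-1], sigmas[1:]):
    x = churn_euler_step(x, s, s_next, toy_denoiser, s_churn=2.0, num_steps=5, rng=rng)
print([round(v, 3) for v in x])
```

Setting `s_churn=0` recovers the deterministic sampler, which is what makes churn strength a natural single "creativity" control at serving time.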
Long-Term Applications
These depend on further research, scaling, or engineering (e.g., step reduction, distillation, multimodal codebooks).
- General-purpose diffusion LMs rivaling autoregressive models at scale
- Sectors: general AI platforms, cloud providers
- What: Close the AR gap while exploiting parallel refinement, with competitive or superior quality at large model sizes.
- Tools/workflows:
- Step-reduction via flow/consistency distillation or one-/few-step methods
- KV/state caching analogs for diffusion decoders; specialized kernels
- Assumptions/dependencies:
- Demonstrated at 130M params; scaling to multi-billion parameters is unproven
- Latency must approach AR via aggressive distillation/solver advances
- Unified multimodal generators with logarithmic output scaling
- Sectors: multimodal AI (text–image–audio–video)
- What: Shared bitstream interface across modalities sidesteps per-modality large vocabularies, easing training and serving.
- Tools/workflows:
- Reversible codebooks for modality-specific tokens (VQ/image tokens, audio units)
- Cross-modal patching and shared SDT trunks
- Assumptions/dependencies:
- Requires robust, information-preserving codecs and alignment across modalities
- On-device or edge diffusion generation
- Sectors: mobile, embedded, privacy-preserving AI
- What: Memory savings plus step-reduced samplers enable on-device parallel editing/generation for privacy-critical settings.
- Tools/workflows:
- Few-step student models distilled from bitstream teachers
- Mobile-optimized kernels for bitwise heads and samplers
- Assumptions/dependencies:
- Iterative denoising must be reduced substantially (e.g., <10 steps)
- Thermal/energy constraints and hardware acceleration support
- Enterprise long-context assistants with massive composite vocabularies
- Sectors: legal, finance, healthcare, scientific R&D
- What: Train assistants over domain-specific terminologies and documents (large V) without exploding the output boundary.
- Tools/workflows:
- Organization-specific reversible vocab codecs
- Retrieval + non-AR infilling pipelines for large-document workflows
- Assumptions/dependencies:
- Must validate domain safety/reliability; human evaluation beyond GenPPL
- Risk- and safety-aware stochastic controllers
- Sectors: regulated industries (healthcare, finance), safety-critical documentation
- What: Use entropy-gated stochasticity to modulate uncertainty in “information-active” regions, aiming to reduce mode collapse or manage hallucination risk.
- Tools/workflows:
- Runtime policies that adjust churn based on uncertainty thresholds
- Monitoring of entropy profiles during generation
- Assumptions/dependencies:
- Requires empirical validation linking entropy profiles to downstream risk
- Communication and coding theory-inspired decoders
- Sectors: communications, storage, error correction
- What: Leverage matched-filter residuals: combine analytic local channel posteriors with learned contextual residuals for structured decoding.
- Tools/workflows:
- Learned decoders for noisy bit channels with contextual constraints
- Assumptions/dependencies:
- Needs adaptation to true channel models and latency constraints in comms
- Hybrid AR–diffusion pipelines
- Sectors: software tooling, content generation
- What: Use diffusion over bitstreams to propose/edit spans in parallel, with an AR verifier/refiner for final tokenization or safety filtering.
- Tools/workflows:
- Two-stage pipelines (diffusion propose → AR check/refine)
- Interfaces for constrained decoding (e.g., fixed bits for known spans)
- Assumptions/dependencies:
- Engineering overhead and added latency; benefits must outweigh complexity
- Sustainability and policy impact at scale
- Sectors: cloud sustainability, AI governance, ESG reporting
- What: Lower memory footprints and higher batch utilization reduce energy per token when step counts are amortized at scale.
- Tools/workflows:
- Energy/performance dashboards tracking NFE, batch size, VRAM, throughput
- Procurement guidance favoring architectures with logarithmic output scaling
- Assumptions/dependencies:
- Net energy savings depend on step count vs. AR baselines; requires rigorous measurement in production setups
Notes on feasibility across applications:
- The bitstream approach depends critically on reliable, reversible token-to-bit codecs. Quality and robustness of these codecs will affect performance in specialized domains and across languages.
- Iterative denoising remains a latency bottleneck for interactive scenarios; step-reduction (via distillation/consistency models) is the key dependency for many long-term applications.
- The reported quality gains are at moderate scale. Demonstrating persistence of the GenPPL–entropy frontier at larger model sizes and broader tasks will be necessary for widespread adoption.
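Because several items above hinge on reporting quality and diversity together, here is a small sketch of the two bookkeeping pieces: a per-sample unigram entropy and a Pareto filter over (GenPPL, entropy) pairs. Two choices are assumptions rather than the paper's protocol: the entropy is computed in nats, and the filter treats lower GenPPL and higher entropy as better, which is a simplification because the real goal is to match real-data entropy. GenPPL itself needs an external evaluator model and is not computed here.

```python
import math
from collections import Counter


def unigram_entropy(token_ids) -> float:
    """Shannon entropy (nats) of the empirical token-frequency distribution of one sample."""
    n = len(token_ids)
    if n == 0:
        raise ValueError("sample must contain at least one token")
    counts = Counter(token_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def pareto_frontier(points):
    """Keep (gen_ppl, entropy) pairs that no other pair dominates.

    One pair dominates another if its GenPPL is no higher and its entropy is
    no lower, with at least one of the two strictly better.
    """
    frontier = []
    for i, (ppl, ent) in enumerate(points):
        dominated = any(
            p2 <= ppl and e2 >= ent and (p2 < ppl or e2 > ent)
            for j, (p2, e2) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((ppl, ent))
    return sorted(frontier)


# The two LM1B operating points quoted in this summary (deterministic vs. entropy-gated).
lm1b_points = [(82.9, 4.30), (59.8, 4.31)]
print(pareto_frontier(lm1b_points))       # [(59.8, 4.31)]
print(unigram_entropy([1, 1, 2, 2]))      # log(2), about 0.693
```

A sample that repeats one "safe" token has entropy zero, so tracking this number next to GenPPL makes that kind of collapse visible.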
Glossary
- Absorbing discrete diffusion: A discrete diffusion process in which symbols are progressively replaced by a special absorbing (mask) state that they never leave, so denoising amounts to filling in the absorbed positions. "We evaluate this option using a SEDD-style absorbing discrete diffusion baseline [Lou et al., 2024] on LM1B bitstreams."
- AdaLN-zero: A zero-initialized adaptive LayerNorm conditioning mechanism used to inject timestep or conditioning signals. "RoPE, AdaLN-zero time conditioning, SwiGLU activations, FlashAttention/SDPA kernels when available, dropout 0.1, and BF16 training."
- Analog Bits: A method that represents discrete variables as binary bits to enable continuous diffusion over “analog” bit values. "Analog Bits [Chen et al., 2023] represents discrete variables as binary bits and trains continuous diffusion models on analog versions of those bits; it also introduced self-conditioning and asymmetric time intervals, both of which influence our design."
- Asymmetric time-interval label shift: A sampling trick that evaluates the denoiser at a slightly noisier time label to improve low-step performance. "We also support the asymmetric time-interval label shift of Analog Bits [Chen et al., 2023], which evaluates the denoiser at a slightly noisier time label and can help in some low-NFE deterministic regimes."
- Autoregressive: A sequential generation paradigm that predicts each token conditioned on previously generated tokens. "Autoregressive LLMs dominate modern text generation because they define a simple factorization and scale reliably."
- Bregman divergence: A family of divergences generated by a strictly convex function, with squared Euclidean distance and KL divergence as special cases; here it connects embedding-space DLMs to flow matching. "LangFlow connects embedding-space DLMs to flow matching via Bregman divergence, introduces an ODE-based NLL bound, proposes an information-uniform noise-scheduling principle, and shows that self-conditioning improves continuous DLMs."
- Bitstream Diffusion: Modeling language as diffusion over fixed-width binary bitstreams rather than token embeddings. "Figure 1: End-to-end Bitstream Diffusion architecture."
- CANDI: A hybrid discrete–continuous diffusion framework for categorical data. "CANDI [Pynadath et al., 2025] explores hybrid discrete-continuous diffusion."
- Carry-mode self-conditioning: Feeding the model’s previous denoised prediction into the next step at sampling time. "At sampling time, we use carry-mode self-conditioning, where the previous denoised prediction is fed into the next denoising step."
- CDF (cumulative distribution function): The cumulative probability function; here used to place sigma-grid points uniformly in a target density. "For a grid uniform in the CDF of Ta (u), local spacing satisfies"
- D3PM: A structured denoising diffusion model for discrete state spaces. "D3PM [Austin et al., 2021] introduced structured discrete denoising diffusion,"
- DDIM: A deterministic diffusion sampler (Denoising Diffusion Implicit Models) that integrates the probability-flow ODE. "We use DDIM-style sampling in the main experiments and support Heun correction in the codebase."
- EDM (Elucidated Diffusion Models): A family of diffusion design choices (noise schedules, churn) and loss weighting. "We adopt an EDM-style stochastic churn [Karras et al., 2022a]."
- Entropic integration viewpoint: Constructing sigma grids by inverting an entropy-informed density for more informative step allocation. "For sampling, we construct a sigma grid by approximately inverting the CDF of Ta, following the entropic integration viewpoint of Stancevic et al. [2025]."
- Entropic time warping: A time-change that redistributes solver steps according to information/entropy to improve sampling efficiency. "A key component of our method is the use of entropic time warping, introduced in [Dieleman et al., 2022] for softmax models and generalized in [Stancevic et al., 2025] to arbitrary continuous diffusion models."
- Entropy-band stochastic sampling: Adding stochasticity specifically within entropy-active regions during sampling. "Our contribution is closest in representation to Analog Bits but differs in scale, architecture, language-focused evaluation, matched-filter residual parameterization, entropy-rate scheduling, and entropy-band stochastic sampling."
- Entropy-CDF windows: Intervals in the entropy CDF used to localize where stochastic correction is applied. "Narrow entropy-CDF windows are sensitive to their location, whereas broad windows are consistently stronger;"
- Entropy-gated: Stochastic corrections whose effective strength is automatically modulated by the entropy profile. "This is the mechanism behind the title phrase "entropy-gated.""
- Entropy-rate adaptive noise allocation: A training-time strategy that allocates noise levels according to the measured rate of entropy production. "3.4. Entropy-rate adaptive noise allocation"
- Entropy-rate grid: A sigma sampling grid derived from the entropy-rate density to focus steps where information is resolved. "When applied on an entropy-rate sampling grid, full-band churn improves the GenPPL–entropy frontier without changing the trained model or increasing the NFE budget."
- Entropy-rate profile: The rate at which the forward process destroys information per unit log-noise. "Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile,"
- FlashAttention: An optimized attention algorithm/kernel that reduces memory and improves throughput. "RoPE, AdaLN-zero time conditioning, SwiGLU activations, FlashAttention/SDPA kernels when available, dropout 0.1, and BF16 training."
- Flow matching: A framework that trains a vector field so that its ODE transports noise to data. "LangFlow connects embedding-space DLMs to flow matching via Bregman divergence,"
- Gaussian posterior-mean identity: An identity that relates posterior means to the score in Gaussian corruption models. "These probabilities induce a continuous score estimate through the Gaussian posterior-mean identity"
- Generative perplexity (GenPPL): An external-LM-based metric for sample quality; lower is better. "they often yield weak sample quality, or achieve artificially low generative perplexity (GenPPL) only by over-generating safe, frequent tokens, thereby collapsing sample entropy."
- Heun correction: A second-order numerical correction (predictor–corrector) used for ODE/SDE solvers. "We use DDIM-style sampling in the main experiments and support Heun correction in the codebase."
- I-MMSE relation: A relationship connecting mutual information and minimum mean-squared error, used to estimate entropy rates. "By the I-MMSE relation and denoising-score-matching identities, this rate is related to a noise-rescaled denoising error."
- Karras grid: A sigma schedule/grid proposed in EDM that often improves diffusion sampling. "The entropy-rate grid improves deterministic sampling relative to the Karras grid and makes stochastic churn substantially more effective (Figure 2)."
- Langevin correction: A stochastic update akin to Langevin dynamics that corrects over-contraction in deterministic flows. "the churn update behaves like a probability-flow step plus a local Langevin correction with effective reverse-SDE strength"
- Matched-filter residual parameterization: Adding an analytic independent-bit posterior logit (matched filter) to a learned contextual residual. "we introduce a matched-filter residual parameterization: the network analytically computes the independent-bit posterior and focuses its capacity entirely on predicting the contextual residual."
- MDLM: Masked Diffusion Language Model, a discrete diffusion baseline for language. "MDLM [Sahoo et al., 2024] showed that masked diffusion language modeling can be substantially strengthened with a simplified objective and improved training recipe."
- Neural function evaluations (NFEs): The number of denoiser network calls required during sampling. "using 256 neural function evaluations (NFEs)"
- ODE-based NLL bound: A bound on negative log-likelihood derived from the probability-flow ODE. "LangFlow connects embedding-space DLMs to flow matching via Bregman divergence, introduces an ODE-based NLL bound,"
- Pareto frontier: The set of non-dominated trade-offs between competing objectives (here, GenPPL and entropy). "our stochastic sampler establishes a new continuous-DLM Pareto frontier,"
- Probability-flow ODE: The deterministic ODE whose trajectories share the same time-marginal distributions as the corresponding diffusion SDE. "The deterministic sampler integrates the probability-flow ODE induced by Equation (2)."
- Probability-flow sampler: A deterministic sampler that integrates the probability-flow ODE instead of simulating the SDE. "The deterministic probability-flow sampler is already competitive with recent continuous DLMs, but it is over-contractive:"
- RDLM: Riemannian Diffusion Language Model, a continuous approach that uses statistical-manifold geometry for categorical modeling. "Riemannian Diffusion LLMs (RDLM) model categorical distributions using statistical-manifold geometry [Jo and Hwang, 2025]."
- RoPE: Rotary positional embeddings, a positional encoding scheme for transformers. "We use local Fourier features within each bit patch, RoPE in the Transformer trunk, and no absolute global Fourier features in the final configuration."
- Score matching: A training objective that fits the score (gradient of log-density), here implemented as denoising MSE. "In our binary setting, score matching is natural because D_θ directly defines the continuous score in Equation (2)."
- SEDD: Score Entropy Discrete Diffusion, a discrete diffusion approach using ratio estimation and score entropy. "SEDD [Lou et al., 2023] framed discrete diffusion through ratio estimation and score entropy."
- Self-conditioning: Feeding the model’s own previous denoised output back as an input to stabilize and improve training/sampling. "We use self-conditioning by default."
- Semantic bit-patching: Grouping m bits per token into a patch to operate the transformer at semantic length T. "By predicting O(log V) bitwise logits via semantic bit-patching, our model yields a reduced memory footprint and higher throughput,"
- Sequence Diffusion Transformer (SDT): The transformer denoiser used for sequence-level diffusion over bit patches. "our sequence diffusion transformer (SDT) preserves the semantic context length T while replacing the dense vocabulary classifier with a compact O(log V) bitwise head."
- Simplex: The probability simplex; a geometric representation of categorical distributions used by some continuous DLMs. "Simplex, one-hot, and discrete-transition models fundamentally require O(V) output parameterizations per token."
- Stochastic churn: An EDM sampling technique that briefly increases noise before each step to add controlled stochasticity. "We adopt an EDM-style stochastic churn [Karras et al., 2022a]."
- SwiGLU: A gated activation function variant used in transformer feed-forward layers. "RoPE, AdaLN-zero time conditioning, SwiGLU activations, FlashAttention/SDPA kernels when available, dropout 0.1, and BF16 training."
- Token-frequency entropy: The unigram entropy of generated tokens, used to detect collapse or over-contraction. "we evaluate GenPPL jointly with token-frequency entropy."
- Variance-exploding Gaussian corruption: A forward diffusion process where noise variance increases with time (sigma). "As forward process, we use a variance-exploding Gaussian corruption model"
- Vocabulary-sized output boundary: An output head that scales with vocabulary size V (O(TV)), creating memory and compute bottlenecks. "This representation removes the vocabulary-sized output boundary used by most DLMs."
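
Several of the terms above (analog bits, variance-exploding corruption, the matched-filter logit, the Gaussian posterior-mean identity, and token-frequency entropy) can be illustrated together in a small NumPy sketch. This is a toy illustration, not the paper's implementation: the ±1 bit coding, the uniform prior on bits, the least-significant-bit-first ordering, the use of nats, and every function name below are assumptions made for the example.

```python
import numpy as np

def tokens_to_bits(tokens, m):
    """Map integer token ids to m analog bits in {-1, +1} (LSB first; assumed coding)."""
    tokens = np.asarray(tokens)
    bits01 = (tokens[..., None] >> np.arange(m)) & 1
    return 2.0 * bits01 - 1.0

def bits_to_tokens(bits):
    """Invert tokens_to_bits by thresholding each analog bit at zero."""
    bits01 = (np.asarray(bits) > 0).astype(np.int64)
    return (bits01 << np.arange(bits01.shape[-1])).sum(axis=-1)

def ve_corrupt(x, sigma, rng):
    """Variance-exploding Gaussian corruption: z = x + sigma * eps."""
    return x + sigma * rng.standard_normal(x.shape)

def matched_filter_logit(z, sigma):
    """Log-odds of bit = +1 versus bit = -1 given z, for independent bits
    with a uniform prior (assumption): log N(z; +1, s^2) - log N(z; -1, s^2)."""
    return 2.0 * z / sigma**2

def posterior_mean(logit):
    """E[bit | z] for a +/-1 bit with the given log-odds: 2*sigmoid(l) - 1."""
    return np.tanh(0.5 * logit)

def score_from_mean(z, x_hat, sigma):
    """Gaussian posterior-mean identity: score(z) = (E[x | z] - z) / sigma^2."""
    return (x_hat - z) / sigma**2

def token_frequency_entropy(tokens):
    """Unigram entropy (in nats) of the empirical token distribution."""
    _, counts = np.unique(np.asarray(tokens), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = np.array([0, 5, 13, 5])
    x = tokens_to_bits(tokens, m=4)            # shape (4 tokens, 4 bits)
    sigma = 0.5
    z = ve_corrupt(x, sigma, rng)
    # Residual parameterization: analytic logit plus a contextual residual.
    # A trained network would supply the residual; zeros stand in for it here.
    residual = np.zeros_like(z)
    logit = matched_filter_logit(z, sigma) + residual
    x_hat = posterior_mean(logit)
    print("score shape:", score_from_mean(z, x_hat, sigma).shape)
    print("decoded tokens:", bits_to_tokens(x_hat))
    print("token entropy (nats):", token_frequency_entropy(tokens))
```

With a zero residual the denoiser reduces to the independent-bit posterior, which is the baseline that the matched-filter residual parameterization is described as isolating; the output head has m values per token rather than V, which is the O(log V) point made in the entries above.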