NeoDiff Framework: Non-Simultaneous Diffusion
- NeoDiff is a unified text diffusion framework that combines discrete token corruption with continuous noise perturbation for fine-grained control.
- It employs a bi-temporal diffusion process with extrinsic and per-token intrinsic times, driven by a stochastic Poisson process to guide noising and denoising.
- Empirical evaluations demonstrate improved BLEU scores and generation quality across translation and paraphrasing tasks with efficient Bayesian schedule optimization.
NeoDiff—Non-simultaneous Continuous Diffusion Models—constitutes a unified framework for text generation via diffusion, integrating the flexibility of discrete token-level corruption with the fine granularity of continuous noise perturbation. The framework is characterized by a bi-temporal diffusion process wherein each token receives a stochastic, Poisson-driven amount of noise independently, yet the noising and denoising remain mathematically coupled. It introduces per-token intrinsic time, systematizes the denoising schedule via a learned time predictor, and optimizes inference by Bayesian selection of extrinsic diffusion times. Experimental evaluations demonstrate that NeoDiff achieves superior or comparable generation quality across machine translation, paraphrasing, simplification, and question generation benchmarks, while incurring only modest computational overhead relative to relevant baselines (Li et al., 28 May 2025).
1. Conceptual Foundations and Motivation
Recent advances in diffusion models for text generation center around discrete and continuous paradigms. Discrete diffusion corrupts each token via categorical processes such as masking or replacement, enabling tokenwise independence but producing discontinuous transitions. Continuous diffusion maps tokens to real-valued vectors and perturbs them by additive Gaussian noise, permitting fine-grained control but enforcing uniformity in noise application across all tokens at each diffusion time step. NeoDiff was constructed to resolve the dichotomy: it aims to combine per-token independence (as in discrete models) with continuous, fine-grained noising (as in continuous models) without the coarse limitations of previous approaches.
NeoDiff achieves this integration by introducing a bi-temporal system, with extrinsic time controlling the overall diffusion process, and per-token intrinsic time governed by a Poisson arrival process. This framework enables non-simultaneous, tokenwise corruption of varying magnitudes, thus allowing for both independence and inter-token semantic interaction during denoising (Li et al., 28 May 2025).
2. Forward (Noising) Process
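The NeoDiff forward process extends standard diffusion by jointly modeling token embeddings and their intrinsic noise stages. For each token, an intrinsic state $s_t$ increments in unit steps following a Poisson process with rate $\lambda(t)$:
$$P(s_{t+\Delta t} - s_t = 1) = \lambda(t)\,\Delta t + o(\Delta t).$$
In the infinitesimal limit, $s_t \sim \mathrm{Poisson}(\Lambda(t))$ with $\Lambda(t) = \int_0^t \lambda(u)\,du$. Per-token intrinsic times are computed as $\tau_t = s_t / s_{\max}$ and rescaled by a variance-controlled mechanism to preserve stepped discreteness for arbitrary $s_{\max}$. The forward (noising) distribution is factorized as:
$$q(z_t, \tau_t \mid z_0, t) = q(z_t \mid z_0, \tau_t)\, q(\tau_t \mid t),$$
where $q(z_t \mid z_0, \tau_t)$ denotes Gaussian noising:
$$q(z_t \mid z_0, \tau_t) = \mathcal{N}\big(z_t;\ \sqrt{\bar\alpha(\tau_t)}\, z_0,\ \bar\beta(\tau_t)\, I\big),$$
with monotonic schedules satisfying $\bar\alpha(0) = 1$, $\bar\alpha(1) = 0$, $\bar\beta(0) = 0$, $\bar\beta(1) = 1$.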
This process enables each token to progress along the diffusion path at a stochastically different, yet variance-stabilized, pace, preserving independent trajectories while retaining continuous, fine-level control (Li et al., 28 May 2025).
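The forward process described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a linear cumulative Poisson rate (mean count `s_max * t`), simple clipping at 1 as a stand-in for the variance-controlled rescaling, and a linear variance-preserving noise schedule. The names `sample_poisson`, `intrinsic_times`, `noise_token`, and `s_max` are hypothetical.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw a Poisson count by counting unit-rate exponential arrivals in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def intrinsic_times(num_tokens, t, s_max, rng):
    """Per-token intrinsic time tau = min(s / s_max, 1), with s ~ Poisson(s_max * t).

    The linear mean s_max * t and the hard clip are assumptions for illustration.
    """
    return [min(sample_poisson(s_max * t, rng) / s_max, 1.0) for _ in range(num_tokens)]

def noise_token(z0, tau, rng):
    """Gaussian noising of one embedding at intrinsic time tau (assumed linear schedule)."""
    alpha_bar = 1.0 - tau   # signal weight: 1 at tau = 0, 0 at tau = 1
    beta_bar = tau          # noise variance: 0 at tau = 0, 1 at tau = 1
    return [math.sqrt(alpha_bar) * x + math.sqrt(beta_bar) * rng.gauss(0.0, 1.0)
            for x in z0]

rng = random.Random(0)
taus = intrinsic_times(num_tokens=5, t=0.5, s_max=10, rng=rng)
noised = [noise_token([1.0, -2.0], tau, rng) for tau in taus]
```

At the same extrinsic time `t`, each token draws its own `tau`, so some tokens are still nearly clean while others are already heavily corrupted. This is the non-simultaneous behavior the section describes.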
3. Reverse (Denoising) Process and Time Predictor
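The generative (reverse) process is formulated over a decreasing sequence of extrinsic times $1 = t_T > t_{T-1} > \dots > t_0 = 0$ as:
$$p_\theta(z_{t_0:t_T}, \tau_{t_0:t_T}) = p(z_{t_T}, \tau_{t_T}) \prod_{k=1}^{T} p_\theta(z_{t_{k-1}}, \tau_{t_{k-1}} \mid z_{t_k}, \tau_{t_k}).$$
Each reverse step is decomposed as:
$$p_\theta(z_{t_{k-1}}, \tau_{t_{k-1}} \mid z_{t_k}, \tau_{t_k}) = p_\theta(z_{t_{k-1}} \mid z_{t_k}, \tau_{t_k}, \tau_{t_{k-1}})\; p_\theta(\tau_{t_{k-1}} \mid z_{t_k}, \tau_{t_k}).$$
Here, the first factor is implemented via a Transformer-based decoder that predicts the denoised embedding $\hat{z}_0$, while the second factor, the time predictor, is a compact classifier estimating the next per-token intrinsic time based on the current state and, optionally, a context encoding.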
Pseudo-labels for the time predictor are constructed by ranking per-token denoising confidence (combined embedding and anchor loss), mapping these normalized ranks through the Poisson inverse CDF, and applying variance-controlled clipping. This mechanism focuses predictor supervision on the subset of tokens most challenging to denoise, aligning the model’s learning trajectory with the per-token semantic structure of each sequence (Li et al., 28 May 2025).
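The pseudo-label construction can be sketched as follows. This is an illustrative reading of the description above, not the authors' code. It assumes that higher per-token loss (lower confidence) maps to a larger intrinsic time, that the Poisson mean is `s_max * t`, and that capping the count at `s_max` stands in for the variance-controlled clipping. The function names and the `s_max` parameter are hypothetical.

```python
import math

def poisson_quantile(u, mean, k_max):
    """Smallest k <= k_max with P(Poisson(mean) <= k) >= u (inverse CDF, capped)."""
    k = 0
    pmf = math.exp(-mean)   # P(X = 0)
    cdf = pmf
    while cdf < u and k < k_max:
        k += 1
        pmf *= mean / k     # P(X = k) obtained from P(X = k - 1)
        cdf += pmf
    return k

def time_pseudo_labels(token_losses, t, s_max):
    """Map loss ranks to intrinsic times in [0, 1]; higher loss gets a larger time."""
    n = len(token_losses)
    order = sorted(range(n), key=lambda i: token_losses[i])  # lowest loss first
    labels = [0.0] * n
    for rank, i in enumerate(order):
        u = (rank + 0.5) / n                                 # normalized rank in (0, 1)
        labels[i] = poisson_quantile(u, s_max * t, s_max) / s_max
    return labels

labels = time_pseudo_labels([0.1, 2.0, 0.5, 1.0], t=0.5, s_max=10)
```

Because the labels depend only on the rank of each token's loss, the resulting intrinsic times follow the Poisson marginal regardless of the absolute loss scale.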
4. Training Objective and Optimization
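NeoDiff is trained with a composite objective, minimizing a variational bound supplemented by an anchor loss to prevent embedding collapse. The objective function is:
$$\mathcal{L} = \mathcal{L}_{\mathrm{VLB}} + \mathcal{L}_{\mathrm{anchor}}.$$
The anchor term enforces consistency with gold or conditioning tokens via self-conditioning cross-entropy: $\mathcal{L}_{\mathrm{anchor}} = -\log p_\theta(x \mid \hat{z}_0)$, where $x$ denotes the target tokens and $\hat{z}_0$ the predicted embeddings.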
All major schedules are precomputed. Training involves Monte Carlo sampling of extrinsic and intrinsic times, stochastic re-embedding, and backpropagation through both decoder and predictor (Li et al., 28 May 2025).
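As a rough illustration of the anchor term, the sketch below computes a cross-entropy between gold tokens and the predicted embeddings. It assumes that token logits are dot products between a predicted embedding and each row of the embedding table, which is a common choice in embedding-space text diffusion but is not confirmed by the source. The name `anchor_loss` and its arguments are hypothetical.

```python
import math

def anchor_loss(predicted_embeddings, target_ids, embedding_table):
    """Mean cross-entropy of gold tokens under a softmax over dot-product logits."""
    total = 0.0
    for vec, gold in zip(predicted_embeddings, target_ids):
        # Assumed scoring rule: similarity of the prediction to every vocabulary embedding.
        logits = [sum(a * b for a, b in zip(vec, emb)) for emb in embedding_table]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[gold]
    return total / len(target_ids)

table = [[1.0, 0.0], [0.0, 1.0]]
good = anchor_loss([[5.0, 0.0]], [0], table)  # prediction aligned with the gold token
bad = anchor_loss([[5.0, 0.0]], [1], table)   # prediction aligned with the wrong token
```

The loss is small when the predicted embedding points at the gold token and large otherwise, which penalizes embeddings that drift toward a collapsed, indistinguishable configuration.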
5. Inference Schedule and Bayesian Optimization
During inference, the model deviates from the uniformly incremented diffusion steps used in training. Instead, a subset of $K$ extrinsic time points is selected (with $K$ smaller than the number of training steps) to reduce sampling complexity while preserving generation quality. This schedule is selected via Gaussian-process Bayesian optimization, using an acquisition strategy such as GP-Hedge and validation BLEU (or another task-relevant metric) as the objective. Typically, up to 100 rounds of optimization are run, and the best-performing schedule is adopted for sampling (Li et al., 28 May 2025).
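The search loop over schedules can be sketched as below. To stay dependency-free, this sketch swaps in plain random search for the Gaussian-process Bayesian optimization described above, so it shows only the schedule parameterization and the evaluate-and-keep-best loop. The scoring function `toy_score` is a placeholder for a real validation metric such as BLEU, and all names are hypothetical.

```python
import random

def sample_schedule(num_steps, rng):
    """Draw a decreasing list of extrinsic times from 1.0 down to 0.0."""
    interior = sorted((rng.random() for _ in range(num_steps - 1)), reverse=True)
    return [1.0] + interior + [0.0]

def search_schedule(score_fn, num_steps, num_rounds, seed=0):
    """Random search (stand-in for Bayesian optimization) over inference schedules."""
    rng = random.Random(seed)
    best_schedule, best_score = None, float("-inf")
    for _ in range(num_rounds):
        candidate = sample_schedule(num_steps, rng)
        score = score_fn(candidate)  # in practice: decode a validation set and score it
        if score > best_score:
            best_schedule, best_score = candidate, score
    return best_schedule, best_score

def toy_score(schedule):
    """Placeholder objective that rewards evenly spaced steps; not a real BLEU score."""
    gaps = [a - b for a, b in zip(schedule, schedule[1:])]
    return -max(gaps)

best_schedule, best_score = search_schedule(toy_score, num_steps=20, num_rounds=100)
```

A Gaussian-process optimizer would replace the random proposals with proposals chosen by an acquisition function. For example, scikit-optimize's `gp_minimize` accepts `acq_func="gp_hedge"`, although whether the authors used that library is not stated in the source.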
Inference proceeds by initializing all tokens at maximally diffused states, then iteratively applying the time predictor and decoder to each scheduled time step, culminating in token retrieval via nearest-neighbor or argmax search in embedding space.
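The sampling loop described above can be sketched as follows, under several assumptions that the source does not confirm: the decoder is called before the time predictor, each prediction is re-noised to its own intrinsic time with a linear variance-preserving schedule, and final tokens are retrieved by nearest neighbor in Euclidean distance. The `decoder` and `time_predictor` callables are hypothetical stand-ins for the trained networks.

```python
import math
import random

def nearest_token(vector, embedding_table):
    """Index of the embedding closest to `vector` in squared Euclidean distance."""
    def sq_dist(emb):
        return sum((a - b) ** 2 for a, b in zip(vector, emb))
    return min(range(len(embedding_table)), key=lambda i: sq_dist(embedding_table[i]))

def generate(decoder, time_predictor, schedule, num_tokens, dim, embedding_table, seed=0):
    """Iterative denoising over a decreasing schedule of extrinsic times."""
    rng = random.Random(seed)
    # Start every token at the maximally diffused state: pure Gaussian noise, tau = 1.
    z = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
    tau = [1.0] * num_tokens
    for t_next in schedule[1:]:
        z0_hat = decoder(z, tau)                   # predicted clean embeddings
        tau = time_predictor(z0_hat, tau, t_next)  # next per-token intrinsic times
        # Re-noise each prediction to its own intrinsic time (assumed linear schedule).
        z = [[math.sqrt(1.0 - s) * x + math.sqrt(s) * rng.gauss(0.0, 1.0) for x in tok]
             for tok, s in zip(z0_hat, tau)]
    return [nearest_token(tok, embedding_table) for tok in z]

# Toy stand-ins: an oracle decoder and a predictor that sets every tau to t_next.
table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target = [2, 1, 0]
oracle_decoder = lambda z, tau: [list(table[i]) for i in target]
uniform_predictor = lambda z0_hat, tau, t_next: [t_next] * len(tau)
tokens = generate(oracle_decoder, uniform_predictor, [1.0, 0.5, 0.0], 3, 2, table)
```

With a learned time predictor, tokens the model is confident about would receive smaller intrinsic times and settle earlier, while uncertain tokens would remain noisier for more steps.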
6. Empirical Evaluation
NeoDiff has been evaluated on IWSLT’14 De→En, WMT’14 En→De, WMT’16 En→Ro (machine translation), QQP (paraphrasing), Wiki-Auto (text simplification), and Quasar-T (question generation). Core findings include:
- On WMT’14 En→De, BLEU (beam=1/10): NeoDiff attains 24.41/25.28, outperforming Difformer (23.80/23.26) and SeqDiffuSeq (23.63/24.24).
- On QQP, single-sample BLEU: NeoDiff 29.47 versus SeqDiffuSeq 23.28 and Difformer 28.52.
- Evaluation with DeepSeek-V3 as an LLM judge, used as a proxy for human assessment, shows improvements in semantic faithfulness and completeness on QQP, with parity or superiority in translation metrics (accuracy, fluency, creativity).
- Ablation studies on IWSLT’14 De→En: baseline continuous diffusion 32.09 BLEU; with Poisson process 32.75; adding time predictor 32.97; full model with optimized inference schedule 33.14, yielding a cumulative gain of +1.05 BLEU.
- Inference efficiency (IWSLT’14 De→En): NeoDiff (K=20) achieves 5.12 sentences per second and 2.08 GB RAM, with negligible overhead compared to Difformer.
NeoDiff is reported to improve by 0.5–1.1 BLEU over state-of-the-art non-autoregressive baselines spanning continuous, discrete, and hybrid iterative methods, demonstrating the practical benefit of the unified, non-simultaneous noise scheduling and Bayesian schedule optimization (Li et al., 28 May 2025).
7. Synthesis and Implications
NeoDiff defines a new paradigm in text diffusion by formalizing the notion of non-simultaneous, per-token continuous noising governed by a stochastic Poisson process. Its bi-temporal architecture and optimized inference schedule integrate the discrete and continuous perspectives, facilitating context-sensitive denoising and yielding notable gains in empirical text generation quality. The adoption of variance-controlled, token-specific intrinsic times enables a spectrum of semantic guidance and fine-grained control not achievable in previous paradigms. A plausible implication is that the NeoDiff framework can serve as a foundation for further advances in controllable, flexible generative modeling for text, especially in settings where token-level semantic structure plays a central role (Li et al., 28 May 2025).