Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation (2505.18853v1)

Published 24 May 2025 in cs.CL

Abstract: Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respect discreteness but disregard semantic relation between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. Our code is available at https://github.com/ashaba1in/smoothie.

Summary

  • The paper introduces Smoothie, a novel diffusion model that balances discrete token representations with semantic information through smoothing on embedding distances.
  • The methodology leverages fixed pre-trained BERT embeddings and a custom noise scheduler to gradually add and then remove noise in the token embedding space.
  • Experimental results on tasks such as paraphrase generation and summarization show that Smoothie outperforms existing diffusion models and competes with autoregressive approaches.

This paper introduces Smoothie, a novel diffusion model for text generation that aims to combine the strengths of existing approaches by effectively handling both the discrete nature of text and the semantic relationships between tokens.

The core challenge in applying diffusion models (highly successful in continuous domains like images) to text is its discrete nature. Previous methods either apply Gaussian diffusion in continuous latent embedding spaces (good for semantics, bad for decoding back to tokens) or operate in categorical/simplex spaces (good for discreteness, bad for semantics). Smoothie proposes a middle ground.

Smoothie's Approach: Smoothing Diffusion on Token Embeddings

  1. Latent Space Construction:
    • Each token $w_i^y$ in a target sequence is represented by a row of the matrix $\mathbf{D}_0$: a vector of negative halved squared Euclidean distances between its embedding $\mathbf{E}_{w^y_i}$ and the embeddings $\mathbf{E}_j$ of all tokens in the vocabulary.
    • Specifically, for a sequence of length $m$ and vocabulary size $V$, $\mathbf{D}_0$ is an $m \times V$ matrix whose element $(i, j)$ is:

      $$(\mathbf{D}_0)_{i,j} = -\frac{\|\mathbf{E}_{w^y_i} - \mathbf{E}_j\|^2}{2}$$

    • This representation inherently captures semantic similarity: tokens whose embeddings are close to $\mathbf{E}_{w^y_i}$ receive values near zero, while distant tokens receive strongly negative values. The paper uses fixed, pre-trained BERT embeddings (see the code sketches after this list).

  2. Forward Diffusion Process (Noising):
    • A non-Markovian process gradually adds noise to $\mathbf{D}_0$ to obtain $\mathbf{D}_t$:

      $$q(\mathbf{D}_t \mid \mathbf{D}_0) = \mathcal{N}\left(\mathbf{D}_t \,\bigg|\, \frac{1}{\sigma_t^2}\mathbf{D}_0,\; \delta^2 I\right)$$

    • $\sigma_t$ is a noise scheduler ($1 < \sigma_1 < \dots < \sigma_T$) that controls the "bandwidth" of a Gaussian kernel. As $\sigma_t$ increases, information is progressively smoothed out.
    • $\delta$ controls the stochasticity of the forward process (kept constant and set to 1 during training).
    • The model input at timestep $t$ is $\mathbf{p}_t = \operatorname{softmax}(\mathbf{D}_t)$. This can be interpreted as a Nadaraya-Watson kernel estimator over all vocabulary embeddings, where $\sigma_t$ defines the kernel bandwidth. As $\sigma_t$ increases, probability mass spreads from the original token first to semantically similar tokens, then to dissimilar ones.
    • The paper notes that this formulation generalizes simplex-based diffusion, which can be seen as using a trivial distance metric.

  3. Reverse Diffusion Process (Denoising):
    • A neural network is trained to reverse the noising process. Instead of directly predicting $\mathbf{D}_0 / \sigma_t^2$, the paper leverages a key insight (Theorem 4.1):
    • Theorem 4.1: Optimizing $\|\mathbf{D}_0(\mathbf{E}_{\mathbf{w}^y}) - g_{\theta}(\mathbf{p}_t, t)\|^2$ is equivalent (up to a constant) to optimizing $\|\mathbf{E}_{\mathbf{w}^y} - f_{\theta}(\mathbf{p}_t, t)\|^2$, with the optimal predictor given by $g^* = \mathbf{D}_0(f^*(\mathbf{p}_t, t))$.
    • Thus, the model $f_\theta$ is trained to predict the original token embeddings $\mathbf{E}_{\mathbf{w}^y}$ from the noisy input $\mathbf{p}_t$ and timestep $t$:

      $$L_{\mathbf{E}}(\theta) = \mathbb{E}_{\mathbf{w}^y, t, \mathbf{p}_t} \left[\|\mathbf{E}_{\mathbf{w}^y} - f_{\theta}(\mathbf{p}_t, t)\|^2 \right]$$

    • During sampling, starting from $\mathbf{D}_T \sim \mathcal{N}(0, \tilde{\delta}^2 I)$, the model iteratively refines $\mathbf{D}_t$ into $\mathbf{D}_{t-1}$ using:

      $$\mathbf{D}_{t-1} = \frac{1}{\sigma_{t-1}^2} \mathbf{D}_0(f_{\theta}(\mathbf{p}_t, t)) + \tilde{\delta} \varepsilon$$

      where $\mathbf{D}_0(f_{\theta}(\mathbf{p}_t, t))$ is constructed from the predicted embeddings and $\tilde{\delta}$ is the noise standard deviation of the reverse process, allowing control over generation stochasticity (see the training and sampling sketches after this list).

  4. Noise Scheduler:
    • A custom scheduler is used for $\sigma_t$, designed to add more noise at early stages:

      $$\sigma_t = (\sigma_{\text{max}} - \sigma_{\text{min}}) \frac{2}{\pi} \arctan\left(\frac{1}{d} \sqrt{\frac{t}{T - t + \epsilon}}\right) + \sigma_{\text{min}}$$

    • Typical values: $\sigma_{\text{min}} = 1.5$, $\sigma_{\text{max}} = 200$, $d = 9$.

  5. Decoding:
    • Once $\mathbf{D}_0$ is obtained, tokens are decoded by selecting, for each row, the vocabulary item with the highest value: $\operatorname{argmax}(\mathbf{D}_0)$. Because the entries are negative squared distances, this amounts to picking the token whose embedding is closest to the predicted embedding. The paper finds that this simple argmax decoder works well and that a more complex Transformer-based decoder has negligible impact.
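
The following is a minimal sketch of the latent-space construction in step 1 (not the authors' implementation; it assumes a frozen embedding table `E` of shape `(V, d)`, e.g. BERT input embeddings, and target token ids `y`):

```python
import torch

def build_D0(E: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Build the m x V matrix of negative halved squared Euclidean distances.

    E: (V, d) fixed vocabulary embeddings (e.g. frozen, normalized BERT embeddings).
    y: (m,) target token ids.
    Returns D0 with D0[i, j] = -||E[y[i]] - E[j]||^2 / 2.
    """
    Ey = E[y]                                # (m, d) embeddings of the target tokens
    sq_dist = torch.cdist(Ey, E, p=2) ** 2   # (m, V) pairwise squared distances
    return -0.5 * sq_dist
```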
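
The forward process (step 2) and the arctan scheduler (step 4) can be sketched as follows (hypothetical helper names; the default values match the typical hyperparameters reported above, and `delta` is fixed to 1 during training):

```python
import math
import torch

def sigma_schedule(t: int, T: int, sigma_min: float = 1.5,
                   sigma_max: float = 200.0, d: float = 9.0,
                   eps: float = 1e-6) -> float:
    """Arctan schedule: sigma_t grows quickly at early steps and saturates near sigma_max."""
    return (sigma_max - sigma_min) * (2.0 / math.pi) * math.atan(
        math.sqrt(t / (T - t + eps)) / d
    ) + sigma_min

def forward_noise(D0: torch.Tensor, sigma_t: float, delta: float = 1.0) -> torch.Tensor:
    """Sample D_t ~ N(D_0 / sigma_t^2, delta^2 I) and return the model input p_t = softmax(D_t)."""
    D_t = D0 / sigma_t**2 + delta * torch.randn_like(D0)
    return torch.softmax(D_t, dim=-1)   # (m, V) smoothed distributions over the vocabulary
```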
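
Training (step 3) then reduces to an MSE between the clean token embeddings and the model's prediction. A sketch reusing the helpers above, where `model(p_t, t)` stands for a hypothetical denoiser returning per-token embedding predictions of shape `(m, d)`:

```python
def training_loss(model, E: torch.Tensor, y: torch.Tensor,
                  T: int, delta: float = 1.0) -> torch.Tensor:
    """L_E(theta): predict the clean embeddings E[y] from the smoothed input p_t."""
    t = int(torch.randint(1, T + 1, (1,)))            # sample a random timestep
    D0 = build_D0(E, y)                               # first sketch above
    p_t = forward_noise(D0, sigma_schedule(t, T), delta)
    pred = model(p_t, t)                              # (m, d) predicted embeddings
    return ((E[y] - pred) ** 2).mean()                # MSE against the true embeddings
```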
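
Sampling (step 3) iterates the refinement rule and finishes with the argmax decoding of step 5. In this sketch, the reverse-process noise scale `delta_tilde` defaults to the seq-to-seq value quoted above, and the final tokens are read off the last reconstructed $\mathbf{D}_0$ (an assumption about the last-step handling):

```python
@torch.no_grad()
def sample(model, E: torch.Tensor, m: int, T: int,
           delta_tilde: float = 0.25) -> torch.Tensor:
    """Refine D_t -> D_{t-1}, then decode each row by argmax (i.e. nearest vocabulary embedding)."""
    V = E.shape[0]
    D = delta_tilde * torch.randn(m, V)                 # D_T ~ N(0, delta_tilde^2 I)
    for t in range(T, 0, -1):
        p_t = torch.softmax(D, dim=-1)
        pred = model(p_t, t)                            # (m, d) predicted clean embeddings
        D0_hat = -0.5 * torch.cdist(pred, E, p=2) ** 2  # D_0 rebuilt from the predictions
        if t > 1:                                       # produce D_{t-1} for the next iteration
            D = D0_hat / sigma_schedule(t - 1, T) ** 2 + delta_tilde * torch.randn(m, V)
    return D0_hat.argmax(dim=-1)                        # row-wise argmax = closest embedding
```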

Implementation Details:

  • Embeddings: Pre-trained, fixed BERT embeddings are used, normalized to zero mean and unit variance.
  • Model Input: The model $f_\theta$ does not process $\mathbf{p}_t$ directly. Instead, it takes a weighted average of token embeddings, $\mathbf{p}_t \mathbf{E}$, making the input lower-dimensional (see the sketch after this list).
  • Architecture: Based on TEncDM (2402.19097), using Transformer decoder layers with UNet-style skip connections (12 layers, ~100M params). For conditional tasks, a 6-layer Transformer encoder processes the source sequence, and its output is integrated via cross-attention. Timestep embeddings are added to each Transformer block.
  • Hyperparameter $\tilde{\delta}$: Crucial for balancing generation quality and diversity. A lower $\tilde{\delta}$ (e.g., 0.25 for seq-to-seq) yields better perplexity but lower diversity; a higher $\tilde{\delta}$ might be better for unconditional generation.
  • Sequence Length: Sequences are padded to a fixed maximum length (dataset-specific, ~99th percentile).
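
As a small illustration of the model-input bullet above, the denoiser consumes a weighted average of embeddings rather than the full vocabulary-sized distribution; a one-line sketch (hypothetical helper name):

```python
import torch

def model_input(p_t: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """Project the (m, V) smoothed distributions to (m, d) by averaging vocabulary embeddings."""
    return p_t @ E
```

In the sketches above, `model(p_t, t)` would apply this projection internally before its Transformer layers.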

Experimental Evaluation:

  • Tasks and Datasets:
    • Paraphrase Generation: Quora Question Pairs (QQP)
    • Question Generation: Quasar-T
    • Text Simplification: Newsela-Auto
    • Summarization: XSum
    • Unconditional Generation (for the $\tilde{\delta}$ ablation): ROCStories
  • Metrics: BLEU, ROUGE-1/2/L, BERTScore (BS), n-gram diversity (Div-1/4), SARI (for simplification), MAUVE.
  • Baselines:
    • Diffusion: DiffuSeq, SeqDiffuSeq, SSD-LM, TESS, AR-Diffusion, GENIE.
    • Autoregressive: BART, GPT-2, GPVAE-T5, FLAN-T5, standard Transformer.
    • Ablation: Smoothie framework with embedding-space diffusion and simplex-space diffusion.
  • Results:
    • Smoothie generally outperforms other diffusion-based models on most sequence-to-sequence tasks.
    • It achieves competitive results compared to strong autoregressive models like BART.
    • Ablation studies show Smoothie's distance-based latent space performs better than standard embedding space and categorical simplex space within the same architectural framework.
    • Achieves the best mean rank across Quasar-T, Newsela-Auto, and QQP datasets compared to other diffusion models.
    • The number of denoising steps impacts performance, with more complex tasks (XSum, Newsela-Auto) benefiting from more steps (e.g., 200-500), while others are stable with fewer steps. Smoothie generally requires fewer steps (e.g., 200) than many competing diffusion methods.
    • Self-conditioning (+SC) provided only marginal gains and was not used generally due to increased training time.

Contributions:

  1. A novel text diffusion framework (Smoothie) that simultaneously respects text discreteness and progressively removes semantic information by smoothing based on embedding similarity.
  2. Empirical evidence of Smoothie's effectiveness across multiple sequence-to-sequence tasks, outperforming existing diffusion methods and showing strong performance against autoregressive models.

Practical Implications:

  • Smoothie offers a new way to design diffusion models for text that effectively leverages semantic information from pre-trained embeddings during the diffusion process itself, not just as an input/output representation.
  • The method of representing tokens by distances to all vocabulary embeddings and then applying softmax to the noised distances provides a principled way to smooth information.
  • The finding that a simpler MSE loss on embeddings ($L_{\mathbf{E}}(\theta)$) is effective and that complex decoders are not necessary simplifies implementation.
  • The hyperparameter $\tilde{\delta}$ provides a direct knob to control the trade-off between fluency/accuracy and diversity, which can be tuned per application.
  • The architecture, while Transformer-based, uses specific UNet-style skip connections, which might be important for performance.
  • The technique could potentially be extended to other categorical data domains where a meaningful distance metric between categories exists.

Limitations:

  • Relies on fixed pre-trained embeddings; end-to-end training might offer further improvements but could be less stable.
  • Operates on fixed-length sequences, requiring padding, which is computationally inefficient.