Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation (2505.18853v1)

Published 24 May 2025 in cs.CL

Abstract: Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respect discreteness but disregard semantic relation between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. Our code is available at https://github.com/ashaba1in/smoothie.

Summary

  • The paper introduces Smoothie, a novel diffusion model that balances discrete token representations with semantic information through smoothing on embedding distances.
  • The methodology leverages fixed pre-trained BERT embeddings and a custom noise scheduler to gradually add and then remove noise in the token embedding space.
  • Experimental results on tasks such as paraphrase generation and summarization show that Smoothie outperforms existing diffusion models and competes with autoregressive approaches.

This paper introduces Smoothie, a novel diffusion model for text generation that aims to combine the strengths of existing approaches by effectively handling both the discrete nature of text and the semantic relationships between tokens.

The core challenge in applying diffusion models (highly successful in continuous domains like images) to text is its discrete nature. Previous methods either apply Gaussian diffusion in continuous latent embedding spaces (good for semantics, bad for decoding back to tokens) or operate in categorical/simplex spaces (good for discreteness, bad for semantics). Smoothie proposes a middle ground.

Smoothie's Approach: Smoothing Diffusion on Token Embeddings

  1. Latent Space Construction:
    • Each token $w_i^y$ in a target sequence is represented by a row of the matrix $\mathbf{D}_0$: a vector of negative halved squared Euclidean distances between its embedding $\mathbf{E}_{w^y_i}$ and the embeddings $\mathbf{E}_j$ of all tokens in the vocabulary.
    • Specifically, for a sequence of length $m$ and vocabulary size $V$, $\mathbf{D}_0$ is an $m \times V$ matrix whose element $(i, j)$ is:

      $$(\mathbf{D}_0)_{i,j} = -\frac{\|\mathbf{E}_{w^y_i} - \mathbf{E}_j\|^2}{2}$$

    • This representation inherently captures semantic similarity: tokens whose embeddings are close to $\mathbf{E}_{w^y_i}$ receive values near zero, while distant tokens receive strongly negative values. The paper uses fixed, pre-trained BERT embeddings (see the code sketches after this list).

  2. Forward Diffusion Process (Noising):
    • A non-Markovian process gradually adds noise to $\mathbf{D}_0$ to obtain $\mathbf{D}_t$:

      $$q(\mathbf{D}_t \mid \mathbf{D}_0) = \mathcal{N}\left(\mathbf{D}_t \,\bigg|\, \frac{1}{\sigma_t^2}\mathbf{D}_0,\; \delta^2 I\right)$$

    • $\sigma_t$ is a noise scheduler ($1 < \sigma_1 < \dots < \sigma_T$) that controls the "bandwidth" of a Gaussian kernel. As $\sigma_t$ increases, information is progressively smoothed out.
    • $\delta$ controls the stochasticity of the forward process (kept constant and set to 1 during training).
    • The model input at timestep $t$ is $\mathbf{p}_t = \operatorname{softmax}(\mathbf{D}_t)$. This can be interpreted as a Nadaraya-Watson kernel estimator over all vocabulary embeddings, where $\sigma_t$ defines the kernel bandwidth. As $\sigma_t$ increases, probability mass spreads from the original token first to semantically similar tokens, then to dissimilar ones.
    • The paper notes that this formulation generalizes simplex-based diffusion, which can be seen as using a trivial distance metric.

  3. Reverse Diffusion Process (Denoising):
    • A neural network is trained to reverse the noising process. Instead of directly predicting $\mathbf{D}_0 / \sigma_t^2$, the paper leverages a key insight (Theorem 4.1):
    • Theorem 4.1: Optimizing $\|\mathbf{D}_0(\mathbf{E}_{\mathbf{w}^y}) - g_{\theta}(\mathbf{p}_t, t)\|^2$ is equivalent (up to a constant) to optimizing $\|\mathbf{E}_{\mathbf{w}^y} - f_{\theta}(\mathbf{p}_t, t)\|^2$, with the optimal predictor given by $g^* = \mathbf{D}_0(f^*(\mathbf{p}_t, t))$.
    • Thus, the model $f_\theta$ is trained to predict the original token embeddings $\mathbf{E}_{\mathbf{w}^y}$ from the noisy input $\mathbf{p}_t$ and timestep $t$:

      $$L_{\mathbf{E}}(\theta) = \mathbb{E}_{\mathbf{w}^y, t, \mathbf{p}_t} \left[\|\mathbf{E}_{\mathbf{w}^y} - f_{\theta}(\mathbf{p}_t, t)\|^2 \right]$$

    • During sampling, starting from $\mathbf{D}_T \sim \mathcal{N}(0, \tilde{\delta}^2 I)$, the model iteratively refines $\mathbf{D}_t$ into $\mathbf{D}_{t-1}$ using:

      $$\mathbf{D}_{t-1} = \frac{1}{\sigma_{t-1}^2} \mathbf{D}_0(f_{\theta}(\mathbf{p}_t, t)) + \tilde{\delta} \varepsilon$$

      where $\mathbf{D}_0(f_{\theta}(\mathbf{p}_t, t))$ is constructed from the predicted embeddings and $\tilde{\delta}$ is the noise standard deviation of the reverse process, allowing control over generation stochasticity (see the training and sampling sketches after this list).

  4. Noise Scheduler:
    • A custom scheduler is used for $\sigma_t$, designed to add more noise at early stages:

      $$\sigma_t = (\sigma_{\text{max}} - \sigma_{\text{min}}) \frac{2}{\pi} \arctan\left(\frac{1}{d} \sqrt{\frac{t}{T - t + \epsilon}}\right) + \sigma_{\text{min}}$$

    • Typical values: $\sigma_{\text{min}} = 1.5$, $\sigma_{\text{max}} = 200$, $d = 9$.

  5. Decoding:
    • Once $\mathbf{D}_0$ is obtained, tokens are decoded by selecting, for each row, the vocabulary item with the highest value: $\operatorname{argmax}(\mathbf{D}_0)$. Because the entries are negative squared distances, this amounts to picking the token whose embedding is closest to the predicted embedding. The paper finds that this simple argmax decoder works well and that a more complex Transformer-based decoder has negligible impact.
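
The following is a minimal sketch of the latent-space construction in step 1 (not the authors' implementation; it assumes a frozen embedding table `E` of shape `(V, d)`, e.g. BERT input embeddings, and target token ids `y`):

```python
import torch

def build_D0(E: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Build the m x V matrix of negative halved squared Euclidean distances.

    E: (V, d) fixed vocabulary embeddings (e.g. frozen, normalized BERT embeddings).
    y: (m,) target token ids.
    Returns D0 with D0[i, j] = -||E[y[i]] - E[j]||^2 / 2.
    """
    Ey = E[y]                                # (m, d) embeddings of the target tokens
    sq_dist = torch.cdist(Ey, E, p=2) ** 2   # (m, V) pairwise squared distances
    return -0.5 * sq_dist
```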
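
The forward process (step 2) and the arctan scheduler (step 4) can be sketched as follows (hypothetical helper names; the default values match the typical hyperparameters reported above, and `delta` is fixed to 1 during training):

```python
import math
import torch

def sigma_schedule(t: int, T: int, sigma_min: float = 1.5,
                   sigma_max: float = 200.0, d: float = 9.0,
                   eps: float = 1e-6) -> float:
    """Arctan schedule: sigma_t grows quickly at early steps and saturates near sigma_max."""
    return (sigma_max - sigma_min) * (2.0 / math.pi) * math.atan(
        math.sqrt(t / (T - t + eps)) / d
    ) + sigma_min

def forward_noise(D0: torch.Tensor, sigma_t: float, delta: float = 1.0) -> torch.Tensor:
    """Sample D_t ~ N(D_0 / sigma_t^2, delta^2 I) and return the model input p_t = softmax(D_t)."""
    D_t = D0 / sigma_t**2 + delta * torch.randn_like(D0)
    return torch.softmax(D_t, dim=-1)   # (m, V) smoothed distributions over the vocabulary
```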
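
Training (step 3) then reduces to an MSE between the clean token embeddings and the model's prediction. A sketch reusing the helpers above, where `model(p_t, t)` stands for a hypothetical denoiser returning per-token embedding predictions of shape `(m, d)`:

```python
def training_loss(model, E: torch.Tensor, y: torch.Tensor,
                  T: int, delta: float = 1.0) -> torch.Tensor:
    """L_E(theta): predict the clean embeddings E[y] from the smoothed input p_t."""
    t = int(torch.randint(1, T + 1, (1,)))            # sample a random timestep
    D0 = build_D0(E, y)                               # first sketch above
    p_t = forward_noise(D0, sigma_schedule(t, T), delta)
    pred = model(p_t, t)                              # (m, d) predicted embeddings
    return ((E[y] - pred) ** 2).mean()                # MSE against the true embeddings
```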
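
Sampling (step 3) iterates the refinement rule and finishes with the argmax decoding of step 5. In this sketch, the reverse-process noise scale `delta_tilde` defaults to the seq-to-seq value quoted above, and the final tokens are read off the last reconstructed $\mathbf{D}_0$ (an assumption about the last-step handling):

```python
@torch.no_grad()
def sample(model, E: torch.Tensor, m: int, T: int,
           delta_tilde: float = 0.25) -> torch.Tensor:
    """Refine D_t -> D_{t-1}, then decode each row by argmax (i.e. nearest vocabulary embedding)."""
    V = E.shape[0]
    D = delta_tilde * torch.randn(m, V)                 # D_T ~ N(0, delta_tilde^2 I)
    for t in range(T, 0, -1):
        p_t = torch.softmax(D, dim=-1)
        pred = model(p_t, t)                            # (m, d) predicted clean embeddings
        D0_hat = -0.5 * torch.cdist(pred, E, p=2) ** 2  # D_0 rebuilt from the predictions
        if t > 1:                                       # produce D_{t-1} for the next iteration
            D = D0_hat / sigma_schedule(t - 1, T) ** 2 + delta_tilde * torch.randn(m, V)
    return D0_hat.argmax(dim=-1)                        # row-wise argmax = closest embedding
```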

Implementation Details:

  • Embeddings: Pre-trained, fixed BERT embeddings are used, normalized to zero mean and unit variance.
  • Model Input: The model $f_\theta$ does not process $\mathbf{p}_t$ directly. Instead, it takes a weighted average of token embeddings, $\mathbf{p}_t \mathbf{E}$, making the input lower-dimensional (see the sketch after this list).
  • Architecture: Based on TEncDM (2402.19097), using Transformer decoder layers with UNet-style skip connections (12 layers, ~100M params). For conditional tasks, a 6-layer Transformer encoder processes the source sequence, and its output is integrated via cross-attention. Timestep embeddings are added to each Transformer block.
  • Hyperparameter $\tilde{\delta}$: Crucial for balancing generation quality and diversity. A lower $\tilde{\delta}$ (e.g., 0.25 for seq-to-seq) yields better perplexity but lower diversity; a higher $\tilde{\delta}$ might be better for unconditional generation.
  • Sequence Length: Sequences are padded to a fixed maximum length (dataset-specific, ~99th percentile).
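
As a small illustration of the model-input bullet above, the denoiser consumes a weighted average of embeddings rather than the full vocabulary-sized distribution; a one-line sketch (hypothetical helper name):

```python
import torch

def model_input(p_t: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """Project the (m, V) smoothed distributions to (m, d) by averaging vocabulary embeddings."""
    return p_t @ E
```

In the sketches above, `model(p_t, t)` would apply this projection internally before its Transformer layers.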

Experimental Evaluation:

  • Tasks and Datasets:
    • Paraphrase Generation: Quora Question Pairs (QQP)
    • Question Generation: Quasar-T
    • Text Simplification: Newsela-Auto
    • Summarization: XSum
    • Unconditional Generation (for the $\tilde{\delta}$ ablation): ROCStories
  • Metrics: BLEU, ROUGE-1/2/L, BERTScore (BS), n-gram diversity (Div-1/4), SARI (for simplification), MAUVE.
  • Baselines:
    • Diffusion: DiffuSeq, SeqDiffuSeq, SSD-LM, TESS, AR-Diffusion, GENIE.
    • Autoregressive: BART, GPT-2, GPVAE-T5, FLAN-T5, standard Transformer.
    • Ablation: Smoothie framework with embedding-space diffusion and simplex-space diffusion.
  • Results:
    • Smoothie generally outperforms other diffusion-based models on most sequence-to-sequence tasks.
    • It achieves competitive results compared to strong autoregressive models like BART.
    • Ablation studies show Smoothie's distance-based latent space performs better than standard embedding space and categorical simplex space within the same architectural framework.
    • Achieves the best mean rank across Quasar-T, Newsela-Auto, and QQP datasets compared to other diffusion models.
    • The number of denoising steps impacts performance, with more complex tasks (XSum, Newsela-Auto) benefiting from more steps (e.g., 200-500), while others are stable with fewer steps. Smoothie generally requires fewer steps (e.g., 200) than many competing diffusion methods.
    • Self-conditioning (+SC) provided only marginal gains and was not used generally due to increased training time.

Contributions:

  1. A novel text diffusion framework (Smoothie) that simultaneously respects text discreteness and progressively removes semantic information by smoothing based on embedding similarity.
  2. Empirical evidence of Smoothie's effectiveness across multiple sequence-to-sequence tasks, outperforming existing diffusion methods and showing strong performance against autoregressive models.

Practical Implications:

  • Smoothie offers a new way to design diffusion models for text that effectively leverages semantic information from pre-trained embeddings during the diffusion process itself, not just as an input/output representation.
  • The method of representing tokens by distances to all vocabulary embeddings and then applying softmax to the noised distances provides a principled way to smooth information.
  • The finding that a simpler MSE loss on embeddings ($L_{\mathbf{E}}(\theta)$) is effective and that complex decoders are not necessary simplifies implementation.
  • The hyperparameter $\tilde{\delta}$ provides a direct knob to control the trade-off between fluency/accuracy and diversity, which can be tuned per application.
  • The architecture, while Transformer-based, uses specific UNet-style skip connections, which might be important for performance.
  • The technique could potentially be extended to other categorical data domains where a meaningful distance metric between categories exists.

Limitations:

  • Relies on fixed pre-trained embeddings; end-to-end training might offer further improvements but could be less stable.
  • Operates on fixed-length sequences, requiring padding, which is computationally inefficient.