Diffusion models, which excel at generating continuous data like images, face unique challenges when applied to discrete data modalities such as text. Standard approaches often rely on adding Gaussian noise uniformly to token embeddings and employing less stable methods like k-nearest-neighbor (kNN) rounding to bridge the continuous latent space back to discrete tokens. The paper "A Cheaper and Better Diffusion LLM with Soft-Masked Noise" (Chen et al., 2023) introduces Masked-Diffuse LM, a novel diffusion model designed specifically for language generation that addresses these limitations through a linguistic-informed forward process and a more direct continuous-to-discrete mapping.
The core ideas of Masked-Diffuse LM are:
- Linguistic-Informed Soft-Masking Forward Process: Instead of uniform Gaussian noise, the model employs a structured corruption process informed by linguistic features. Each word is assigned an importance score based on a combination of its TF-IDF weight and an entropy measure, which allows the model to strategically add soft-masked noise to more important words earlier in the forward diffusion process. The intuition is that corrupting harder, more informative words first encourages the model to generate easier, less informative words first during the reverse process, following the "easy-first" generation strategy observed in human language planning. The corruption is applied gradually over the diffusion steps: words are sorted into buckets by importance, and noise is added to a word's hidden representation at a given step once that word's bucket is scheduled for corruption. The added noise is Gaussian, with a square-root schedule controlling its magnitude (a sketch of this forward process appears after this list).
- Direct Continuous-to-Discrete Mapping with Cross-Entropy Loss: In the reverse (denoising) process, instead of minimizing the distance between the intermediate continuous latents and the original embeddings, Masked-Diffuse LM directly predicts the categorical distribution over the vocabulary by applying a linear layer to the denoised hidden states. The model is trained with a weighted cross-entropy loss at each diffusion step, minimizing the cross-entropy between the predicted token distribution and both the original sentence and the partially masked sentence at that step. A weighting factor on the original-sentence term assigns higher weight to tokens that were masked most recently in the forward process. This connects the continuous latent space directly to the discrete token space more stably and efficiently than kNN rounding, especially for high-dimensional embeddings (see the second sketch after this list).
- Integration with Pre-trained Language Models (PLMs): The architecture allows easy integration of PLMs such as BERT. The embedding layer can be replaced with a frozen PLM encoder to obtain initial high-dimensional word representations, and the linear layer that predicts discrete tokens can reuse the PLM's final-layer weights. The direct cross-entropy formulation is particularly beneficial here, as it avoids the performance degradation that kNN rounding suffers with the high-dimensional embeddings of large PLMs. During training, only the transition model (the Transformer) and, optionally, the final linear prediction layer are updated, keeping the PLM weights frozen for efficiency.
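
As a concrete illustration of the soft-masking forward process, the following minimal PyTorch sketch computes an importance score from TF-IDF and a rarity (entropy) term, buckets words by rank, and applies soft-masked Gaussian noise under a square-root schedule. The function names, the way the two terms are combined, and the interpolation toward a MASK embedding are assumptions for illustration, not the authors' reference implementation.

```python
import math
import torch

def word_importance(tfidf, token_probs, lam=0.5):
    """Combine a TF-IDF weight and an entropy-style rarity term into a
    per-word importance score (the balancing factor lam is assumed)."""
    rarity = -torch.log(token_probs + 1e-9)          # rarer words score higher
    return lam * tfidf + (1.0 - lam) * rarity

def sqrt_noise_level(t, T):
    """Square-root noise schedule: noise grows quickly in the early steps."""
    return math.sqrt(t / T)

def soft_mask_forward(x0, importance, t, T, mask_emb):
    """Produce x_t from clean embeddings x0 of shape (seq_len, dim).

    Words are ranked by importance and divided into T buckets; the most
    important words fall into the earliest buckets and are therefore
    soft-masked (pulled toward a MASK embedding plus Gaussian noise) first.
    """
    seq_len, _ = x0.shape
    # rank 0 = most important word -> corrupted first
    ranks = torch.argsort(torch.argsort(importance, descending=True))
    buckets = (ranks.float() / seq_len * T).long()   # bucket index per word
    corrupted = buckets < t                          # words already masked by step t

    scale = sqrt_noise_level(t, T)
    noised = (1.0 - scale) * x0 + scale * mask_emb + scale * torch.randn_like(x0)
    return torch.where(corrupted.unsqueeze(-1), noised, x0)
```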
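The direct continuous-to-discrete mapping can be pictured as a linear head over the denoised hidden states trained with a weighted cross-entropy loss. The sketch below, including the optional tying of the head to a frozen PLM embedding matrix and the exact form of the step weighting, is a simplified illustration under those assumptions rather than the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

class TokenHead(torch.nn.Module):
    """Maps denoised hidden states directly to a categorical distribution
    over the vocabulary (replacing kNN rounding)."""
    def __init__(self, hidden_dim, vocab_size, tied_embedding=None):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, vocab_size, bias=False)
        if tied_embedding is not None:
            # Optionally reuse a frozen PLM embedding matrix (vocab_size, hidden_dim)
            # as the output projection, keeping the PLM weights fixed.
            self.proj.weight = torch.nn.Parameter(tied_embedding, requires_grad=False)

    def forward(self, h):                      # h: (batch, seq_len, hidden_dim)
        return self.proj(h)                    # logits: (batch, seq_len, vocab_size)

def diffusion_ce_loss(logits, target_ids, step_weight):
    """Weighted cross-entropy between the predicted token distribution at a
    diffusion step and the reference tokens; step_weight (scalar or per-token)
    up-weights tokens masked most recently in the forward process (assumed)."""
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        reduction="none",
    ).reshape(target_ids.shape)
    return (step_weight * ce).mean()
```

Tying the output projection to a frozen PLM embedding matrix keeps the number of trainable parameters small while remaining compatible with the PLM's high-dimensional embedding space.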
Implementation Details and Practical Application:
The Masked-Diffuse LM uses a Transformer network (e.g., 80M parameters) as the transition model that performs denoising. Input tokens are mapped to continuous embeddings by a learnable embedding layer or a PLM encoder. The forward process then generates the sequence of progressively noised latents by gradually adding soft-masked noise according to the computed word importances, and the reverse process iteratively denoises the final latent back to an approximation of the original embeddings, predicting token probabilities at each step with a linear layer that projects to the vocabulary (a simplified sampling loop is sketched below).
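The reverse process described above might look roughly like the following sampling loop; the transition-model interface and the simplified re-noising update are assumptions made for illustration, not the paper's exact transition.

```python
import torch

@torch.no_grad()
def sample(transition_model, token_head, seq_len, hidden_dim, T, device="cpu"):
    """Iteratively denoise a random latent back toward clean embeddings and
    read out discrete tokens at the end.

    transition_model(x_t, t) is assumed to return an estimate of the clean
    hidden states; token_head maps hidden states to vocabulary logits.
    """
    x_t = torch.randn(1, seq_len, hidden_dim, device=device)   # start from pure noise
    for t in range(T, 0, -1):
        x0_hat = transition_model(x_t, torch.tensor([t], device=device))
        # Re-noise the estimate to the previous step's noise level
        # (a simplified update; the paper's exact transition may differ).
        scale = ((t - 1) / T) ** 0.5
        x_t = (1.0 - scale) * x0_hat + scale * torch.randn_like(x0_hat)
    logits = token_head(x_t)                                    # (1, seq_len, vocab)
    return logits.argmax(dim=-1)                                # discrete token ids
```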
For controllable text generation, the paper follows a plug-and-play approach: during denoising, gradient updates are applied to the intermediate latent variables to jointly optimize the likelihood under the diffusion model and the objective of an external classifier trained to predict the target control attribute. A fluency regularization hyperparameter balances these two objectives (a sketch of such a guided update follows).
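A rough sketch of a classifier-guided update on an intermediate latent is shown below; the classifier interface, the form of the fluency term, and the hyperparameter values (lam, step_size, n_updates) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def guided_denoise_step(x_t, t, transition_model, classifier, target_attr,
                        lam=0.1, step_size=0.05, n_updates=3):
    """Plug-and-play control: nudge the intermediate latent x_t so that an
    external attribute classifier favors target_attr, while a fluency term
    keeps x_t close to the diffusion model's own denoised estimate."""
    x_t = x_t.detach().requires_grad_(True)
    for _ in range(n_updates):
        x0_hat = transition_model(x_t, t)
        control_loss = F.cross_entropy(classifier(x0_hat), target_attr)   # attribute objective
        fluency_loss = ((x_t - x0_hat.detach()) ** 2).mean()              # stay near the model's estimate
        grad, = torch.autograd.grad(control_loss + lam * fluency_loss, x_t)
        x_t = (x_t - step_size * grad).detach().requires_grad_(True)      # gradient update on the latent
    return x_t.detach()
```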
The model was evaluated on the E2E dataset [W17-5525] across five controlled generation tasks: Semantic Content, Parts-of-speech, Syntax Tree, Syntax Spans, and Length.
Key Findings and Practical Implications:
- Improved Performance: Masked-Diffuse LM consistently outperforms previous diffusion-based and non-diffusion baselines (PPLM [Dathathri2020Plug], FUDGE [yang-klein-2021-fudge], Diffusion-LM (Li et al., 2022)) on various controlled generation tasks (Table 1), measured by task accuracy and generation fluency (perplexity). The improvements are particularly notable when combined with BERT, where Diffusion-LM saw performance drops, highlighting Masked-Diffuse LM's better compatibility with high-dimensional PLM embeddings.
- Enhanced Efficiency: Masked-Diffuse LM demonstrates significantly lower training and inference times compared to Diffusion-LM (Table 2). This is attributed to the more stable noise process and the efficient cross-entropy objective, which avoids the costly kNN search. This makes the model more practical for real-world deployment and training.
- Effective Noise Strategy: Ablation studies (Table 4) confirm that the linguistic-informed soft-masking strategy (combining Entropy and TF-IDF) yields better performance than Gaussian noise or random masking, validating the importance of incorporating linguistic structure into the noise process.
- Superior Objective Function: Replacing the embedding-space distance objective used in prior work with the cross-entropy loss improves performance, especially when integrating BERT (Table 5). This indicates that directly predicting discrete tokens is a more effective way to handle the discrete nature of text within a continuous diffusion framework.
- Easy-First Generation: Case studies (Table 6) qualitatively show that the denoising process tends to generate simpler words first, followed by more informative ones, aligning with the design of the forward process and potentially contributing to better generation quality.
Overall, Masked-Diffuse LM offers a more efficient and effective approach to applying diffusion models to text generation, particularly for controlled generation tasks. Its key innovations lie in adapting the diffusion process to the discrete nature of language through structured, linguistically informed noise and a direct prediction objective, enabling better performance, lower cost, and seamless integration with powerful pre-trained language models. The code is available at https://github.com/amazon-science/masked-diffusion-lm.