Gradient-Based Constrained Sampling from Language Models
(2205.12558v2)
Published 25 May 2022 in cs.CL
Abstract: Large pretrained language models (LMs) generate fluent text but are notoriously hard to controllably sample from. In this work, we study constrained sampling from such models: generating text that satisfies user-defined constraints, while maintaining fluency and the model's performance in a downstream task. We propose MuCoLa -- a sampling procedure that combines the log-likelihood of the LM with arbitrary (differentiable) constraints in a single energy function, and then generates samples in a non-autoregressive manner. Specifically, it initializes the entire output sequence with noise and follows a Markov chain defined by Langevin Dynamics using the gradients of the energy function. We evaluate MuCoLa on text generation with soft and hard constraints as well as their combinations, obtaining significant improvements over competitive baselines for toxicity avoidance, sentiment control, and keyword-guided generation.
The paper presents MUCOLA, which employs Langevin Dynamics to sample in embedding space for constrained text generation.
It formulates the problem using a Lagrangian energy function that balances LM likelihood with multiple differentiable constraints.
Experiments show improved toxicity avoidance, sentiment control, and keyword-guided generation while preserving fluency and diversity.
This paper introduces MUCOLA (Multiple Constraints Sampling from LLMs using Langevin Dynamics), a non-autoregressive algorithm for generating text from language models (LMs) that satisfies user-defined constraints while maintaining fluency and fidelity to the original LM distribution.
The core problem addressed is the difficulty in controlling large pretrained LMs like GPT-2 to generate text with specific properties (e.g., non-toxic, positive sentiment, containing certain keywords) without compromising the quality of the generated text. Autoregressive methods often struggle with global constraints and can alter the underlying LM distribution.
MUCOLA tackles this by framing constrained generation as sampling from an energy function that combines the LM's log-likelihood with differentiable constraint functions. Instead of optimizing for a single best output, it uses Langevin Dynamics, a gradient-based Markov Chain Monte Carlo (MCMC) method, to draw diverse samples.
Key Technical Contributions and Implementation Details:
Sampling in Embedding Space: Instead of representing tokens as distributions over the vocabulary ($\mathbb{R}^{|V|}$), MUCOLA operates directly on token embeddings ($\mathbb{R}^d$, where $d \ll |V|$).
An output sequence of length $L$ is represented as $\tilde{e} = \{e_1, \ldots, e_L\}$, where each $e_n \in E$ (the LM's embedding table).
This significantly reduces the number of parameters from $L \times |V|$ to $L \times d$, allowing for much longer sequences to be processed within GPU memory constraints.
It provides a natural way to define certain constraints, particularly hard keyword constraints.
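As a concrete illustration, here is a minimal sketch (not the authors' released code) of this embedding-space setup with a Hugging Face GPT-2 model; the sequence length and the `project_to_table` helper are illustrative choices.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-large")
E = model.get_input_embeddings().weight.detach()  # |V| x d embedding table (~50k x 1280)
vocab_size, d = E.shape                           # d << |V|

# Represent a length-L output directly as L x d soft embeddings initialized with noise,
# instead of L x |V| token distributions.
seq_len = 20
e_tilde = torch.randn(seq_len, d, requires_grad=True)

def project_to_table(e_soft, E):
    """Snap each soft embedding to its nearest entry in the LM's table (Euclidean)."""
    dists = torch.cdist(e_soft, E)       # L x |V| pairwise distances
    nearest = dists.argmin(dim=-1)       # one token id per position
    return E[nearest], nearest
```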
Langevin Dynamics: The sampling process iteratively updates the sequence embeddings $\tilde{e}$ using gradient descent steps modified with Gaussian noise:

$$\tilde{e}^{t} = \mathrm{Proj}_{E}\!\left(\tilde{e}^{t-1} - \eta \nabla_{\tilde{e}} \mathcal{E}(\tilde{e}^{t-1}) + \sqrt{2\eta\beta}\, z^{t}\right)$$

* $\mathcal{E}(\tilde{e})$ is the energy function.
* $\eta$ is the step size.
* $\beta$ controls the noise variance (annealed over time).
* $z^{t} \sim \mathcal{N}(0, I_d)$ is Gaussian noise.
* $\mathrm{Proj}_{E}(\cdot)$ projects the updated embedding back to the nearest embedding in the LM's table $E$ using Euclidean distance, preventing adversarial solutions.
* The noise term allows the process to explore the solution space and escape local minima, generating diverse samples near the energy minimum.
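A hedged sketch of one such update, assuming an `energy_fn` that maps the soft embeddings to a scalar (the negative LM log-likelihood plus constraint terms) and reusing the `project_to_table` helper from the sketch above; in practice $\beta$ would be annealed across iterations rather than held fixed as here.

```python
import math
import torch

def langevin_step(e_tilde, energy_fn, E, eta=0.1, beta=1.0):
    """One Langevin update: gradient step on the energy, Gaussian noise, then projection."""
    e_tilde = e_tilde.detach().requires_grad_(True)
    energy = energy_fn(e_tilde)                       # scalar energy E(e_tilde)
    grad, = torch.autograd.grad(energy, e_tilde)      # gradient w.r.t. the soft embeddings
    noise = torch.randn_like(e_tilde)                 # z_t ~ N(0, I)
    updated = e_tilde - eta * grad + math.sqrt(2 * eta * beta) * noise
    with torch.no_grad():
        projected, token_ids = project_to_table(updated, E)  # stay on the embedding table
    return projected, token_ids
```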
Lagrangian Energy Function: The energy function $\mathcal{E}$ is defined using a Lagrangian formulation to handle multiple constraints $\{f_1, \ldots, f_C\}$ with thresholds $\{\epsilon_1, \ldots, \epsilon_C\}$:

$$\mathcal{E}(\tilde{e}) = -\log P(\tilde{e} \mid x) - \sum_{i=1}^{C} \lambda_i \left(\epsilon_i - f_i([x], \tilde{e})\right)$$
* The goal is to sample $y \sim P(y \mid x)$ subject to $f_i([x], y) \le \epsilon_i$ for all $i$.
* $\lambda_i \ge 0$ are Lagrangian multipliers, dynamically updated via gradient ascent: $\lambda_i^{t} = \max\!\left(0, \lambda_i^{t-1} + \alpha \nabla_{\lambda_i} \mathcal{E}\right)$.
* This avoids manual weight tuning for constraints. If a constraint $f_i \le \epsilon_i$ is satisfied, $\epsilon_i - f_i \ge 0$, and $\lambda_i$ tends towards 0. If violated, $\epsilon_i - f_i < 0$, and $\lambda_i$ increases, penalizing the violation more heavily.
* Crucially, when all constraints are satisfied, $\mathcal{E}(\tilde{e}) \approx -\log P(\tilde{e} \mid x)$, meaning the sampling process targets the original LM distribution.
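These pieces can be combined as in the following sketch, which assumes the negative log-likelihood and constraint values have already been computed from the shared embeddings; the ascent rate `alpha` is an illustrative choice, not a value from the paper.

```python
def lagrangian_energy(neg_log_prob, constraint_vals, thresholds, lambdas):
    """E = -log P(e|x) - sum_i lambda_i * (eps_i - f_i), as in the formulation above."""
    energy = neg_log_prob
    for f_i, eps_i, lam_i in zip(constraint_vals, thresholds, lambdas):
        energy = energy - lam_i * (eps_i - f_i)
    return energy

def update_lambdas(lambdas, constraint_vals, thresholds, alpha=0.5):
    """Gradient ascent on each multiplier: dE/dlambda_i = f_i - eps_i, clipped at zero."""
    return [
        max(0.0, lam_i + alpha * (float(f_i) - eps_i))
        for lam_i, f_i, eps_i in zip(lambdas, constraint_vals, thresholds)
    ]
```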
Constraint Implementation: Constraints $f_i$ must be differentiable functions of the embeddings $\tilde{e}$ and share the LM's embedding table $E$.
Soft Constraints (Classifiers/LMs): Auxiliary models (e.g., RoBERTa classifiers, GPT-2 based generative classifiers) are fine-tuned using the target LM's embedding table (potentially frozen, with a projection layer if dimensions differ). Constraints are defined on probabilities, e.g., $p_{\text{TOXIC}}(\tilde{e}) < \epsilon$ or $\log p(\tilde{e} \mid \text{LABEL}_1) > \log p(\tilde{e} \mid \text{LABEL}_0)$ (see the classifier-constraint sketch after this list). Prompt-based classification without fine-tuning is also explored.
Hard Constraints (Keywords): A differentiable distance function $d(w, \tilde{e})$ measures the minimum Euclidean distance between the target keyword embedding $e_w$ and any embedding $e_n$ in the sequence $\tilde{e}$, using a Gumbel-Softmax trick for differentiable selection. The constraint is $d(w, \tilde{e}) < \epsilon_w$, where the threshold $\epsilon_w \approx -\log \pi_{w,w}$ (related to the probability of the keyword under its own embedding's distribution) aims to guarantee the keyword's presence. This extends to phrases via convolutions.
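For the soft-constraint case above, a minimal sketch of a differentiable toxicity constraint; `tox_classifier` is a hypothetical sequence classifier (e.g., RoBERTa-style) whose input embeddings have been tied or projected to the LM's table as described, and which accepts `inputs_embeds` in the Hugging Face style.

```python
import torch

def toxicity_constraint(e_tilde, tox_classifier, toxic_label=1):
    """f(e_tilde) = p_TOXIC(e_tilde); the constraint keeps this below a threshold such as 0.01."""
    out = tox_classifier(inputs_embeds=e_tilde.unsqueeze(0))     # batch of one sequence
    return torch.softmax(out.logits, dim=-1)[0, toxic_label]     # differentiable probability
```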
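And for the hard keyword constraint, a simplified sketch of the Gumbel-Softmax distance; the paper's exact construction (including the threshold derived from the keyword's own softmax probability and the convolution-based extension to phrases) is more involved than this.

```python
import torch
import torch.nn.functional as F

def keyword_distance(e_tilde, e_w, tau=1.0):
    """Soft-minimum Euclidean distance between the keyword embedding e_w and any position
    in e_tilde; a Gumbel-Softmax over (negative) distances keeps position selection differentiable."""
    dists = torch.linalg.norm(e_tilde - e_w.unsqueeze(0), dim=-1)  # one distance per position (L,)
    weights = F.gumbel_softmax(-dists, tau=tau, hard=False)        # soft one-hot over positions
    return (weights * dists).sum()                                 # approx. min_n ||e_n - e_w||
```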
Experiments and Applications:
MUCOLA was evaluated on several tasks using GPT2-Large/XL and BART-Large:
Toxicity Avoidance: Using a RoBERTa classifier constraint ($p_{\text{TOXIC}} < 0.01$), MUCOLA matched or outperformed baselines (DAPT, FUDGE, GeDi, DExperts) in reducing toxicity while achieving perplexity closer to unconstrained generation and maintaining diversity. Human evaluation showed similar toxicity/fluency to DExperts but better topicality.
Sentiment Control: Using various classifiers (discriminative, generative, prompt-based) trained on SST-2 and Yelp, MUCOLA generally achieved good sentiment control with perplexity closer to the base LM than most baselines. Combining two classifiers (MUCOLA-TWO-DISC) performed strongly. Prompt-based constraints showed promise. Some degradation (repetition) was noted for very long sequences (length 50).
Keyword-Guided Generation (COMMONGEN, ROC): Using the hard keyword distance constraint, MUCOLA achieved state-of-the-art coverage (percentage of required keywords included) and significantly better perplexity than baselines like COLD and NeuroLogic, though human fluency scores were slightly lower than the best baseline.
Terminology-Constrained Machine Translation: Using a MarianMT model, MUCOLA achieved perfect terminology coverage while maintaining the BLEU score of unconstrained translation.
Entity-Constrained Summarization: Preliminary results using BART showed promise in forcing specific entities into summaries.
Analysis:
Speed/Memory: MUCOLA's embedding-space operations allow processing significantly longer sequences (e.g., 500-1000 tokens) compared to vocabulary-space methods (limited to ~20-50 tokens) on standard GPUs. However, it's still slower (15-20x) than autoregressive decoding due to iterative updates.
Diversity: Langevin Dynamics noise, not random initialization, is the main driver of sample diversity.
Controllability: The threshold $\epsilon_i$ provides interpretable control over constraint strength. Lowering $\epsilon$ tightens control without hurting fluency (perplexity) but might decrease diversity.
Compatibility: The framework struggles when constraints are fundamentally incompatible (e.g., generating positive text containing strongly negative keywords).
Limitations:
Slower than autoregressive methods.
Requires pre-specifying output length (though standard mitigation techniques exist).
Can inherit LM degeneracy issues (e.g., repetition) for long sequences.
In conclusion, MUCOLA provides a flexible and effective framework for constrained sampling from large LMs. By operating in the embedding space and using Langevin Dynamics with a Lagrangian objective, it generates diverse, fluent text that adheres to multiple soft or hard constraints while staying close to the original LM's distribution.