
Gradient-Based Constrained Sampling from Language Models (2205.12558v2)

Published 25 May 2022 in cs.CL

Abstract: Large pretrained language models (LMs) generate fluent text but are notoriously hard to controllably sample from. In this work, we study constrained sampling from such LMs: generating text that satisfies user-defined constraints, while maintaining fluency and the model's performance in a downstream task. We propose MuCoLa -- a sampling procedure that combines the log-likelihood of the LM with arbitrary (differentiable) constraints in a single energy function, and then generates samples in a non-autoregressive manner. Specifically, it initializes the entire output sequence with noise and follows a Markov chain defined by Langevin Dynamics using the gradients of the energy function. We evaluate MuCoLa on text generation with soft and hard constraints as well as their combinations, obtaining significant improvements over competitive baselines for toxicity avoidance, sentiment control, and keyword-guided generation.

Citations (42)

Summary

  • The paper presents MUCOLA, which employs Langevin Dynamics to sample in embedding space for constrained text generation.
  • It formulates the problem using a Lagrangian energy function that balances LM likelihood with multiple differentiable constraints.
  • Experiments show improved toxicity avoidance, sentiment control, and keyword-guided generation while preserving fluency and diversity.

This paper introduces MUCOLA (Multiple Constraints Sampling from Language Models using Langevin Dynamics), a non-autoregressive algorithm for generating text from language models (LMs) that satisfies user-defined constraints while maintaining fluency and fidelity to the original LM distribution.

The core problem addressed is the difficulty in controlling large pretrained LMs like GPT-2 to generate text with specific properties (e.g., non-toxic, positive sentiment, containing certain keywords) without compromising the quality of the generated text. Autoregressive methods often struggle with global constraints and can alter the underlying LM distribution.

MUCOLA tackles this by framing constrained generation as sampling from an energy function that combines the LM's log-likelihood with differentiable constraint functions. Instead of optimizing for a single best output, it uses Langevin Dynamics, a gradient-based Markov Chain Monte Carlo (MCMC) method, to draw diverse samples.

Key Technical Contributions and Implementation Details:

  1. Sampling in Embedding Space: Instead of representing tokens as distributions over the vocabulary ($\mathbb{R}^{|V|}$), MUCOLA operates directly on token embeddings ($\mathbb{R}^d$, where $d \ll |V|$).
    • An output sequence of length $L$ is represented as $\tilde{e} = \{e_1, \dots, e_L\}$, where each $e_n \in E$ (the LM's embedding table).
    • This significantly reduces the number of parameters from $L \times |V|$ to $L \times d$, allowing for much longer sequences to be processed within GPU memory constraints.
    • It provides a natural way to define certain constraints, particularly hard keyword constraints.
  2. Langevin Dynamics: The sampling process iteratively updates the sequence embeddings $\tilde{e}$ using gradient descent steps modified with Gaussian noise (a code sketch follows the notation below):

    $$\tilde{e}^t = \text{Proj}_E\left(\tilde{e}^{t-1} - \eta \nabla_{\tilde{e}} \mathcal{E}(\tilde{e}^{t-1}) + \sqrt{2\eta\beta}\, z^t\right)$$

    • $\mathcal{E}(\tilde{e})$ is the energy function.
    • $\eta$ is the step size.
    • $\beta$ controls the noise variance (annealed over time).
    • $z^t \sim \mathcal{N}(0, I_d)$ is Gaussian noise.
    • $\text{Proj}_E(\cdot)$ projects each updated embedding back to the nearest embedding in the LM's table $E$ under Euclidean distance, preventing adversarial solutions.
    • The noise term allows the process to explore the solution space and escape local minima, generating diverse samples near the energy minimum.
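
A minimal PyTorch sketch of a single update step is given below; the function names (`energy_fn`, `embed_table`) and hyperparameter defaults are illustrative assumptions, not the authors' released implementation:

```python
import torch

def langevin_step(e_tilde, energy_fn, embed_table, step_size=0.1, beta=1.0):
    """One noisy gradient step on the embeddings, followed by projection onto E.

    e_tilde:     (L, d) current sequence of token embeddings
    energy_fn:   callable mapping (L, d) embeddings to a scalar energy E(e~)
    embed_table: (|V|, d) the LM's embedding matrix E
    """
    e_tilde = e_tilde.detach().requires_grad_(True)
    grad, = torch.autograd.grad(energy_fn(e_tilde), e_tilde)

    noise = torch.randn_like(e_tilde)  # z^t ~ N(0, I_d)
    e_new = e_tilde - step_size * grad + (2 * step_size * beta) ** 0.5 * noise

    # Proj_E: snap each updated vector to its nearest row of E (Euclidean distance).
    token_ids = torch.cdist(e_new, embed_table).argmin(dim=-1)  # (L,)
    return embed_table[token_ids].detach(), token_ids
```

Repeating this step while annealing `beta` toward zero traces out the Markov chain described above; the returned `token_ids` are the current discrete sample.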

  3. Lagrangian Energy Function: The energy function $\mathcal{E}$ is defined using a Lagrangian formulation to handle multiple constraints $\{f_1, \dots, f_C\}$ with thresholds $\{\epsilon_1, \dots, \epsilon_C\}$:

    $$\mathcal{E}(\tilde{e}) = -\log P(\tilde{e}|x) - \sum_{i=1}^C \lambda_i \left(\epsilon_i - f_i([x], \tilde{e})\right)$$

    • The goal is to sample $y \sim P(y|x)$ subject to $f_i([x], y) \le \epsilon_i$ for all $i$.
    • $\lambda_i \ge 0$ are Lagrangian multipliers, dynamically updated via gradient ascent: $\lambda_i^t = \max(0, \lambda_i^{t-1} + \alpha \nabla_{\lambda_i} \mathcal{E})$ (sketched in code below).
    • This avoids manual weight tuning for constraints. If a constraint $f_i \le \epsilon_i$ is satisfied, then $\epsilon_i - f_i \ge 0$ and $\lambda_i$ tends toward 0; if it is violated, then $\epsilon_i - f_i < 0$ and $\lambda_i$ increases, penalizing the violation more heavily.
    • Crucially, when all constraints are satisfied, $\mathcal{E}(\tilde{e}) \approx -\log P(\tilde{e}|x)$, meaning the sampling process targets the original LM distribution.
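
A hedged sketch of how the Lagrangian energy and multiplier updates could be wired together; `neg_log_p_fn`, `constraint_fns`, `epsilons`, and `alpha` are illustrative names rather than the paper's code:

```python
import torch

def energy(e_tilde, neg_log_p_fn, constraint_fns, epsilons, lambdas):
    """E(e~) = -log P(e~|x) - sum_i lambda_i * (eps_i - f_i(e~))."""
    e = neg_log_p_fn(e_tilde)
    for f_i, eps_i, lam_i in zip(constraint_fns, epsilons, lambdas):
        e = e - lam_i * (eps_i - f_i(e_tilde))
    return e

def update_lambdas(e_tilde, constraint_fns, epsilons, lambdas, alpha=0.5):
    """Gradient ascent on each multiplier, clipped at zero.

    dE/dlambda_i = f_i(e~) - eps_i, so a violated constraint (f_i > eps_i)
    pushes lambda_i up, while a satisfied one lets it decay toward zero.
    """
    with torch.no_grad():
        return [
            torch.clamp(lam_i + alpha * (f_i(e_tilde) - eps_i), min=0.0)
            for f_i, eps_i, lam_i in zip(constraint_fns, epsilons, lambdas)
        ]
```

In the Langevin sketch above, `energy_fn` would be a closure over this `energy` with the current `lambdas` held fixed, and `update_lambdas` would be interleaved with the embedding updates.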

  4. Constraint Implementation: Constraints $f_i$ must be differentiable functions of the embeddings $\tilde{e}$ and share the LM's embedding table $E$.
    • Soft Constraints (Classifiers/LMs): Auxiliary models (e.g., RoBERTa classifiers, GPT-2 based generative classifiers) are fine-tuned using the target LM's embedding table (potentially frozen, with a projection layer if dimensions differ). Constraints are defined on probabilities, e.g., $p_{\text{TOXIC}}(\tilde{e}) < \epsilon$ or $\log p(\tilde{e}|\text{LABEL}_1) > \log p(\tilde{e}|\text{LABEL}_0)$. Prompt-based classification without fine-tuning is also explored.
    • Hard Constraints (Keywords): A differentiable distance function $d(w, \tilde{e})$ measures the minimum Euclidean distance between the target keyword embedding $e_w$ and any embedding $e_n$ in the sequence $\tilde{e}$, using a Gumbel-Softmax trick for differentiable selection. The constraint is $d(w, \tilde{e}) < \epsilon_w$, where the threshold $\epsilon_w \approx -\log \pi_{w,w}$ (related to the probability of the keyword under its own embedding's distribution) aims to guarantee the keyword's presence. This extends to phrases via convolutions (a code sketch of the distance follows this list).
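
One way to realize the keyword distance is as a differentiable selection over sequence positions; the sketch below uses PyTorch's `gumbel_softmax` over negative distances with an illustrative temperature `tau`, and is an assumption about the exact form rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def keyword_distance(e_tilde, e_w, tau=0.1):
    """Differentiable proxy for min_n ||e_n - e_w|| over sequence positions.

    e_tilde: (L, d) sequence embeddings
    e_w:     (d,)   embedding of the target keyword w
    """
    dists = ((e_tilde - e_w) ** 2).sum(dim=-1).sqrt()  # (L,) Euclidean distances
    # Gumbel-Softmax selection of (approximately) the closest position;
    # hard=True gives a one-hot choice with straight-through gradients.
    weights = F.gumbel_softmax(-dists, tau=tau, hard=True)
    return (weights * dists).sum()
```

The resulting constraint $d(w, \tilde{e}) < \epsilon_w$ plugs into the Lagrangian energy like any other $f_i$; phrase constraints can extend this by scoring consecutive positions (e.g., via a convolution over the per-position distances).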

Experiments and Applications:

MUCOLA was evaluated on several tasks using GPT2-Large/XL and BART-Large:

  1. Toxicity Avoidance: Using a RoBERTa classifier constraint ($p_{\text{TOXIC}} < 0.01$), MUCOLA matched or outperformed baselines (DAPT, FUDGE, GeDi, DExperts) in reducing toxicity while achieving perplexity closer to unconstrained generation and maintaining diversity. Human evaluation showed similar toxicity/fluency to DExperts but better topicality.
  2. Sentiment Control: Using various classifiers (discriminative, generative, prompt-based) trained on SST-2 and Yelp, MUCOLA generally achieved good sentiment control with perplexity closer to the base LM than most baselines. Combining two classifiers (MUCOLA-TWO-DISC) performed strongly. Prompt-based constraints showed promise. Some degradation (repetition) was noted for very long sequences (length 50).
  3. Keyword-Guided Generation (COMMONGEN, ROC): Using the hard keyword distance constraint, MUCOLA achieved state-of-the-art coverage (percentage of required keywords included) and significantly better perplexity than baselines like COLD and Neurologic, though human fluency scores were slightly lower than the best baseline.
  4. Terminology-Constrained Machine Translation: Using a MarianMT model, MUCOLA achieved perfect terminology coverage while maintaining the BLEU score of unconstrained translation.
  5. Entity-Constrained Summarization: Preliminary results using BART showed promise in forcing specific entities into summaries.

Analysis:

  • Speed/Memory: MUCOLA's embedding-space operations allow it to process significantly longer sequences (e.g., 500-1000 tokens) than vocabulary-space methods (limited to ~20-50 tokens) on standard GPUs. However, it remains 15-20x slower than autoregressive decoding due to the iterative updates.
  • Diversity: Langevin Dynamics noise, not random initialization, is the main driver of sample diversity.
  • Controllability: The threshold $\epsilon_i$ provides interpretable control over constraint strength. Lowering $\epsilon$ tightens control without hurting fluency (perplexity) but might decrease diversity.
  • Compatibility: The framework struggles when constraints are fundamentally incompatible (e.g., generating positive text containing strongly negative keywords).

Limitations:

  • Slower than autoregressive methods.
  • Requires pre-specifying output length (though standard mitigation techniques exist).
  • Can inherit LM degeneracy issues (e.g., repetition) for long sequences.

In conclusion, MUCOLA provides a flexible and effective framework for constrained sampling from large LMs. By operating in the embedding space and using Langevin Dynamics with a Lagrangian objective, it generates diverse, fluent text that adheres to multiple soft or hard constraints while staying close to the original LM's distribution.