An Analysis of "Controlled Decoding from Language Models"
The paper "Controlled Decoding from LLMs" by Sidharth Mudgal et al. presents a modular approach to controlling LLM (LM) outputs using a method termed Controlled Decoding (CD). This approach is designed to guide LLM outputs towards higher reward outcomes in a flexible and tunable manner, addressing the practical challenges in aligning LM outputs with human preferences.
Key Contributions
The primary contributions of the paper include:
- Formalization of Controlled Decoding: The authors propose a novel framework, Controlled Decoding (CD), for solving KL-regularized reinforcement learning (RL) objectives at inference time. CD employs a prefix scorer: a module that learns a value function estimating the expected reward of partially decoded sequences, and uses it during inference to guide the output of a frozen LLM (a minimal code sketch of this token-wise control step follows this list).
- Variants of Controlled Decoding: Two specific variants of Controlled Decoding are introduced:
- CD-FUDGE: Adapts the FUDGE method to the KL-regularized RL problem by treating the reward as observed from sequences rolled out from the base model.
- CD-Q: A new method based on Q-learning, where the prefix scorer is trained to approximate the value function with an L2 loss against Bellman-style bootstrapped targets computed under the base LLM's next-token distribution.
- Blockwise Decoding Strategy: They present a blockwise decoding method that bridges sequence-level best-of-K strategies and token-wise RL control. This variant offers a middle ground that balances reward against KL divergence while improving efficiency and reducing latency.
- Empirical Validation: Extensive experiments demonstrate the efficacy of CD and its variants on tasks such as dialog response length optimization and improving dialog helpfulness and harmlessness. The experiments show that CD, particularly the blockwise version, outperforms traditional RL methods like PPO and newer methods like DPO and IPO in various benchmarks.
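To make the token-wise mechanism concrete, here is a minimal sketch (referenced in the first item above) of how a prefix scorer can steer a frozen base model at decoding time: the scorer's per-token value estimates, scaled by a trade-off parameter, are added to the base model's log-probabilities before sampling. The tensor names (`base_logits`, `prefix_values`) and the single-step interface are illustrative assumptions, not the paper's code.

```python
import torch

def cd_token_step(base_logits: torch.Tensor, prefix_values: torch.Tensor, lam: float) -> torch.Tensor:
    """One token-wise Controlled Decoding step (illustrative sketch).

    base_logits   : [vocab] next-token logits of the frozen base LM.
    prefix_values : [vocab] prefix-scorer value estimates for each candidate
                    next token (expected reward of the extended prefix).
    lam           : strength of control; larger values favor reward over
                    staying close to the base model.
    """
    # Tilt the base distribution by the scaled value estimates, renormalize,
    # and sample the next token from the adjusted distribution.
    adjusted = torch.log_softmax(base_logits, dim=-1) + lam * prefix_values
    probs = torch.softmax(adjusted, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

In the paper's formulation, this amounts to sampling from the base model reweighted by the exponentiated, scaled prefix scores, which is what makes the control both tunable (via the scale) and modular (the base model stays frozen).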
Methodological Insights
KL-Regularized Reinforcement Learning: The paper frames alignment as maximizing a reward while controlling divergence from the pre-trained base model $p$. For a prompt-plus-partial-response prefix $[x, y]$, the token-wise objective for a decoding policy $\pi$ is

$$
J_\lambda(\pi; [x, y]) = \lambda\, \mathbb{E}_{z \sim \pi(\cdot \mid [x, y])}\big[A([x, y], z)\big] - D_{\mathrm{KL}}\big(\pi(\cdot \mid [x, y]) \,\|\, p(\cdot \mid [x, y])\big),
$$

where $A([x, y], z) = V^*([x, y, z]) - V^*([x, y])$ is the advantage of decoding token $z$ next (with $V^*$ the expected reward of completing the prefix under the base model), $D_{\mathrm{KL}}$ denotes the KL divergence, and $\lambda$ controls the trade-off. The proposed method finds an optimal decoding policy that maximizes this objective, balancing high reward against similarity to the base model.
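Consistent with this KL-regularized objective, the maximizing policy takes an exponential-tilting form (a sketch of the closed-form solution, with $p$ the frozen base model and $V^*$ the value function the prefix scorer approximates):

$$
\pi^*_\lambda(z \mid [x, y]) \;\propto\; p(z \mid [x, y])\,\exp\big(\lambda\, V^*([x, y, z])\big).
$$

Increasing $\lambda$ tilts sampling more aggressively toward high-value continuations, while $\lambda = 0$ recovers the base model exactly; this is the form the token-wise sampling sketch above implements.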
Prefix Scorer Training: Both CD-FUDGE and CD-Q train the prefix scorer to predict the value function; the key distinction is the training target. CD-FUDGE regresses on rewards observed at the end of sequences rolled out from the base model, whereas CD-Q bootstraps its value estimates with Bellman-style updates, akin to traditional Q-learning.
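A minimal sketch of the two training targets, contrasting the observed-reward regression with the bootstrapped one; the function names, tensor shapes, and terminal handling here are hypothetical stand-ins chosen to illustrate the distinction, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cd_fudge_loss(v_pred: torch.Tensor, observed_reward: torch.Tensor) -> torch.Tensor:
    """CD-FUDGE-style target (sketch): regress the prefix score toward the reward
    actually observed at the end of a sequence rolled out from the base model."""
    return F.mse_loss(v_pred, observed_reward)

def cd_q_loss(v_pred: torch.Tensor,
              base_next_token_probs: torch.Tensor,
              v_next_all_tokens: torch.Tensor,
              reward: torch.Tensor,
              is_terminal: torch.Tensor) -> torch.Tensor:
    """CD-Q-style target (sketch): a Bellman-style bootstrapped regression target.

    v_pred                : [batch]        scorer output for the current prefix.
    base_next_token_probs : [batch, vocab] base-LM distribution over the next token.
    v_next_all_tokens     : [batch, vocab] scorer estimates for each one-token extension.
    reward                : [batch]        terminal reward, used only where decoding ended.
    is_terminal           : [batch]        1.0 where the prefix already ends the sequence.
    """
    with torch.no_grad():  # the bootstrapped target is held fixed, as in Q-learning
        bootstrapped = (base_next_token_probs * v_next_all_tokens).sum(dim=-1)
        target = is_terminal * reward + (1.0 - is_terminal) * bootstrapped
    return F.mse_loss(v_pred, target)
```

The stop-gradient on the CD-Q target mirrors standard Q-learning practice: the bootstrapped value is treated as a fixed regression label while only the prediction for the current prefix receives gradients.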
Scalability and Efficiency: The blockwise decoding variant of CD is particularly noteworthy for retaining much of the benefit of sequence-level optimization (as in best-of-K) while remaining computationally efficient. By sampling short blocks of tokens from the base model and keeping the candidate the prefix scorer ranks highest, this approach achieves strong reward/KL trade-offs at reduced computational cost.
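The blockwise procedure can be sketched as a short loop, assuming hypothetical `base_model.sample`, `base_model.is_finished`, and `prefix_scorer.score` interfaces (stand-ins, not a real API): sample several block-length continuations from the frozen base model, keep the one the prefix scorer ranks highest, append it, and repeat.

```python
from typing import List

def blockwise_cd(base_model, prefix_scorer, context: List[int],
                 block_len: int = 32, num_candidates: int = 8, max_blocks: int = 8) -> List[int]:
    """Blockwise Controlled Decoding (illustrative sketch over token-id lists)."""
    for _ in range(max_blocks):
        # Sample several block-length continuations from the frozen base model.
        candidates = [base_model.sample(context, max_new_tokens=block_len)
                      for _ in range(num_candidates)]
        # Score each extended prefix and keep the highest-scoring block.
        scores = [prefix_scorer.score(context + block) for block in candidates]
        best = max(range(num_candidates), key=lambda i: scores[i])
        context = context + candidates[best]
        if base_model.is_finished(context):  # e.g. an end-of-sequence token appeared
            break
    return context
```

Compared with best-of-K over full sequences, selection happens at the block level, which is where the reported efficiency and latency gains come from.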
Experimental Results
The experimental evaluations showcase the significant advantages of the proposed CD methods:
- Dialog Response Length: Blockwise CD-Q performs strongly on response-length optimization, outperforming various RL methods and matching best-of-K while requiring fewer samples from the base model, making it more efficient.
- Helpfulness and Harmlessness (HH) Alignment: When applied to the task of improving dialog helpfulness and harmlessness, blockwise CD-Q again outperforms other methods, demonstrating significant increases in win rates over the base model with moderate KL divergence.
- Robustness and Flexibility: The results also highlight the flexibility of the CD framework in integrating multiple objectives. Furthermore, the prefix scorer trained on one base model (PaLM 2-XXS) performs well on another (PaLM 2-XS), indicating robustness to changes in the underlying generative model.
Theoretical and Practical Implications
The primary theoretical contribution of this paper lies in the formal connection established between controlled decoding and KL-regularized reinforcement learning. This connection provides a sound basis for developing modular and flexible alignment techniques that can be adapted to various generative tasks.
From a practical standpoint, CD offers a powerful tool for aligning LLM outputs with desired properties without the need for extensive retraining of the LLM itself. This modular approach facilitates quick adaptation to new tasks and preferences, making it highly valuable for real-world applications where flexibility and efficiency are critical.
Future Directions
The paper opens several promising avenues for future research:
- Enhanced Learning Objectives: Exploring sophisticated learning objectives beyond the L2 loss used for the prefix scorer, potentially leveraging Fisher information shaping or other divergences.
- Handling Noisy Rewards: Improving the robustness of CD-Q and CD-FUDGE in noisy reward environments, possibly through advanced value function learning methods tailored to handle reward uncertainty.
- Hybrid Techniques: Combining CD with other alignment methods (e.g., PPO, DPO) to further optimize the trade-offs between reward, KL divergence, and computational efficiency.
In conclusion, Controlled Decoding provides a significant step forward in the alignment of LLMs, offering a highly tunable and efficient approach that preserves the strengths of pre-trained models while ensuring high reward outcomes. This work is poised to have substantial impact on future developments in AI alignment and generative model control.