Overview of InfAlign: Inference-aware LLM Alignment
The paper presents InfAlign, a framework for aligning LLMs with the inference-time procedures they will face at deployment explicitly in mind. The authors identify a gap in current alignment methodology: models are trained and evaluated as if responses were sampled directly from the policy, even though inference-time algorithms such as Best-of-N sampling and controlled decoding are applied almost universally in practice. Ignoring these procedures during alignment can yield policies that look strong under standard evaluation yet are sub-optimal for how they are actually used. InfAlign aims to close this gap so that aligned models perform well under the conditions of real deployment.
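As a concrete illustration of such an inference-time procedure (this is a generic sketch, not code from the paper), Best-of-N sampling draws several candidate responses and keeps the one a reward model scores highest; the `generate` and `reward` callables below are hypothetical placeholders for a policy sampler and a reward model.

```python
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """Generic Best-of-N sampling: draw n candidate responses from the
    policy and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))


def worst_of_n(prompt: str,
               generate: Callable[[str], str],
               reward: Callable[[str, str], float],
               n: int = 8) -> str:
    """Worst-of-N jailbreaking is the adversarial mirror image: an
    attacker keeps the lowest-reward (least safe) of n samples."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return min(candidates, key=lambda y: reward(prompt, y))
```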
Alignment and Post-Hoc Procedure Discrepancy
LLM alignment is typically performed with KL-regularized reinforcement learning (KL-RL), in which the model is tuned to increase a learned reward while remaining close to a reference (base) model, and alignment success is conventionally measured by the win rate of the aligned model against that base model. As the paper points out, this evaluation ignores how modern LLMs are actually deployed: inference-time techniques routinely manipulate the sampling process to improve output quality or diversity, so the policy that is evaluated is not the policy that ultimately serves responses.
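In standard notation (ours, not the paper's verbatim), with r a reward model, pi_ref the base policy, mu the prompt distribution, and beta the regularization strength, the two pieces of this setup can be written as:

```latex
% KL-regularized alignment objective: increase reward while staying
% close to the reference policy pi_ref (beta controls the trade-off).
\max_{\pi}\;
  \mathbb{E}_{x \sim \mu,\; y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]
  \;-\; \beta \,\mathbb{E}_{x \sim \mu}\big[\,
    \mathrm{KL}\big( \pi(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
  \,\big]

% Standard win rate of the aligned policy pi against the base policy,
% with ties broken evenly (the quantity conventionally reported).
W(\pi \succ \pi_{\mathrm{ref}}) \;=\;
  \mathbb{E}_{x \sim \mu,\; y \sim \pi,\; y' \sim \pi_{\mathrm{ref}}}\big[\,
    \mathbf{1}\{ y \text{ preferred over } y' \}
    + \tfrac{1}{2}\, \mathbf{1}\{ \text{tie} \}
  \,\big]
```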
InfAlign Framework
The core contribution of the paper is the introduction and formal analysis of the InfAlign framework. The authors show, theoretically and empirically, that optimizing the standard win rate with conventional RLHF (Reinforcement Learning from Human Feedback) objectives can be suboptimal once an inference-time procedure is applied on top of the trained policy. They instead propose inference-aware alignment, encapsulated in InfAlign, in which the reward used during training is explicitly modified to reflect the inference-time objective. The theoretical foundation is a characterization of reward transformations that account for the post-hoc inference procedure, ensuring the aligned policy is optimal under the conditions it will actually face.
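Using the same notation, and writing T(pi) for the distribution over responses obtained by running an inference-time procedure T (such as Best-of-N) on top of a policy pi, the inference-aware objective can be paraphrased as follows; this is our paraphrase of the setup, not the paper's exact statement:

```latex
% Inference-aware alignment (paraphrased): what matters at deployment is
% the win rate *after* the inference-time procedure T is applied to both
% policies, while the KL term still regularizes the underlying policy pi.
\max_{\pi}\;
  W\big( T(\pi) \succ T(\pi_{\mathrm{ref}}) \big)
  \;-\; \beta \,\mathbb{E}_{x \sim \mu}\big[\,
    \mathrm{KL}\big( \pi(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
  \,\big]
```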
Calibrate-and-Transform RL (CTRL)
A practical instantiation of the InfAlign framework is provided through the KL-regularized Calibrate-and-Transform RL (CTRL) algorithm. CTRL first calibrates the reward model's scores against the base policy's response distribution, then applies a procedure-specific transformation to the calibrated reward, and finally solves the standard KL-regularized RL problem with this transformed reward. The paper instantiates the transformation for two inference-time procedures: Best-of-N sampling (for helpfulness) and Worst-of-N jailbreaking (for safety). Empirically, on the Anthropic helpfulness and harmlessness dialog benchmarks, CTRL is reported to outperform state-of-the-art alignment baselines by 8-12% and 4-9% on inference-time win rates for the two procedures, respectively.
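The following minimal sketch illustrates the calibrate-and-transform idea. The quantile-based calibration follows the paper's description only at a high level, and the specific exponential transform, the parameter t, and all function names are illustrative assumptions rather than the paper's exact recipe; in practice the resulting reward would simply be fed to an off-the-shelf KL-regularized RL trainer.

```python
import numpy as np


def calibrate_reward(raw_score: float, ref_scores: np.ndarray) -> float:
    """Calibration step (sketch): map a raw reward to its empirical
    quantile under responses drawn from the reference policy for the
    same prompt, so calibrated rewards lie in [0, 1] regardless of the
    reward model's scale."""
    return float(np.mean(ref_scores <= raw_score))


def transform_for_best_of_n(calibrated: float, t: float = 4.0) -> float:
    """Transformation step (sketch): apply a monotone, procedure-specific
    transformation to the calibrated reward. The exponential-style
    transform here is an illustration that emphasizes the upper tail;
    the paper derives and tunes its own family of transformations for
    Best-of-N sampling and Worst-of-N jailbreaking."""
    return float(np.exp(t * calibrated))


def ctrl_reward(prompt_id: int,
                raw_score: float,
                ref_scores_by_prompt: dict) -> float:
    """Calibrate-and-transform reward used as the training signal for an
    otherwise unchanged KL-regularized RL trainer."""
    ref_scores = ref_scores_by_prompt[prompt_id]
    return transform_for_best_of_n(calibrate_reward(raw_score, ref_scores))
```

The appeal of this design is that only the reward signal changes; the KL-regularized RL machinery itself is left untouched.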
Practical and Theoretical Implications
The implications of this research are two-fold. Practically, it allows training-time evaluation to better predict how a model will perform after deployment, yielding models better suited to the tasks they are served for. Theoretically, it offers a new perspective on the coupling between training and inference-time objectives, opening avenues for future work on inference-aware calibrations and their broader applications. In addition, by taking a transformation-centric view of the reward, the framework provides a mechanism that generalizes across distinct inference-time strategies.
Conclusion and Future Directions
This paper represents an important step toward reconciling the divergent objectives of training-time alignment and inference-time operation. By demonstrating that alignment methods lacking inference-awareness can be suboptimal, the authors make a compelling case for frameworks like InfAlign. Future research may refine the calibration and transformation steps for more complex procedures, particularly in multi-objective alignment settings where trade-offs between different metrics must be carefully managed. Moreover, exploring dynamical-systems-based methodologies could yield deeper insight into inference-time transformations, potentially enhancing the capabilities of existing alignment frameworks.