Pretraining Language Models with Human Preferences: Overview
The paper "Pretraining Language Models with Human Preferences" explores methods for training language models (LMs) that inherently generate outputs aligned with human preferences. Instead of the conventional approach, in which alignment is considered only during post-training finetuning, the paper investigates aligning LMs during the pretraining phase itself. It does so by modifying the standard pretraining objective, evaluating five distinct strategies for pretraining LMs with human feedback, and analyzing their performance on three tasks: reducing toxicity, preventing leakage of personally identifiable information (PII), and generating code that complies with the PEP8 style guide.
The central claim of the paper is that incorporating human feedback into pretraining can significantly reduce undesirable outputs without compromising the models' core capabilities, challenging the existing paradigm of aligning LMs only during finetuning.
Methods
The authors propose and examine five objectives for pretraining LMs with human feedback:
- Conditional Training: Extends maximum likelihood estimation (MLE) by prepending a control token to each training segment indicating whether the segment's human preference score clears a threshold; the model learns to generate text conditioned on these tokens (a minimal sketch follows this list).
- Dataset Filtering: Removes training documents whose human preference score falls below a specified threshold, then applies standard MLE pretraining to the remaining data.
- Unlikelihood Training: Keeps the MLE objective for preferred segments but adds an unlikelihood term that pushes down the probability of tokens in segments scoring below the threshold.
- Reward-Weighted Regression (RWR): Incorporates human preference scores directly into the training objective by weighting each segment's token log-likelihoods with the exponentiated reward of that segment.
- Advantage-Weighted Regression (AWR): A variant of RWR that adds a learned value function; token log-likelihoods are weighted by exponentiated advantage estimates (reward minus predicted value) rather than by the raw reward.
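As a concrete illustration of conditional training, the sketch below prepends a control token to each training segment according to whether its preference score clears a threshold, then trains with the ordinary next-token cross-entropy loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the `<|good|>`/`<|bad|>` token names, the threshold value, and the Hugging Face-style model interface are assumptions made for illustration, and the preference scores would in practice come from a task-specific reward model or classifier (e.g. a toxicity scorer).

```python
# Hypothetical control tokens and threshold, chosen for illustration only.
GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.5


def label_segment(text: str, preference_score: float) -> str:
    """Prepend a control token marking the segment as preferred or not."""
    token = GOOD if preference_score >= THRESHOLD else BAD
    return f"{token} {text}"


def conditional_training_step(model, tokenizer, texts, scores, optimizer):
    """One MLE step on control-token-prefixed segments.

    Assumes a Hugging Face-style causal LM whose forward pass returns `.loss`
    when `labels` are given, and a tokenizer to which GOOD/BAD were added as
    special tokens (with the embedding matrix resized accordingly).
    """
    labeled = [label_segment(t, s) for t, s in zip(texts, scores)]
    batch = tokenizer(labeled, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    # The loss itself is ordinary next-token cross-entropy; the only change
    # relative to vanilla MLE is the control token at the start of each segment.
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

At inference time, generation would be conditioned on the preferred control token so that the model samples from the distribution of preferred text.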
Each objective is evaluated against standard MLE along two axes: alignment, measured by how often sampled outputs contain undesired content, and preservation of the LM's general capabilities, measured by KL divergence from a highly capable model (GPT-3) and by task-specific evaluations.
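The capability side of this evaluation can be pictured with a simple Monte Carlo estimate of a KL divergence between two language models: sample sequences, score them under both models, and average the log-probability gap. The sketch below is an illustrative assumption rather than the paper's exact protocol: it assumes both models are local HF-style causal LMs sharing a tokenizer (the paper's reference model is GPT-3, accessed through its API), and it fixes one direction of the KL and one sampling distribution, which may differ from the paper's setup.

```python
import torch


@torch.no_grad()
def sequence_logprob(model, tokenizer, text: str) -> float:
    """Sum of next-token log-probabilities of `text` under an HF-style causal LM."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    logits = model(input_ids=ids).logits[:, :-1, :]   # predictions for positions 1..n
    targets = ids[:, 1:]                              # the tokens actually observed
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, targets.unsqueeze(-1)).sum().item()


def estimate_kl(samples, eval_model, ref_model, tokenizer) -> float:
    """Monte Carlo estimate of KL(eval || ref) from sequences sampled from eval_model.

    KL(p || q) = E_{x ~ p}[log p(x) - log q(x)]; the expectation is approximated
    by averaging the per-sequence log-probability gap over the samples.
    """
    gaps = []
    for text in samples:
        lp_eval = sequence_logprob(eval_model, tokenizer, text)
        lp_ref = sequence_logprob(ref_model, tokenizer, text)
        gaps.append(lp_eval - lp_ref)
    return sum(gaps) / len(gaps)
```

In practice one would also normalize by the number of tokens so that estimates are comparable across samples of different lengths.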
Results and Implications
The paper's experiments show that conditional training consistently provides a strong alignment-capability trade-off, reducing undesired content across all three tasks (toxicity, PII leakage, and PEP8 compliance) without impairing the LM's general capabilities or downstream performance on benchmarks such as GLUE. In many settings, conditional training decreases the frequency of undesirable content in LM outputs by up to an order of magnitude, outperforming even finetuning-based alignment applied after standard pretraining.
Conditional training also avoids the text degeneration and loss of output diversity that some alignment mechanisms inadvertently cause. It improves adversarial robustness as well: models pretrained with the conditional objective are notably less susceptible to adversarial prompts than baseline MLE-pretrained models.
These results point to a shift in LM training practice: taking human preferences into account from the earliest stages of training can be more effective than current approaches that postpone alignment to later stages such as finetuning or rule-based filtering. Aligning during pretraining avoids having to unlearn undesirable behavior acquired through large-scale text imitation and sidesteps the performance degradation that abrupt post-pretraining interventions can cause.
Future Directions
Reducing undesirable behaviors through pretraining with human feedback opens several directions for future work. Practically, the results suggest refining the reward functions used to score training data, evaluating alignment on tasks beyond the initial three, and applying conditional training to other model architectures and scales. Theoretically, open questions include the trade-offs between generalization and robustness that conditional pretraining introduces, particularly as models scale in parameters and data. Integrating richer, more dynamic human feedback throughout pretraining could further improve how LMs adapt to changing deployment conditions while maintaining both safety and performance.
In summary, the proposed shift to pretraining objectives that incorporate human preferences challenges the status quo of LM alignment, offering strategies that improve safety and reliability without sacrificing model capability.