SLiC-HF: Sequence Likelihood Calibration with Human Feedback
- SLiC-HF is a method that calibrates sequence likelihoods of large language models to match human feedback using contrastive pairwise ranking.
- It integrates on-policy and off-policy human data through direct and sample-based ranking approaches, reducing computational and memory overhead compared to RLHF.
- Empirical results on tasks like Reddit TL;DR summarization show improved win-rates, ROUGE scores, and enhanced confidence calibration metrics.
Sequence Likelihood Calibration with Human Feedback (SLiC-HF) is a framework for aligning the output likelihoods of LLMs with human preferences using contrastive, sequence-level calibration. SLiC-HF extends the original Sequence Likelihood Calibration (SLiC) method to leverage human comparison data or reward models, and presents a computationally efficient and practically robust alternative to Reinforcement Learning from Human Feedback (RLHF) methods such as PPO-based RLHF. SLiC-HF integrates both on-policy and off-policy human feedback and achieves strong empirical results on high-stakes sequence generation tasks, such as the Reddit TL;DR summarization benchmark, with significant improvements in both human and automatic evaluation metrics (Zhao et al., 2023).
1. Foundations of SLiC and Extension to Human Preferences
The original SLiC framework is a sequence-level contrastive fine-tuning approach designed to align a model’s likelihood estimates with an externally specified ranking function over sequences, such as ROUGE scores in summarization. Instead of using RL to maximize expected reward, SLiC directly modifies the standard cross-entropy loss by adding a pairwise ranking margin term over full sequences. The workflow consists of two stages: supervised fine-tuning (SFT) of the model using (input, reference) pairs, followed by “calibration” — adjusting the model parameters so that the model’s sequence probabilities rank “good” generations higher than “bad” ones according to the external ranking function.
SLiC-HF extends this by replacing the external ranking function with a ranking derived from human preferences. It uses human-annotated paired data D_HF = {(x, y⁺, y⁻)}, where y⁺ and y⁻ are side-by-side generations with y⁺ annotated as preferred. These can be incorporated directly, even if generated by a different model (“off-policy”), or via a learned reward or ranking model, resulting in two training regimes: SLiC-HF-direct and SLiC-HF-sample-rank (Zhao et al., 2023).
2. Mathematical Objectives and Loss Functions
SLiC-HF defines its loss as a sum of a pairwise calibration term and a regularization term:
- Reward Model Training: From D_HF, a pointwise reward model r_φ(x, y) is trained to score preferred generations above dispreferred ones, e.g. as a binary classifier with a cross-entropy objective over "good"/"bad" targets: L(φ) = −log P_φ(good | x, y⁺) − log P_φ(bad | x, y⁻). Alternatively, a pairwise ranking model c_ψ(x, y_a, y_b), which predicts the preferred candidate of a pair, is trained analogously.
- Calibration Loss (no HF): L_cal(θ) = max(0, δ − log P_θ(y⁺ | x) + log P_θ(y⁻ | x)), with margin hyperparameter δ.
- SLiC-HF Loss: For calibration with human feedback, the hinge term is combined with a cross-entropy regularizer: L(θ) = max(0, δ − log P_θ(y⁺ | x) + log P_θ(y⁻ | x)) − λ log P_θ(y_reg | x).
The regularization weight λ is typically set via grid search, and the regularization target y_reg can be the gold reference or the highest-ranked sampled decode; the margin δ is likewise tuned by grid search.
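As a concrete illustration, the combined objective for a single (x, y⁺, y⁻) pair can be sketched in a few lines of Python. This is a minimal sketch with scalar sequence log-probabilities; the default δ and λ values are illustrative, not the paper's tuned settings:

```python
def slic_hf_loss(logp_pos, logp_neg, logp_reg, delta=1.0, lam=0.5):
    """SLiC-HF objective for one example: hinge calibration + CE regularizer.

    logp_pos / logp_neg: model log-probabilities of the preferred and
    dispreferred sequences; logp_reg: log-probability of the regularization
    target (gold reference or best-ranked decode).
    """
    calibration = max(0.0, delta - logp_pos + logp_neg)  # pairwise hinge with margin delta
    regularizer = -logp_reg                              # cross-entropy term
    return calibration + lam * regularizer
```

When the preferred sequence already outranks the dispreferred one by more than δ, the hinge vanishes and only the regularizer contributes: `slic_hf_loss(-2.0, -5.0, -3.0)` returns 1.5 (hinge 0, regularizer 0.5 × 3.0).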
3. Methods of Integrating Human Preference Data
SLiC-HF accommodates human feedback in two principal manners:
- SLiC-HF-direct: For each (x, y⁺, y⁻) in D_HF, the model minimizes the calibration loss plus regularizer without decoding new candidates. This simplicity facilitates implementation but can induce instability when the feedback samples diverge from the SFT model distribution.
- SLiC-HF-sample-rank: For each x from D_SFT, m candidate generations are sampled from the SFT model. These candidates are ranked using a pretrained reward or ranking model derived from human feedback. Positive/negative pairs are constructed and used to compute the calibration loss. This mode ensures the model is updated on its own output distribution, improving stability relative to policy drift without requiring on-policy human feedback collection.
The following table summarizes SLiC-HF training modes:
| Mode | Input Pairs | Decoding | Ranking Source |
|---|---|---|---|
| SLiC-HF-direct | (y⁺, y⁻) drawn from D_HF | None | Human preference labels |
| SLiC-HF-sample-rank | Pairs built from m sampled candidates | Yes (m samples per input) | Reward model r_φ or ranking model c_ψ |
Editor’s term: “Mode” to compactly categorize protocol variants.
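The core of sample-rank mode — turning m sampled candidates into calibration pairs — can be sketched as follows. This is a minimal sketch: `reward_fn` stands in for a score derived from a trained r_φ or c_ψ, and pairing the i-th best with the i-th worst is one simple scheme, not necessarily the paper's exact choice:

```python
def build_calibration_pairs(candidates, reward_fn, max_pairs=4):
    """Rank sampled candidates by a reward score and form
    (positive, negative) pairs for the calibration loss."""
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    # Pair the i-th best candidate with the i-th worst candidate.
    n = len(ranked)
    return [(ranked[i], ranked[n - 1 - i]) for i in range(min(max_pairs, n // 2))]
```

For example, with four candidates scored by length, the best is paired with the worst and the second-best with the second-worst.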
4. Algorithmic Implementation and Pseudocode
A representative pseudocode for SLiC-HF-sample-rank training on T5-Large is as follows:
```
D_SFT = {(x, y_ref)}   # supervised data
D_HF                   # human-preference triples, or a pretrained r_phi / c_psi
theta = pretrained SFT weights
hyperparams = {delta=1.0, alpha=8.0, m=8, T=0.7, top_k=40, lr=1e-5, batch=32}

for epoch in range(1, E + 1):
    for B in batches(D_SFT, size=batch):
        L_batch = 0
        for x in B:
            if direct:
                # SLiC-HF-direct: human-labeled pairs, no decoding
                ranked_pairs = get_pairs_from_D_HF(x)
            else:
                # SLiC-HF-sample-rank: sample m candidates, rank with r_phi or c_psi
                candidates = sample_from_model(theta, x, m, T, top_k)
                ranked_pairs = rank_candidates(candidates, r_phi or c_psi)
            L_cal = sum(max(0, delta - log_p(theta, y_plus|x) + log_p(theta, y_minus|x))
                        for (y_plus, y_minus) in ranked_pairs)
            y_reg = y_ref or best_ranked_candidate      # regularization target
            L_reg = -log_p(theta, y_reg|x)
            L_batch += L_cal + alpha * L_reg            # alpha = regularization weight
        theta -= lr * gradient(theta, L_batch)
```
In SLiC-HF-direct, the candidate sampling and ranking stages are skipped and (y⁺, y⁻) pairs are drawn directly from D_HF (Zhao et al., 2023).
5. Empirical Performance and Experimental Analysis
Experimental validation centers on the Reddit TL;DR summarization task, with D_SFT containing supervised summaries and D_HF comprising 64k human preference triples from prior model generations. Results are reported in terms of win-rate (fraction of summaries preferred by a held-out T5-XXL ranker over the original reference), ROUGE, and human evaluations.
Key findings include:
- Supervised T5-Large achieves 44.96% win-rate (ROUGE-1=35.1).
- Continued fine-tuning on the positives in D_HF: 51.65%.
- Filtering best of 8 decodes by reward/ranking model: ≤65.4%.
- SLiC-HF-direct: 82.9%.
- SLiC-HF-sample-rank (m=8): using reward model/regularization on gold: 82.4%, on best decode: 83.5%; using ranking model/regularization on gold: 86.2%, on best decode: 85.5%.
- Human four-way eval: SLiC-HF was best 73% of the time, mean quality 3.82/5 vs. 3.32 (continue FT), 96.6% judged factual.
- Against PPO RLHF (Stiennon et al., T5-XXL 6B): T5-Large SLiC-HF (ranking) was preferred 66% of the time (p<0.05), with higher mean quality. The reward-model SLiC-HF variant matched PPO RLHF.
- Scaling the generation model to T5-XXL (11B) lifted SLiC-HF sample-rank to 96.1% win-rate; increasing m beyond 8 provided only marginal benefit.
Ablation and sensitivity studies reveal: SLiC-HF-direct can drift in output length, while sample-rank is more stable. Ranking models outperform reward models by ≈2% accuracy on held-out feedback data and by ≈3% in win-rate. Using the best sampled decode versus the gold reference as regularization target has negligible effect, supporting decoupled, reference-free calibration. Hyperparameters exhibit low sensitivity within recommended ranges (Zhao et al., 2023).
6. Architectural and Computational Properties
Compared to PPO-based RLHF, SLiC-HF reduces engineering and compute requirements:
- Memory: SLiC-HF maintains only a single policy model (plus optional reward/ranking model), circumventing multiple large models (policy, value, reward, reference) required by PPO RLHF, yielding approximately 3× memory savings.
- Computational Efficiency: Decoding of candidate generations in SLiC-HF can be parallelized for large batches (~8× standard PPO batch sizes) because the SFT model is fixed during decoding; PPO requires alternating between sampling and gradient steps, limiting throughput.
- Tuning Simplicity: SLiC-HF involves fewer hyperparameters ({δ, λ, learning rate}), and empirical evidence shows less sensitivity and easier optimization compared to PPO’s clip ratios, value losses, and epoch settings.
- Training Speed: SLiC-HF’s gradient steps are nearly as fast as ordinary fine-tuning, due to the absence of decoding or value network computation in the inner loop (Zhao et al., 2023).
7. Calibration Metrics and Confidence Elicitation with Human Feedback
SLiC’s methodological focus enables the calibration of model confidence—making the model’s predicted likelihood closely match the observed empirical correctness rate. Calibration is assessed using:
- Expected Calibration Error (ECE): the coverage-weighted average, over confidence bins, of the absolute difference between mean confidence and observed accuracy within each bin.
- Temperature-scaled ECE (ECE-t): ECE computed after applying a temperature scalar to confidences, optimized over a held-out set.
- Brier Score (BS): Mean squared error between (possibly temperature-scaled) confidences and true labels.
- Selective-Classification AUC: Area under the curve plotting accuracy as a function of confidence threshold coverage.
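The binned ECE definition above can be made concrete with a short NumPy sketch. Equal-width bins over (0, 1] are one common choice; implementations vary in binning scheme:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Coverage-weighted |accuracy - mean confidence| over equal-width bins."""
    conf = np.asarray(confidences, dtype=float)
    hits = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(hits[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight the bin by its coverage
    return ece
```

For instance, predictions at 0.95 confidence that are always correct land in a single bin with gap |1.0 − 0.95| = 0.05, giving ECE 0.05.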
To elicit calibrated confidence from RLHF-tuned LMs that may provide suboptimal conditional probabilities, various protocols are used, including direct verbalization of probability estimates and sampling-based approaches:
| Elicitation Method | Computation | Calibration Outcome |
|---|---|---|
| Label Prob. (K=10) | Sampling | Baseline |
| Verb. 1S top-k | Verbalization | ~50% ECE reduction |
| Verb. 2S, CoT | Rationale + Verbal | No added gain |
| Ling. 1S-human | Phrase mapping | Comparable |
Verbalized confidences often halve the calibration error compared to likelihood-derived scores, without accuracy loss. For instance, on SciQ with GPT-3.5-turbo, ECE drops from 0.256 (label prob.) to 0.132 (verb. top-2) and 0.065 (verb. top-4); on TriviaQA with GPT-4, ECE drops from 0.078 to 0.041 (Tian et al., 2023).
Empirical evidence indicates that RLHF shifts probability mass towards the most-preferred answer, weakening the probabilistic calibration link to actual correctness. Prompting the model for explicit confidence or linguistic hedges (“Just Ask” approach) exploits its latent calibration capabilities acquired during instruction tuning and preference learning (Tian et al., 2023).
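In practice, "Just Ask"-style elicitation reduces to a prompt template plus a parser for the verbalized probability. The following is a minimal sketch; the template wording and output format are illustrative assumptions, not Tian et al.'s exact prompts:

```python
import re

# Hypothetical elicitation template (illustrative, not the paper's wording).
VERBALIZED_PROMPT = (
    "Answer the question, then state the probability (between 0.0 and 1.0) "
    "that your answer is correct.\n"
    "Question: {question}\n"
    "Respond exactly as:\nAnswer: <answer>\nProbability: <probability>"
)

def parse_verbalized_confidence(completion):
    """Extract the self-reported probability from a model completion;
    returns None if no well-formed probability is found."""
    match = re.search(r"Probability:\s*([01](?:\.\d+)?)", completion)
    if match is None:
        return None
    p = float(match.group(1))
    return p if 0.0 <= p <= 1.0 else None
```

The parsed scalar can then be fed directly into the calibration metrics above (ECE, Brier score) in place of likelihood-derived confidence.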
8. Limitations, Variations, and Practical Recommendations
SLiC-HF’s applicability has been robustly demonstrated for sequence-level tasks such as summarization, but generalization to complex, long-form, or multi-step reasoning remains an open direction. There is some prompt and protocol sensitivity in confidence elicitation; explicit calibration sets (20% split for temperature or mapping optimization) can mitigate this. Ablation studies support the robustness of key hyperparameters and regularization targets.
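Fitting the temperature on such a held-out split can be as simple as a one-dimensional grid search minimizing negative log-likelihood. A minimal sketch for scalar confidences follows; scaling log-odds by 1/T is one standard choice, and the grid range is an illustrative assumption:

```python
import numpy as np

def fit_temperature(confidences, correct, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature T minimizing held-out NLL when each confidence p
    is rescaled via sigmoid(logit(p) / T)."""
    p = np.clip(np.asarray(confidences, dtype=float), 1e-6, 1.0 - 1e-6)
    y = np.asarray(correct, dtype=float)
    logodds = np.log(p) - np.log(1.0 - p)
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        q = 1.0 / (1.0 + np.exp(-logodds / T))        # temperature-scaled confidence
        nll = -(y * np.log(q) + (1.0 - y) * np.log(1.0 - q)).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```

Overconfident predictions (e.g. confidence 0.98 with only 75% empirical accuracy) yield a fitted T well above 1, flattening the confidences toward the observed accuracy.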
Best practices for application include: always reporting both raw and temperature-scaled ECE, eliciting at least top-2/top-4 candidates for confidence, and fitting prompt or mapping parameters on a small held-out set (Zhao et al., 2023, Tian et al., 2023).
In summary, Sequence Likelihood Calibration with Human Feedback (SLiC-HF) is a pairwise-contrastive, memory- and compute-efficient method for aligning LLM likelihoods with human preferences using both on- and off-policy data, achieving empirically superior or competitive results to PPO-based RLHF, and supporting high-fidelity, well-calibrated model confidence estimation.