
Instruction Fine-Tuning & RLHF

Updated 19 January 2026
  • Instruction fine-tuning is a process that adapts pretrained LLMs using supervised data, while RLHF leverages human feedback to optimize model outputs.
  • These techniques improve model accuracy, stability, and naturalness by incorporating pairwise comparisons and robust variance reduction methods.
  • Recent advances integrate SFT and RLHF into unified pipelines, enhancing sample efficiency, mitigating reward model variance, and preventing catastrophic forgetting.

Instruction fine-tuning and reinforcement learning from human feedback (RLHF) have become central methodologies for aligning LLMs with human preferences, ensuring that outputs are helpful, harmless, and natural. Instruction fine-tuning uses supervised data to adapt a pretrained LLM to follow user directions but cannot fully specify nuanced human desires; RLHF addresses this by incorporating direct feedback, often collected via pairwise human comparisons or scalar preferences. The contemporary landscape features a diversity of RLHF algorithmic formulations, variance reduction techniques, hybrid pipelines, and robust theoretical work, all aimed at advancing LLM alignment, stability, and sample efficiency across tasks and model scales.

1. Instruction Fine-Tuning: Formalization and Pipeline Design

Instruction fine-tuning (SFT) initializes an LLM to follow explicit user directions. Formally, given input-output pairs $(x, y)$, SFT minimizes the cross-entropy loss $L_\text{sup}(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}_\text{sup}} \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})$, often using curated instruction datasets and templates matched to the deployment context. SFT is implemented either as full fine-tuning or as a parameter-efficient adaptation via adapter-based techniques such as LoRA or QLoRA (Dissanayake et al., 2024). The SFT model ($\pi_\text{ref}$) forms the reference policy for RLHF, providing a strong linguistic and instruction-following prior (Lambert, 16 Apr 2025).
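As a concrete illustration of this loss, the following NumPy sketch computes the masked token-level cross-entropy over response tokens only; the toy shapes and the `prompt_len` masking convention are assumptions for illustration, not an actual LLM forward pass:

```python
import numpy as np

def sft_loss(logits, target_ids, prompt_len):
    """Token-level SFT loss: mean of -log pi_theta(y_t | x, y_<t) over
    the response tokens, with the prompt positions masked out."""
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # log-probability assigned to each gold token
    token_ll = log_probs[np.arange(len(target_ids)), target_ids]
    # average negative log-likelihood over response tokens only
    return -token_ll[prompt_len:].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))       # toy: 6 positions, vocab of 10
targets = rng.integers(0, 10, size=6)   # gold token ids
loss = sft_loss(logits, targets, prompt_len=2)
```

A quick sanity check: with all-zero (uniform) logits the loss reduces to the log of the vocabulary size.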

Limitations of SFT include its inability to represent complex value-dependent trade-offs (e.g. between helpfulness, safety, naturalness), and its susceptibility to catastrophic forgetting when applied in multi-stage fine-tuning pipelines where later updates overwrite previous instruction-following capabilities (Wang et al., 2024).

2. Reinforcement Learning from Human Feedback (RLHF): The Canonical Pipeline

The current practice of RLHF follows a three-stage recipe (Lambert, 16 Apr 2025, Ye et al., 3 Apr 2025, Gaur et al., 2024):

  1. Supervised Instruction Fine-Tuning (SFT): As above, to obtain $\pi_\text{ref}$.
  2. Reward Model Training: Human annotators compare pairs of model responses $(x, y^1, y^2)$, yielding binary labels $z \in \{0, 1\}$ or scalar scores. A parametric reward model $r_\phi(x, y)$ is trained to satisfy $p_\phi(x, y^1, y^2) \approx P(y^2 \succ y^1 \mid x) = \sigma(r_\phi(x, y^2) - r_\phi(x, y^1))$, where $\sigma$ is the logistic sigmoid, implementing the Bradley–Terry (BT) model.
  3. Policy Optimization: The LLM policy $\pi_\theta$ is optimized with respect to $r_\phi$ via trust-region RL methods, typically Proximal Policy Optimization (PPO) or, more recently, Direct Preference Optimization (DPO). The objective is $\max_\theta\, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}[r_\phi(x, y)] - \beta\, D_{\mathrm{KL}}[\pi_\theta(\cdot \mid x) \,\|\, \pi_\text{ref}(\cdot \mid x)]$. The KL penalty preserves pretraining knowledge and prevents reward hacking and language collapse (Du et al., 16 Feb 2025, Lambert, 16 Apr 2025, Dissanayake et al., 2024).
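Stages 2 and 3 reduce to two scalar computations per sample. A minimal sketch with toy scalars (function names are illustrative, not from any particular library):

```python
import numpy as np

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood for a preference pair:
    -log sigma(r_phi(x, y_chosen) - r_phi(x, y_rejected))."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Per-sample PPO-style objective: r_phi(x, y) minus the beta-weighted
    single-sample KL estimate log pi_theta(y|x) - log pi_ref(y|x)."""
    return reward - beta * (logp_policy - logp_ref)
```

The BT loss shrinks as the reward model separates the chosen from the rejected response, and the KL term vanishes only where the policy matches the reference on the sampled completion.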

Variants include reward-weighted SFT (VAR) (Du et al., 16 Feb 2025), best-of-N sampling, and RRHF (Yuan et al., 2023).
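Of these variants, best-of-N is the simplest to state in code: sample N candidates and keep the one the reward model prefers, with no gradient update. A sketch with hypothetical stand-ins for the sampler and reward model:

```python
import itertools

def best_of_n(prompt, sample_fn, reward_fn, n=4):
    """Best-of-N sampling: draw n candidate responses from the policy and
    return the one scored highest by the reward model."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))

# Toy stand-ins (hypothetical): canned responses, reward = response length.
pool = itertools.cycle(["ok", "a much longer answer", "hi"])
pick = best_of_n("prompt", lambda p: next(pool), lambda p, y: len(y), n=3)
# pick == "a much longer answer"
```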

3. Robustness and Enhancements in RLHF Algorithms

Classical RLHF pipelines are sensitive to reward model misspecification, distribution drift, and high estimator variance—especially under intransitive or inconsistent human feedback (Ye et al., 3 Apr 2025). Solutions include:

  • Variance-Reduced Preference Optimization (VRPO): VRPO introduces an auxiliary preference model $p_\eta$ to construct a control variate for the cross-entropy loss, yielding a variance-reduced estimator $\widetilde{L}_n(\theta) = E_n[\ell(X, Y^1, Y^2, Z; \theta)] - E_n\left[\sum_{u \in \{0,1\}} \ell(X, Y^1, Y^2, u; \theta) \cdot p_\eta(X, Y^1, Y^2, u)\right] + \ldots$ VRPO is unbiased and achieves strict variance and mean-squared-error reduction (Theorems 4.1 and 4.2), improving regret bounds and empirical win rates (77–81% on Anthropic HH) over DPO and classical PPO-based RLHF (Ye et al., 3 Apr 2025).
  • Policy Filtration (PF-PPO): Sample-filtering strategies (e.g., Best-Random, Best-Worst) discard or down-weight samples for which the reward model is unreliable, as measured by the coefficient of determination ($R^2$) against a ground-truth signal (Zhang et al., 2024). PF-PPO achieves superior benchmark performance in code and math reasoning (+7.9% on HumanEval for 7B-scale LLMs).
  • Personalized and Continuous Reward Signals (ARF): Moving beyond the binary BT paradigm, ARF derives user-specific continuous reward scores from emotion-driven feedback, augmented by debiasing and dynamic preference tracking (Zhang, 3 Jul 2025). Trace-Biased (TB) fine-tuning is theoretically aligned with PPO/DPO but achieves up to +7.6% improvement over DPO, with stable gradient norms.
  • Multi-Level Preference Learning (MAPL): To improve compliance on complex multi-instruction tasks, MAPL introduces intra- and inter-sample preference augmentation and new objectives targeting both prompt and response structure. MAPL enhances strict instruction-following accuracy by 12–13% over vanilla DPO in multi-constraint evaluation (Sun et al., 19 May 2025).

4. Simplified and Unified Fine-Tuning Paradigms

Recent advances have sought to unify and simplify post-pretraining fine-tuning:

  • Unified Fine-Tuning (UFT): UFT fuses SFT and RLHF (or DPO) into a single-stage loss function using an implicit reward of the form $r_\theta(x, y) = \beta \log(\pi_\theta(y \mid x) / \pi_\text{ref}(y \mid x))$. All feedback (binary, scalar, pairwise) is treated uniformly, preventing the catastrophic forgetting of instruction-following skills seen in multi-stage pipelines (Wang et al., 2024). UFT achieves higher instruction-following (IFEval) and factuality (TruthfulQA) scores than SFT+DPO/UNA.
  • Reward-Weighted SFT / Variational Alignment (VAR): By minimizing $\mathrm{KL}[\pi^* \,\|\, \pi_\theta]$, where $\pi^*(y \mid x) \propto \pi_\text{ref}(y \mid x)\, e^{r(x,y)/\beta}$, VAR formulates RLHF as a reward-driven re-weighted SFT loss: each $(x, y)$ pair is weighted by $e^{r(x,y)/\beta}$ (Du et al., 16 Feb 2025). VAR matches or outperforms DPO in reward, GPT-4 win rate, and training stability.
  • Direct Preference Optimization (DPO) and RRHF: DPO directly optimizes a pairwise preference loss without training a separate reward model, using the log-policy ratio relative to the reference (Lambert, 16 Apr 2025, Dissanayake et al., 2024); RRHF aligns log-probabilities across candidate pools via a ranking loss, absorbing SFT and reward modeling into a single, efficient learning paradigm (Yuan et al., 2023).
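The DPO loss, which also underlies UFT's implicit reward, is compact enough to state directly. A minimal sketch over summed log-probabilities of a chosen/rejected pair (argument names are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO pairwise loss: -log sigma(beta * margin of implicit rewards),
    with implicit reward r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Loss falls when the chosen response gains probability relative to the
# reference faster than the rejected one does.
better = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
worse = dpo_loss(logp_w=-3.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
```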

5. Practical Implementation, Multilinguality, and Efficiency

Instruction fine-tuning and RLHF methodology are highly sensitive to data curation, computational constraints, and domain settings:

  • Pipeline design: OpenBezoar's "cheap and open" pipeline employs synthetic data generation, GPT-4 proxy filtering, and QLoRA adaptation, mitigating distribution shift and enabling resource-efficient alignment for 3B-parameter models (Dissanayake et al., 2024).
  • Multilingual alignment: Okapi demonstrates that the SFT+RLHF (PPO) recipe yields consistent +1–2% accuracy improvement over SFT across 26 languages in knowledge and commonsense tasks, with smaller gains for low-resource languages. RLHF remains robustly advantageous versus SFT even for non-English instruction-tuned LLMs (Lai et al., 2023).
  • Computational efficiency: RLHFSpec applies adaptive speculative decoding and sample reallocation to accelerate the generation bottleneck in RLHF, improving throughput by up to 2.5× and speeding up end-to-end iteration (Wang et al., 4 Dec 2025).
| Implementation | Key Feature | Observed Benefit |
|---|---|---|
| VRPO (Ye et al., 3 Apr 2025) | Variance reduction | 77–81% win rate on HH; lower regret |
| OpenBezoar (Dissanayake et al., 2024) | Cost-effective SFT + DPO | Outperforms similar-size models |
| RLHFSpec (Wang et al., 4 Dec 2025) | Adaptive speculative decoding | >2× RLHF throughput |
| Okapi (Lai et al., 2023) | PPO for multilingual LLM | +1–2% over SFT on 26 languages |

6. Theoretical Foundations, Limitations, and Future Directions

Theoretical analysis of RLHF convergence and robustness has advanced significantly:

  • Global optimality and distributional coupling: Recent work has framed RLHF as a bi-level optimization problem, jointly optimizing the reward model and policy in a coupled loop to mitigate the distribution shift between reward learning and policy data (Gaur et al., 2024). Under weak gradient domination, provable convergence and polynomial sample complexity ($\widetilde{O}(\epsilon^{-7/2})$) are established for neural parameterizations.
  • Variance and sample efficiency: Variance-reduced estimators (VRPO) achieve formal MSE and regret improvements even under reward model misspecification. Active preference elicitation and extensions to multi-turn dialogue and heterogeneous raters remain open research frontiers (Ye et al., 3 Apr 2025).
  • Robustness to misspecification and attacks: Instruction fine-tuning on even a few hundred adversarially crafted examples can empirically remove RLHF-induced safety constraints from high-end commercial models such as GPT-4, illustrating the fragility of current post-training guardrails in the absence of KL regularization or parameter freezing (Zhan et al., 2023).
  • Diversity-generalization tradeoff: Rigorous experimental work reveals that RLHF improves both in- and out-of-distribution performance over SFT, but at a clear cost to output diversity: per-input diversity metrics drop by 70–80 points for RLHF-tuned LLaMA-7B relative to SFT (Kirk et al., 2023).
  • Open challenges: The field continues to face open questions on balancing instruction retention with alignment, mitigating reward hacking, calibrating reward under labeler heterogeneity, sample-efficient online preference learning, and multi-modal/segment-level alignment (Yu et al., 2023).

Instruction fine-tuning and RLHF constitute the core methodologies for aligning LLMs with flexible, evolving human preferences. Their contemporary practice is characterized by sophisticated variance reduction, multi-level learning, personalized and robust reward modeling, and pragmatic fusion of supervised and reinforcement-based objectives. Empirical and theoretical results point to ongoing advances, but fundamental challenges regarding distributional robustness, efficiency, and safety remain active topics of research (Ye et al., 3 Apr 2025, Wang et al., 2024, Dissanayake et al., 2024, Sun et al., 19 May 2025, Lambert, 16 Apr 2025, Du et al., 16 Feb 2025).
