Reinforcement & Preference-Based Fine-Tuning
- The paper presents a framework that integrates reinforcement learning with preference data to align complex models with human judgment.
- It employs methods like Bradley–Terry logistic modeling and Bayesian inference with Laplace approximation to calibrate reward estimates.
- Empirical results show improvements in sample efficiency and task performance across domains such as language generation, image synthesis, and robotics control.
Reinforcement and Preference-Based Fine-Tuning is a paradigm at the intersection of reinforcement learning (RL) and preference learning for aligning high-capacity generative models—especially large neural networks—with human (or proxy) judgment signals. These techniques have become central to the development and deployment of language, vision, and multimodal models, enabling task adaptation, safety, and model interpretability in domains where scalar rewards are difficult to specify but pairwise or ordinal preferences are readily elicitable.
1. Frameworks for Integrating Preferences in Reinforcement Learning
Preference-based fine-tuning methods address the intrinsic limitation of classic RL in real-world domains where reward functions are ill-defined or rest on subjective value judgments, such as natural language generation, image synthesis, and robotics. The central workflow typically comprises:
- Preference data collection: Human or synthetic judges compare outputs (e.g., completions, trajectories, images) and provide pairwise, triplet, or ranking feedback.
- Reward model training: A latent reward function $r_\theta$ is fit via likelihood-based objectives, typically the Bradley–Terry or logistic-dueling model, mapping pairs (context $x$, output $y$) onto scores $r_\theta(x, y)$ such that preferred outputs receive higher scores (Cercola et al., 6 Nov 2025).
- Policy optimization: The generative policy is fine-tuned to maximize expected reward (via PPO, RLOO, GRPO, or DPO), optionally regularized by a KL divergence to a pre-trained reference (Winata et al., 17 Sep 2024).
Preference-based Optimization (PBO) using GPs emphasizes sample efficiency via active querying, but scales poorly to high-dimensional tasks. RLHF (Reinforcement Learning from Human Feedback) scales to large neural models by fitting neural reward heads to human pairwise data and optimizing via policy gradient, but is annotation-intensive. Hybrid frameworks such as Bayesian RLHF (Cercola et al., 6 Nov 2025) combine active acquisition (Dueling Thompson Sampling) with neural scalability by injecting uncertainty into the reward model via a Laplace approximation, enabling calibrated, sample-efficient querying.
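To make the policy-optimization step concrete, the following minimal sketch (PyTorch-style; the function and argument names are illustrative and not drawn from the cited papers) shows how a reward-model score is combined with a KL penalty toward a frozen reference model before being handed to PPO- or GRPO-style updates.

```python
import torch

def shaped_rewards(reward_scores, logprobs_policy, logprobs_ref, beta=0.1):
    """KL-regularized reward used in RLHF-style fine-tuning (illustrative sketch).

    reward_scores:   (B,) scalar reward-model scores for each sampled output
    logprobs_policy: (B, T) per-token log-probs of the sampled tokens under the current policy
    logprobs_ref:    (B, T) the same log-probs under the frozen reference model
    beta:            strength of the KL penalty toward the reference
    """
    # Monte Carlo estimate of the sequence-level KL to the reference policy.
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)
    # Objective maximized by PPO/GRPO: reward-model score minus the KL penalty.
    return reward_scores - beta * kl
```

The KL term is what keeps the fine-tuned policy from drifting far from the pre-trained reference, a regularization that recurs throughout the methods below.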
2. Mathematical Foundations and Acquisition Strategies
Preference modeling universally adopts the Bradley–Terry (BT) logistic model (the probability of preference depends on the reward difference), extended in some settings to Bradley–Terry with ties (BTT) to handle indifference (Liu et al., 5 Oct 2024). The typical loss is the negative Bernoulli log-likelihood under the BT model with a Gaussian prior on the reward parameters,

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N}\Big[y_i \log \sigma\big(r_\theta(x_i, a_i) - r_\theta(x_i, b_i)\big) + (1 - y_i)\log\big(1 - \sigma\big(r_\theta(x_i, a_i) - r_\theta(x_i, b_i)\big)\big)\Big] + \frac{1}{2\sigma_p^2}\lVert\theta\rVert_2^2,$$

where $y_i \in \{0, 1\}$ are binary preference labels over the compared outputs $(a_i, b_i)$ in context $x_i$, $\sigma(\cdot)$ is the logistic function, and the Gaussian prior on $\theta$ (variance $\sigma_p^2$) enables Bayesian inference via Laplace approximation.
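A minimal implementation of this loss, assuming a reward model that already produces scalar scores for each compared output (the helper name and prior-variance default are illustrative):

```python
import torch
import torch.nn.functional as F

def bt_reward_loss(r_chosen, r_rejected, labels, theta=None, prior_var=10.0):
    """Bradley-Terry reward-model loss with an optional Gaussian prior (sketch).

    r_chosen, r_rejected: (N,) reward-model scores for the two compared outputs
    labels:               (N,) binary preferences, 1 if the first output was preferred
    theta:                optional flat parameter vector; adds the Gaussian-prior (L2) term
    """
    logits = r_chosen - r_rejected                      # BT: preference prob = sigmoid(score gap)
    nll = F.binary_cross_entropy_with_logits(logits, labels.float())
    if theta is not None:
        # -log N(theta; 0, prior_var * I) up to an additive constant.
        nll = nll + theta.pow(2).sum() / (2.0 * prior_var)
    return nll
```

In a Bayesian RLHF setting, this Gaussian-prior term is the quantity the Laplace approximation expands around to obtain posterior uncertainty over the reward head.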
Active preference quantification is realized through acquisition modules:
- Exploitation (Sparring): Selects rivals with high win-score proxies via softmax sampling.
- Exploration (MaxVar): Selects rivals maximizing posterior variance of preference probability under Laplace-approximated uncertainty.
- Mixed Strategy: Scores candidates with standardized sparring and variance proxies, balancing exploration and exploitation via a tunable mixing weight (Cercola et al., 6 Nov 2025); a minimal sketch of this scoring appears directly below.
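The sketch below illustrates how such a mixed acquisition score might be assembled; the proxy names, the mixing weight `alpha`, and the softmax temperature are assumptions made for illustration rather than the exact formulation of (Cercola et al., 6 Nov 2025).

```python
import numpy as np

def mixed_acquisition(win_proxy, pref_var, alpha=0.5, temperature=1.0, rng=None):
    """Select the next rival to query by mixing exploitation and exploration (sketch).

    win_proxy: (K,) estimated win-score proxies for K candidate rivals (Sparring signal)
    pref_var:  (K,) Laplace-approximated posterior variance of the preference probability (MaxVar signal)
    alpha:     mixing weight between the standardized proxies (illustrative name)
    """
    rng = rng or np.random.default_rng()
    standardize = lambda v: (v - v.mean()) / (v.std() + 1e-8)
    score = alpha * standardize(win_proxy) + (1.0 - alpha) * standardize(pref_var)
    # Softmax sampling over candidates, as in the Sparring (exploitation) rule.
    logits = (score - score.max()) / temperature
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(score), p=probs)
```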
This machinery is tightly coupled into RLHF pipelines, with preference acquisition driving human annotation, reward-model updates incorporating experimental uncertainty, and policy optimization via PPO or group-wise variants such as GRPO using standardized or rank-aware advantages (Shi et al., 2 Oct 2025, Feng et al., 7 Nov 2025).
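As a brief illustration of the group-wise variants, the advantage computation at the core of GRPO-style updates can be sketched as a within-group standardization of rewards (the function name is illustrative):

```python
import torch

def group_standardized_advantages(rewards):
    """Group-relative advantages in the style of GRPO (illustrative sketch).

    rewards: (G, K) scalar rewards for K sampled responses to each of G prompts.
    Each response's advantage is its reward standardized within its own group,
    so no learned value function is required.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)
```

Rank-aware variants instead derive the advantage from each response's rank within its group rather than the raw reward.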
3. Algorithms and Empirical Performance
A spectrum of fine-tuning algorithms has emerged, differentiated by their architecture, optimization strategy, and trade-off between sample-efficiency and computational scalability:
| Algorithm | Reward Model | Preference Use | Policy Update | KL Reg. | Sample Efficiency / Notes |
|---|---|---|---|---|---|
| RLHF (PPO) | Neural/GP | Pairwise/Ordinal | On-policy, PPO | Yes | High budget |
| Bayesian RLHF | Neural + Laplace | Active Acquisition | PPO | Yes | Efficient, scalable |
| GRPO (and rank-aware variants) | None/Ranker | Ordinal (multi-group) | Group-wise PPO / Ordinal | Yes | Moderate |
| DPO | None | Fixed preference pairs | Cross-entropy | Implicit | Highest simplicity |
| Preference Flow Matching | None | ODE-based, pairwise | Flow field, ODE | N/A | Black-box only |
Bayesian RLHF has shown superior sample and computational efficiency over traditional GP-PBO and RLHF on high-dimensional Rosenbrock optimization problems and LLM fine-tuning benchmarks. Empirically, B-RLHF achieves up to 44% lower final error (2D Rosenbrock) and reaches state-of-the-art LLM test-set accuracy (+6–14% over RLHF under varied budgets) (Cercola et al., 6 Nov 2025). GRPO-style rank-based objectives enable highly data-efficient video-LLM alignment, with accuracy gains of up to 6 points over scalar-reward RLAIF (Shi et al., 2 Oct 2025). DPO and active DPO further improve fine-tuning efficiency: active preference selection reduces labeling requirements and yields win-rate gains of up to 6 percentage points over random selection (Muldrew et al., 12 Feb 2024).
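For comparison with the reward-model-plus-PPO route, the DPO objective listed in the table dispenses with an explicit reward model; the widely used pairwise form is sketched below (variable names are illustrative).

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss in its standard pairwise form (sketch).

    logp_w / logp_l:         policy log-likelihoods of the preferred / rejected outputs
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model
    beta:                    implicit KL-regularization strength
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```

The beta coefficient plays the role of the implicit KL regularization noted in the table.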
4. Extensions Across Modalities and Tasks
Preference-based RLFT architectures have been extended beyond standard RLHF and LLM domains:
- Vision and Multimodal Models: PreResQ-R1 implements a dual-branch reward formulation for visual QA, balancing intra-sample response coherence (fine-grained regression) and inter-sample ranking (ordinal score alignment). A GRPO loss optimizes chain-of-thought outputs, with extensions to video QA via global-temporal and local-spatial rewards. State-of-the-art results (+5.3% SRCC, +2.15% PLCC) are achieved across 15 benchmarks (Feng et al., 7 Nov 2025).
- Federated RLHF: FedBis and FedBiscuit encode binary selectors across clients, aggregate preferences via cluster-wise adapters, and enable privacy-preserving RLHF with competitive win rates (FedBiscuit matches or exceeds centralized agreement and best-of-n scores) (Wu et al., 3 Jul 2024).
- Diffusion Policies and Robotics: FDPP applies preference learning to fine-tune diffusion policies, robustly aligning pre-trained robot control with user preferences, while imposing KL regularization to prevent task-competence collapse (Chen et al., 14 Jan 2025).
- Structured Outputs: Mesh-RFT introduces Masked DPO leveraging topology-aware metrics (Boundary Edge Ratio, Topology Score) for fine-grained mesh refinement, reducing Hausdorff distance and improving topology by 17–25% over baselines (Liu et al., 22 May 2025).
- Behavioral Cloning with Online PB-RL: BRIDGE combines offline safe policy initialization with online preference-based RL, integrating uncertainty-weighted objectives and achieving provably lower regret, especially with large expert datasets (Macuglia et al., 30 Sep 2025).
- Reward Modeling with Ties: Incorporating ties via the BTT model measurably reduces bias in preference gap estimation and consistently improves win rates over the standard BT approach (Liu et al., 5 Oct 2024).
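For concreteness, one standard way to extend the BT model with a tie parameter is the Rao–Kupper formulation sketched below; the BTT model of (Liu et al., 5 Oct 2024) may use a different parameterization, so this is illustrative only:

$$P(a \succ b \mid x) = \frac{\pi_a}{\pi_a + \nu\,\pi_b}, \qquad P(a \sim b \mid x) = \frac{(\nu^2 - 1)\,\pi_a \pi_b}{(\pi_a + \nu\,\pi_b)(\pi_b + \nu\,\pi_a)}, \qquad \nu \ge 1,$$

with $\pi_a = \exp r_\theta(x, a)$; setting the tie parameter $\nu = 1$ recovers the standard BT model with zero tie probability.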
5. Theoretical Insights and Limitations
The theoretical guarantees of preference-based RLFT hinge on problem dimensionality, coverage of exploration, and acquisition strategies:
- Global vs. Partial Coverage: Offline contrastive methods (DPO) require global support—preference data must cover all actions considered by the policy, or risk infinite KL penalties and catastrophic value collapse. Online methods (PPO, RLHF) only require partial coverage within KL-balls since on-policy rollouts dynamically restrict the distribution (Song et al., 3 Jun 2024).
- Regret Bounds: Algorithms such as REGIME and BRIDGE establish sample complexity and cumulative regret bounds, scaling polynomially in feature dimension and horizon, and improving with offline demonstration set size (Zhan et al., 2023, Macuglia et al., 30 Sep 2025).
- Uncertainty Quantification: Laplace posterior approximation and Bayesian regret bounds underpin active querying strategies and provide principled selection in high-dimensional regimes (Cercola et al., 6 Nov 2025).
- Limitations: Reliance on proxy annotators (vs. human raters), sensitivity to acquisition hyperparameters (e.g., the exploration–exploitation mixing weight), and limited extension to black-box or variable-length output domains remain unresolved. Human tie annotations and multi-aspect feedback are challenging but active areas for future work (Cercola et al., 6 Nov 2025, Liu et al., 5 Oct 2024).
6. Practical Implementation, Design Guidelines, and Impact
Key practical design choices and impact areas:
- Query Budget Management: Active DPO and Bayesian RLHF outperform random selection, substantially reducing required human (or oracle) labels for equivalent or better final performance (Muldrew et al., 12 Feb 2024, Cercola et al., 6 Nov 2025).
- KL Regularization: Both classic RLHF and PB-RLFT pipelines incorporate explicit or implicit KL penalties to preserve the original model’s competencies and prevent overfitting to sparse or noisy preference signals (Chen et al., 14 Jan 2025, Winata et al., 17 Sep 2024).
- Scalability and Computation: A last-layer Laplace approximation enables tractable Bayesian updates in high-dimensional networks (a minimal sketch follows this list), while flow-matching (PFM) streamlines preference alignment as an ODE wrapper for black-box models (Cercola et al., 6 Nov 2025, Kim et al., 30 May 2024).
- Modality-Aware Extensions: Chain-of-thought reasoning activation, topology-aware masks, and federated client aggregation are essential for cross-domain and privacy-sensitive deployment (Feng et al., 7 Nov 2025, Liu et al., 22 May 2025, Wu et al., 3 Jul 2024).
- Empirical Impact: RLHF and preference-based fine-tuning deliver significant gains in LLM factuality, visual QA, mesh generation, and robot manipulation, with state-of-the-art results reported across diverse benchmarks (Cercola et al., 6 Nov 2025, Feng et al., 7 Nov 2025, Tian et al., 2023, Chen et al., 14 Jan 2025).
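A minimal sketch of the last-layer Laplace step referenced above, assuming a linear Bradley–Terry reward head on frozen penultimate-layer features (the function name and prior variance are illustrative):

```python
import torch

def last_layer_laplace_covariance(features_diff, w_map, prior_var=10.0):
    """Posterior covariance of a linear Bradley-Terry reward head via Laplace (sketch).

    features_diff: (N, D) feature differences phi(x, a) - phi(x, b) for N labeled pairs
    w_map:         (D,) MAP estimate of the last-layer weights
    prior_var:     variance of the isotropic Gaussian prior on the weights
    """
    p = torch.sigmoid(features_diff @ w_map)            # predicted preference probabilities
    # Hessian of the BT negative log-likelihood: sum_i p_i (1 - p_i) phi_i phi_i^T
    H = (features_diff * (p * (1 - p)).unsqueeze(1)).T @ features_diff
    precision = H + torch.eye(w_map.numel()) / prior_var
    return torch.linalg.inv(precision)                  # Gaussian posterior covariance at the MAP
```

The resulting covariance is what MaxVar-style acquisition (Section 2) propagates into the variance of preference probabilities for candidate queries.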
7. Future Directions, Benchmarks, and Open Challenges
Preference and reinforcement-based fine-tuning continues to evolve, with open challenges and promising directions including:
- Automated Preference Signals: Leveraging synthetic judges, LLM-based oracles, and active learning to address human annotation cost and feedback scalability (Muldrew et al., 12 Feb 2024, Tian et al., 2023).
- Multimodal and Federated Expansions: Bridging modality gaps, accommodating client heterogeneity, and developing scalable privacy-safe protocols (Wu et al., 3 Jul 2024, Shi et al., 2 Oct 2025).
- Bias Correction and Rich Feedback: Explicit modeling of tie bias, multi-aspect preferences, and robust aggregation strategies (Liu et al., 5 Oct 2024, Feng et al., 7 Nov 2025).
- Coverage-Aware Pipelines: Selecting DPO, RLHF, or hybrid algorithms (HyPO) based on coverage, sample diversity, and computational constraints (Song et al., 3 Jun 2024).
- Mechanistic Understanding: Empirical demonstration that online RL-based fine-tuning (PPO, GRPO) enhances internal model activation intensity and diversity, while static preference optimization (DPO) leaves model circuits largely unchanged (Zhang et al., 25 Sep 2025).
- Comprehensive Benchmarking: RewardBench, MetaMetrics, and AlpacaEval, among others, are being advanced for modality-unified and metric-calibrated evaluation (Winata et al., 17 Sep 2024).
- Theoretical and Practical Integration: Unifying reward modeling, policy optimization, and preference aggregation in robust, adaptive pipelines for scalable model alignment.
Taken together, reinforcement and preference-based fine-tuning represents the state-of-the-art for aligning complex ML systems with subjective human values across language, vision, robotics, and beyond, with principled frameworks achieving increasingly high sample efficiency, cross-domain generalizability, and empirical robustness.