FRPO: Robust Policy Fine-Tuning

Updated 3 July 2026

FRPO is a framework that enhances policy fine-tuning by optimizing for flat optima that resist small perturbations and distribution shifts.
It unifies methods from continuous control, RLHF, DRO, and robust MDPs to mitigate issues like catastrophic forgetting and brittle local optima.
Empirical studies show FRPO improves reward stability and safety preservation across diverse tasks, ensuring robust downstream performance.

Fine-tuning Robust Policy Optimization (FRPO) is a general framework for enhancing the robustness of policy fine-tuning in machine learning, particularly in reinforcement learning and RLHF (Reinforcement Learning from Human Feedback) settings. It encompasses several algorithmic families, all unified by the goal of ensuring that candidate policies not only maximize in-distribution reward but also maintain performance when subject to small perturbations, distributional shifts, or downstream updates. FRPO addresses vulnerabilities such as catastrophic forgetting, brittle local optima, and distribution drift, and is instantiated in continuous control, LLM post-training, and reward/preference-based robust RL scenarios.

1. Motivation and Conceptual Foundations

FRPO was motivated by the observation that standard policy optimization, whether in deep RL or RLHF, often converges to high-reward solutions that are fragile to changes in task conditions, reward functions, or further fine-tuning. In multi-stage RLHF for LLMs, downstream supervised or RL fine-tuning can severely compromise previously learned behaviors (e.g., safety, reasoning), a phenomenon known as catastrophic forgetting (Sabbaghi et al., 9 Feb 2026). Similarly, RL agents optimized by classic policy-gradient approaches may become overly deterministic and lose exploration capacity as training progresses (Rahman et al., 2022). Standard remedies (rehearsal, regularization, model merging) are applied at the downstream fine-tuning stage, whereas FRPO builds robustness into the base policy optimization itself. The core principle is to maximize not only the policy's expected reward but also its worst-case reward over all plausible small shifts in policy or data distribution, thus seeking "flat" optima that are less susceptible to post-hoc degradation (Mandal et al., 1 Mar 2025).

2. FRPO Variants: Problem Settings and Robust Objectives

FRPO encompasses several instantiations, each tailored to a different robustness geometry or learning scenario:

Entropy-perturbed continuous control (RPO-based FRPO): For continuous action RL, FRPO may refer to Robust Policy Optimization (RPO), which injects controlled, state-independent perturbations into the policy mean, maintaining high policy entropy and exploration. This variant targets robustness against policy collapse during training and is readily applicable to any PPO-like update (Rahman et al., 2022).
KL-robust RLHF (“Max-Min” FRPO): For alignment and RLHF, FRPO formalizes robustness to fine-tuning by directly optimizing a max–min objective: the policy's worst-case expected reward over all policies within a KL ball of the base policy—corresponding to plausible downstream SFT or RL updates. This leads to an “entropic risk” regularization, emphasizing reward stability under policy shifts (Sabbaghi et al., 9 Feb 2026).
Distributionally robust preference-based RL (DRO-based FRPO): Here, FRPO targets robustness to distribution shift in prompts or environmental situations, minimizing the worst-case reward or preference loss across all source distributions within a data-divergence ball (e.g., TV, χ²) of the training distribution (Mandal et al., 1 Mar 2025).
Robust MDP policy optimization: In classical robust RL, FRPO may refer to fine-tuning with objectives that maximize the policy’s value under adversarial transitions within an uncertainty set, using algorithms such as Robust Policy Mirror Descent (RPMD) (Li et al., 2022).

FRPO Variant	Robustness Set	Representative Objective
Continuous control (RPO)	Perturb. noise (α)	Gaussian + uniform mean noise
RLHF (KL-ball)	KL(π′‖π_base) ≤ ρ	Entropic risk: –λ log E exp(–r/λ)
Dist.-robust RLHF/DPO	d_ϕ(D, D_src) ≤ ρ	Min-remap over worst-case minibatch weights
Robust MDP	U_{s,a} transition set	min_π max_u E[cost]

3. Mathematical Formulations and Algorithms

Continuous Control with Policy Perturbations

In the RPO-based approach, the standard Gaussian policy $\pi_\theta(a|s) = N(a; \mu(s;\theta), \sigma(s;\theta))$ is perturbed:

$\text{Sample } z \sim U[-\alpha, \alpha],\quad a \sim N(\mu(s;\theta)+z, \sigma(s;\theta))$

The effective action distribution $q_{\theta, \alpha}$ is the convolution of the Gaussian density with the uniform density on $[-\alpha, \alpha]$. It is wider than the unperturbed Gaussian, which keeps policy entropy from collapsing and sustains exploration. In fine-tuning, α can be fixed or adapted (e.g., by matching a target entropy) (Rahman et al., 2022).
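The sampling rule above can be illustrated with a minimal NumPy sketch. The function name `sample_rpo_actions` and the choice to draw an independent `z` for every action component are assumptions made for illustration; they are not taken from the cited paper's code.

```python
import numpy as np

def sample_rpo_actions(mu, sigma, alpha, rng):
    """Draw actions a ~ N(mu + z, sigma) with z ~ U[-alpha, alpha].

    Assumption: one independent z per action component.
    """
    mu = np.asarray(mu, dtype=float)
    z = rng.uniform(-alpha, alpha, size=mu.shape)  # state-independent mean shift
    return rng.normal(loc=mu + z, scale=sigma)     # Gaussian around the shifted mean

rng = np.random.default_rng(0)
mu = np.zeros(200_000)                 # many copies of a one-dimensional mean
plain = sample_rpo_actions(mu, 0.5, 0.0, rng)      # alpha = 0: ordinary Gaussian policy
perturbed = sample_rpo_actions(mu, 0.5, 0.5, rng)  # alpha = 0.5: perturbed policy
```

Because the uniform shift and the Gaussian noise are independent, the variance of the perturbed action is $\sigma^2 + \alpha^2/3$, so the perturbed samples are measurably wider than the plain ones.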

Entropic-Risk Optimization (RLHF Max-Min)

FRPO for RLHF involves solving

$\max_{\pi_\theta} \left\{ \inf_{Q: \mathbb{E}_x D_{KL}(Q(\cdot|x)\|\pi_\theta(\cdot|x)) \leq \rho} \mathbb{E}_{x,y \sim Q}[r(x,y)] - \beta \mathbb{E}_x D_{KL}(\pi_\theta(\cdot|x)\|\pi_\text{ref}) \right\}$

Fenchel duality yields the entropic-risk objective

$J_\lambda(\pi_\theta) = -\mathbb{E}_x \left[ \lambda \log \mathbb{E}_{y \sim \pi_\theta(\cdot|x)}\left[e^{-r(x,y)/\lambda}\right] \right] - \beta \mathbb{E}_x D_{KL}(\pi_\theta(\cdot|x)\|\pi_\text{ref})$

with $\lambda > 0$ controlling conservatism. As $\lambda \to \infty$, the objective recovers standard RLHF; a finite $\lambda$ penalizes reward variance, promoting “flat” solutions that are robust to small policy changes. Implementation proceeds as a modification of group-based PPO, with minibatch importance weighting and bias correction via the jackknife (Sabbaghi et al., 9 Feb 2026).
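To make the role of $\lambda$ concrete, the sketch below computes the plug-in estimate of the reward term $-\lambda \log \mathbb{E}[e^{-r/\lambda}]$ for one group of sampled rewards. The function name `entropic_risk_value` and the example rewards are hypothetical, and the jackknife bias correction mentioned above is deliberately left out.

```python
import numpy as np

def entropic_risk_value(rewards, lam):
    """Plug-in estimate of -lam * log(mean(exp(-r / lam))) for one group."""
    r = np.asarray(rewards, dtype=float)
    x = -r / lam
    m = x.max()                                   # shift for numerical stability
    log_mean_exp = m + np.log(np.mean(np.exp(x - m)))
    return -lam * log_mean_exp

# Hypothetical group: three good completions and one bad one.
group = np.array([1.0, 0.9, 1.1, -2.0])
robust = entropic_risk_value(group, lam=0.5)      # pulled toward the worst reward
near_mean = entropic_risk_value(group, lam=1e4)   # close to the plain average
```

The value always lies between the minimum and the mean of the rewards. For large $\lambda$ a second-order expansion gives approximately $\mathbb{E}[r] - \mathrm{Var}(r)/(2\lambda)$, which is the variance penalty referred to above.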

Distributionally Robust Optimization (DRO-FRPO)

Given a divergence $d_\phi$ (e.g., TV or χ²) and robustness radius ρ, FRPO solves:

$\max_\pi \min_{D: d_\phi(D, D_\text{src}) \leq \rho} \mathbb{E}_{x \sim D, y \sim \pi(\cdot|x)} \left[ \widehat{r}(x, y) - \beta \log \frac{\pi(y|x)}{\pi_\text{ref}(y|x)} \right]$

The robust minibatch SGD algorithm computes worst-case reweightings of the samples in each batch, then applies the correspondingly modified gradients for both reward model training and policy optimization. Sample-complexity guarantees for the policy optimization step are given in (Mandal et al., 1 Mar 2025).
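For the total-variation case the inner minimization over a minibatch has a simple closed form, which the sketch below illustrates. It assumes uniform nominal weights and a TV ball defined as half the L1 distance; the function name `tv_worst_case_weights` is hypothetical, and other divergences such as χ² generally need a convex solver instead of this greedy rule.

```python
import numpy as np

def tv_worst_case_weights(values, rho):
    """Weights q minimizing sum(q * values) subject to TV(q, uniform) <= rho.

    Greedy solution of the linear program: remove up to rho of probability
    mass from the highest-value samples and place it on the lowest-value one.
    """
    v = np.asarray(values, dtype=float)
    q = np.full(v.size, 1.0 / v.size)   # nominal (uniform) minibatch weights
    remaining = rho
    for i in np.argsort(-v):            # visit samples from highest value down
        if remaining <= 0:
            break
        take = min(q[i], remaining)
        q[i] -= take
        remaining -= take
    q[np.argmin(v)] += rho - remaining  # moved mass goes to the worst sample
    return q

values = np.array([1.0, 2.0, 3.0, 4.0])  # per-sample objective values
q = tv_worst_case_weights(values, rho=0.25)
robust_value = float(q @ values)          # 1.75, versus the plain mean of 2.5
```

In a training step the weights `q` would replace the uniform `1/n` factors when averaging per-sample losses, so that gradients emphasize the samples on which the current policy does worst.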

Robust MDP Fine-tuning

For robust MDPs with transition uncertainty, FRPO (via RPMD) performs a mirror-descent policy update separately at each state, driven by the robust (worst-case) value of the current policy. Stochastic robust TD methods are used to estimate that robust value, and fine-tuning starts from a nominal policy, incrementally increasing robustness (Li et al., 2022).
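As an illustration of a per-state mirror-descent step, the sketch below uses the KL divergence as the Bregman distance, for which the update has the closed form $\pi_{k+1}(a|s) \propto \pi_k(a|s)\, e^{-\eta Q(s,a)}$. The KL choice, the cost-minimization sign convention, the step size `eta`, and the array `q_robust` (standing in for an already estimated robust action-cost table) are all assumptions made for this example and are not details confirmed by the text above.

```python
import numpy as np

def kl_mirror_descent_step(pi, q_robust, eta):
    """One KL mirror-descent update applied independently to every state.

    pi:       (n_states, n_actions) strictly positive policy table
    q_robust: (n_states, n_actions) estimated worst-case costs (lower is better)
    """
    logits = np.log(pi) - eta * q_robust
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

pi = np.full((2, 3), 1.0 / 3.0)                   # uniform starting policy
q_robust = np.array([[1.0, 0.0, 2.0],             # state 0: action 1 is cheapest
                     [0.5, 0.5, 0.5]])            # state 1: all actions tie
new_pi = kl_mirror_descent_step(pi, q_robust, eta=1.0)
```

Probability mass moves toward low-cost actions in state 0 and stays uniform in state 1, where the estimated costs are equal.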

4. Fine-Tuning Procedures and Implementation

The following high-level pseudocode captures FRPO instantiations:

Collect rollouts or preference/minibatch data using a reference or lagged policy.
For each minibatch:
- Compute robustness-induced perturbations:
  - RPO: sample uniform noise, modify means.
  - DRO: solve convex problem for worst-case batch weights.
  - RLHF-max-min: compute entropic risk terms over group samples.
- Estimate gradients of corresponding robust objectives.
- Apply policy or value updates with selected optimizer and hyperparameters.
Optionally adjust robustness parameters (e.g., α, ρ, λ) online.
For iterative algorithms: periodically update reference policies or groupings as in PPO.
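The loop above can be made concrete on a toy problem. The sketch below follows the RLHF-max-min branch for a three-armed bandit with a softmax policy: it samples a group, forms entropic-risk weights, and takes a gradient step. Everything here is hypothetical (the function names, the bandit, the step size, and the simple `1/group` baseline), and it omits the KL penalty to a reference policy, PPO clipping, and the jackknife correction.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def train_toy_entropic_risk(rewards, lam=1.0, lr=0.5, steps=300, group=16, seed=0):
    """Gradient ascent on -lam * log E[exp(-r / lam)] for a softmax bandit."""
    rewards = np.asarray(rewards, dtype=float)
    rng = np.random.default_rng(seed)
    theta = np.zeros(rewards.size)                 # policy logits
    for _ in range(steps):
        pi = softmax(theta)
        acts = rng.choice(rewards.size, size=group, p=pi)  # collect a group
        r = rewards[acts]
        w = softmax(-r / lam)                      # entropic-risk sample weights
        coeff = -lam * (w - 1.0 / group)           # baseline-centered coefficients
        grad = np.zeros_like(theta)
        for a, c in zip(acts, coeff):
            score = -pi.copy()                     # d log pi(a) / d theta
            score[a] += 1.0
            grad += c * score
        theta += lr * grad                         # ascent step on the objective
    return softmax(theta)

final_pi = train_toy_entropic_risk([0.0, 0.5, 1.0])
```

Low-reward samples receive the largest weights `w` and therefore the most negative coefficients, so the update mainly pushes probability away from bad outcomes; this is the sense in which the objective is more conservative than averaging rewards.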

Batch sizes, network architectures, and auxiliary tricks (advantage normalization, value clipping, orthogonal weight init) follow those of standard PPO or RLHF pipelines. For DRO-FRPO, solving the reweighting step is efficiently done via convex solvers and does not dominate compute for practical minibatch sizes (Mandal et al., 1 Mar 2025).

5. Empirical Results and Performance Analysis

FRPO methods deliver enhanced robustness and stability across a variety of domains:

Continuous Control: On DeepMind Control tasks, RPO-based FRPO achieves on average a 3× return improvement over PPO and maintains higher, stable entropy throughout training. In long-duration runs, RPO prevents the catastrophic decline typical of PPO (Rahman et al., 2022).
LLMs: In RLHF, FRPO significantly reduces post-fine-tuning safety degradation. For example, after downstream SFT, refusal rates on harmful prompts decrease far less than for standard RLHF, preserving up to 40% more safety. In multi-task continual learning (e.g., math → code), FRPO preserves up to 22 percentage points more held-out accuracy after domain shifts (Sabbaghi et al., 9 Feb 2026).
Distributional Robustness: On out-of-distribution (OOD) alignment leaderboards and reasoning datasets, DRO-FRPO improves accuracy by 2–5 points on aggregate, with 10–25% relative gains on challenging reasoning subsets compared to vanilla preference optimization. On leading benchmarks (HHH Alignment, MT-Bench), policy accuracy and human-evaluated win rates also improve (Mandal et al., 1 Mar 2025).
Sample Complexity: Theoretical results establish sample-complexity bounds for robust policy optimization methods (including DRO-FRPO and robust MDPs) under policy SGD, with linear convergence in iterations, matching or closely tracking the nonrobust settings (Mandal et al., 1 Mar 2025, Li et al., 2022).

6. Practical Guidance and Hyperparameter Selection

Effective deployment of FRPO involves properly tuning robustness parameters:

Parameter	Typical Range/Default	Guidance
Perturb strength α	0.5 (RPO); [0.1, 3.0]	Increase if entropy collapses, decrease if reward degrades (Rahman et al., 2022)
KL-ball λ	0.2–2.0	Lower λ: more robust, less reward; match to expected post-fine-tuning drift (Sabbaghi et al., 9 Feb 2026)
DRO radius ρ	0.2–0.4	Select via OOD validation; larger ρ for more robustness at some in-distribution cost (Mandal et al., 1 Mar 2025)
SGD/minibatch	Standard PPO/LoRA values	As used in vanilla RLHF pipelines
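The ranges in the table can be kept next to the training configuration so that out-of-range settings are flagged early. The sketch below is one hypothetical way to do this; the class name `FRPOHyperparams` is invented, and the defaults for `lam` and `rho` are arbitrary values inside the tabulated ranges rather than recommendations from the cited papers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FRPOHyperparams:
    alpha: float = 0.5   # RPO perturbation strength (default from the table)
    lam: float = 1.0     # entropic-risk temperature (assumed value in 0.2-2.0)
    rho: float = 0.3     # DRO radius (assumed value in 0.2-0.4)

    def out_of_range(self):
        """Return the names of parameters outside the tabulated typical ranges."""
        ranges = {"alpha": (0.1, 3.0), "lam": (0.2, 2.0), "rho": (0.2, 0.4)}
        return [name for name, (lo, hi) in ranges.items()
                if not lo <= getattr(self, name) <= hi]

default_flags = FRPOHyperparams().out_of_range()        # []
wide_flags = FRPOHyperparams(rho=0.9).out_of_range()    # ["rho"]
```

A value outside a typical range is not necessarily wrong, so the check reports names rather than raising an error.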

Standard architecture choices (two-layer MLP for control, LLaMA/Mistral-family LLMs for RLHF fine-tuning) apply. In practice, RPO-based FRPO requires no additional entropy bonus because the perturbations induce sufficient exploration. For DRO-FRPO, the key additional step is solving the robust minibatch reweighting problem, which can be done efficiently with common Python convex solvers.

7. Theoretical and Practical Implications

FRPO introduces a principled mechanism to trade off expected performance against reward variability or OOD vulnerability. The theoretical underpinnings relate to coherent and entropic risk measures, minimax optimization (max–min over policy or data balls), and robust mirror-descent methods. In RLHF, this directly curtails catastrophic forgetting of capabilities under standard fine-tuning, addressing a fundamental limitation of current LLM alignment pipelines (Sabbaghi et al., 9 Feb 2026). In robust classical RL, FRPO operationalizes risk-sensitive planning and directly bounds the worst-case degradation under model error (Li et al., 2022).

A plausible implication is that as RL and RLHF methods encounter more diverse and rapidly evolving downstream requirements, FRPO will become central to ensuring retention, safety, and transferability of both behaviors and capabilities across domains. Extensions remain to be fully realized, including robust RL with nonlinear policy classes, efficient scaling of reweighting algorithms, and addressing structural or coupled uncertainties beyond the (s,a)-rectangular case (Mandal et al., 1 Mar 2025, Li et al., 2022).

References

"Robust Policy Optimization in Deep Reinforcement Learning" (Rahman et al., 2022)
"Robust Policy Optimization to Prevent Catastrophic Forgetting" (Sabbaghi et al., 9 Feb 2026)
"Distributionally Robust Reinforcement Learning with Human Feedback" (Mandal et al., 1 Mar 2025)
"First-order Policy Optimization for Robust Markov Decision Process" (Li et al., 2022)