Failure-Aware Offline-to-Online RL (FARL)
- FARL is a reinforcement learning framework that combines offline policy pre-training with online adaptation to manage failure risks.
- It employs adversarial fine-tuning, safety critics, and recovery policies, augmented by human-in-the-loop feedback, to maintain task performance under perturbations.
- Empirical evaluations in robotics and continuous-control tasks report fewer failures and higher returns, for example 73.1% fewer failures and 11.3% higher returns for the world-model and recovery variant (Li et al., 12 Jan 2026).
Failure-Aware Offline-to-Online Reinforcement Learning (FARL) refers to a family of reinforcement learning (RL) methodologies explicitly designed to bridge the gap between sample-efficient offline training and robust, safe adaptation in online environments prone to hazardous or intervention-requiring failures. Motivated by challenges in deploying RL for safety-critical robotics and manipulation tasks, FARL incorporates algorithmic strategies that proactively mitigate the risk of performance degradation under distributional shift, actuator faults, or catastrophic exploration, while maintaining or improving task performance during policy fine-tuning.
1. Formal Problem Definition and Core Challenges
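FARL is typically formalized as a Constrained Markov Decision Process (CMDP)
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, c, \gamma, \gamma_{\text{risk}}, \epsilon),$$
where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces; $P$ is the transition kernel; $r$ is the reward function; $c$ is an intervention-requiring failure indicator; $\gamma$ and $\gamma_{\text{risk}}$ control temporal discounting for reward and failure risk, respectively; and $\epsilon$ specifies a tolerated upper bound on the discounted multi-step failure probability starting from the initial state. The central objective is the constrained maximization
$$\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right] \quad \text{s.t.} \quad C_H^{\pi}(s_0) \le \epsilon,$$
where $C_H^{\pi}(s_0)$ quantifies the $H$-step, risk-discounted failure probability. A common alternative is a Lagrangian penalty formulation:
$$\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right] - \lambda\, C_H^{\pi}(s_0),$$
where $\lambda \ge 0$ controls the reward-safety trade-off (Li et al., 12 Jan 2026).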
This problem setting captures the dual imperative of high task performance and active risk minimization during online adaptation after offline policy optimization, especially in scenarios where failure states require human intervention (e.g., robot damaging fragile objects, losing workspace control).
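To make the constrained objective and its Lagrangian relaxation concrete, the following minimal sketch evaluates both on a single logged trajectory. It is an illustration under stated assumptions, not the authors' implementation: the function names, the single-trajectory (Monte Carlo) estimate in place of the expectation, and the truncation of the failure sum at a horizon `horizon` are all assumptions made here.

```python
from typing import Sequence


def discounted_sum(values: Sequence[float], discount: float) -> float:
    """Return sum over t of discount**t * values[t]."""
    total = 0.0
    for t, value in enumerate(values):
        total += (discount ** t) * value
    return total


def h_step_risk(failures: Sequence[float], gamma_risk: float, horizon: int) -> float:
    """Risk-discounted failure signal over the first `horizon` steps.

    `failures` holds the failure indicator (0 or 1) per step of one trajectory,
    so this is a one-sample estimate of the H-step risk (assumption).
    """
    return discounted_sum(failures[:horizon], gamma_risk)


def satisfies_constraint(failures: Sequence[float], gamma_risk: float,
                         horizon: int, epsilon: float) -> bool:
    """Check the CMDP constraint: estimated risk must not exceed epsilon."""
    return h_step_risk(failures, gamma_risk, horizon) <= epsilon


def lagrangian_objective(rewards: Sequence[float], failures: Sequence[float],
                         gamma: float, gamma_risk: float,
                         horizon: int, lam: float) -> float:
    """Discounted return minus lam times the estimated H-step risk."""
    ret = discounted_sum(rewards, gamma)
    return ret - lam * h_step_risk(failures, gamma_risk, horizon)
```

In practice the expectation would be approximated by averaging these quantities over many trajectories; a larger `lam` trades return for lower estimated risk.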
2. Algorithmic Frameworks in FARL
Approaches to FARL instantiate the above principle via hybrid offline-to-online RL workflows, combining conservative policy pre-training, online fine-tuning under explicit perturbations, and explicit safety modules. Representative methodologies include:
A. Adversarial Fine-Tuning with Perturbation Injection
Offline pre-training is conducted using a conservative actor-critic algorithm (such as TD3+BC) to maximize the expected $Q$-value with a behavioral cloning penalty that constrains the policy within the distribution of the static dataset $\mathcal{D}$. Online fine-tuning then injects a perturbation $\delta$ into executed actions, where $\delta$ is set to $0$ (normal condition) or sampled from a random uniform or adversarial (e.g., Differential Evolution-generated) distribution, and is injected with probability $q$. Fine-tuning updates are performed via standard actor-critic (TD3) procedures with no behavior cloning regularization. The perturbation probability $q$ can be adapted via curriculum strategies (linear or performance-aware), balancing robustness to perturbations and nominal control performance (Ayabe et al., 15 Oct 2025).
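The perturbation-injection step and a linear curriculum can be sketched as follows. This is a hedged illustration only: the additive form of the perturbation, the uniform noise range `scale`, the action bounds, and the function names are assumptions introduced here, and the adversarial (Differential Evolution) case is not covered.

```python
import random
from typing import List, Sequence


def linear_perturbation_probability(step: int, total_steps: int, q_max: float) -> float:
    """Linearly ramp the injection probability from 0 up to q_max (assumed schedule)."""
    if total_steps <= 0:
        return q_max
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return q_max * fraction


def inject_perturbation(action: Sequence[float], q: float, scale: float,
                        rng: random.Random,
                        low: float = -1.0, high: float = 1.0) -> List[float]:
    """With probability q, add uniform noise to each action dimension and clip.

    Otherwise the action is returned unchanged (the normal condition, delta = 0).
    The additive form a + delta is an assumption of this sketch.
    """
    if rng.random() >= q:
        return list(action)
    perturbed = []
    for a in action:
        delta = rng.uniform(-scale, scale)
        perturbed.append(min(max(a + delta, low), high))
    return perturbed
```

A fine-tuning loop would call `linear_perturbation_probability` once per episode and pass the result as `q` when executing each action, while storing the resulting transitions for the TD3 update.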
B. Safety Critic and Recovery Policy Integration
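A world-model-based critic is trained to jointly predict state dynamics, reward, value, and, crucially, failure signals in latent space. This critic allows online risk estimation: $\hat{C}_H^{\pi}(s_t) = \sum_{k=0}^{H-1} \gamma_{\text{risk}}^{k}\, \hat{c}_{t+k}$, where $\hat{c}_{t+k}$ is the failure signal predicted at step $k$ of a latent rollout from $s_t$ and the rollout actions are selected by the current policy $\pi$.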
A recovery policy is trained offline from demonstrations in near-failure scenarios via behavior cloning and Uni-O4 fine-tuning (a PPO-style method), and is invoked online when the predicted failure probability $\hat{C}_H^{\pi}(s_t)$ exceeds the threshold $\epsilon$, temporarily overriding the primary task policy:
- If $\hat{C}_H^{\pi}(s_t) > \epsilon$, execute recovery.
- Else, execute nominal task policy.
The overall online fine-tuning protocol interleaves task and recovery actions, updating the task policy via PPO on transitions not involving failure (Li et al., 12 Jan 2026).
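The gating rule between task and recovery policies can be sketched as below. This is a minimal illustration, not the published system: `rollout_failures` is a hypothetical stand-in for the learned world model (it returns predicted per-step failure probabilities for a state), and the policies are plain callables rather than trained networks.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = List[float]
Policy = Callable[[State], Action]


def predicted_risk(predicted_failures: Sequence[float], gamma_risk: float) -> float:
    """Risk-discounted sum of predicted failure signals along an imagined rollout."""
    return sum((gamma_risk ** k) * c for k, c in enumerate(predicted_failures))


def select_action(state: State,
                  task_policy: Policy,
                  recovery_policy: Policy,
                  rollout_failures: Callable[[State], Sequence[float]],
                  gamma_risk: float,
                  epsilon: float) -> Tuple[Action, bool]:
    """Return (action, used_recovery).

    The recovery policy overrides the task policy only when the predicted
    risk strictly exceeds the threshold epsilon.
    """
    risk = predicted_risk(rollout_failures(state), gamma_risk)
    if risk > epsilon:
        return recovery_policy(state), True
    return task_policy(state), False
```

The returned flag lets the training loop separate recovery steps from task steps, so that the task policy is updated on task transitions only.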
C. Human-in-the-Loop Model Selection and Fine-Tuning
An alternative pragmatic framework leverages supervised human feedback as a safety filter. During online execution, if the current policy's action deviates significantly from a human expert policy (in norm or discrete mismatch), the expert's action is executed and logged. Fine-tuning is then performed on this override dataset (using Bellman error for critics, and a sum of $Q$-value and imitation losses for actors). Model selection may also use a UCB (Upper Confidence Bound) multi-armed bandit approach for selecting among pre-trained offline RL models, balancing performance and safety metrics (Li et al., 2023).
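The override filter and the bandit-based model selection can be sketched as follows. This is an assumption-laden illustration: it uses the standard UCB1 index and an L2 distance for continuous actions, and the function names, the threshold, and the scalar score fed to the bandit are all hypothetical choices rather than details confirmed by the source.

```python
import math
from typing import List, Sequence, Tuple


def ucb1_select(counts: Sequence[int], mean_scores: Sequence[float],
                c: float = 2.0) -> int:
    """Pick the index of the pre-trained model to run next (standard UCB1)."""
    # Run every model once before applying the confidence bonus.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    best_index, best_value = 0, -math.inf
    for i, (n, mean) in enumerate(zip(counts, mean_scores)):
        value = mean + math.sqrt(c * math.log(total) / n)
        if value > best_value:
            best_index, best_value = i, value
    return best_index


def update_mean(count: int, mean: float, score: float) -> Tuple[int, float]:
    """Incrementally update a model's episode count and mean score."""
    new_count = count + 1
    return new_count, mean + (score - mean) / new_count


def filter_action(policy_action: Sequence[float], expert_action: Sequence[float],
                  threshold: float) -> Tuple[List[float], bool]:
    """Return (executed_action, overridden) using an L2 disagreement test."""
    distance = math.sqrt(sum((p - e) ** 2
                             for p, e in zip(policy_action, expert_action)))
    if distance > threshold:
        return list(expert_action), True
    return list(policy_action), False
```

Each overridden step would be logged with the expert action so that it can later be used for fine-tuning; the score passed to `update_mean` could combine environment reward with a disagreement penalty.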
3. Theoretical Properties and Trade-Offs
Formal convergence guarantees are rarely established due to the complexity of off-distribution adaptation and the introduction of safety interventions or adversarial perturbations. However:
- In adversarial fine-tuning, empirical stability is inherited from the underlying TD3 updates. While conservative offline RL ensures nominal stability, it lacks robustness to new perturbations seen during online deployment. Adversarial fine-tuning restores robustness by explicitly exposing the policy to out-of-distribution states, but can degrade nominal task performance if the perturbation probability is not moderated. Adaptive curricula tie the perturbation probability to a smoothed performance signal, balancing these effects (Ayabe et al., 15 Oct 2025, Table 2, Figure 1).
- Risk-constrained optimization in CMDPs as in (Li et al., 12 Jan 2026) is realized through learned world models and recovery policies, but again lacks analytic bounds; intervention is determined empirically by success in predicting and averting failures.
- Human-feedback frameworks guarantee, by design, that no catastrophic action is executed without override, as all high-disagreement actions are replaced by expert actions (Li et al., 2023).
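One way a performance-aware curriculum could couple the perturbation probability to a smoothed performance signal is sketched below. The update rule is hypothetical and is not taken from the cited paper: the exponential moving average, the fixed step size, and the `target_return` threshold are assumptions chosen for illustration.

```python
from typing import Optional


class AdaptiveCurriculum:
    """Raise the perturbation probability while smoothed returns stay high.

    Hypothetical rule: if the exponentially smoothed episode return is at or
    above `target_return`, increase the probability by `step`; otherwise
    decrease it by `step`. The result is kept within [0, q_max].
    """

    def __init__(self, target_return: float, q_init: float = 0.0,
                 q_max: float = 1.0, step: float = 0.05,
                 smoothing: float = 0.9) -> None:
        self.target_return = target_return
        self.q = q_init
        self.q_max = q_max
        self.step = step
        self.smoothing = smoothing
        self.smoothed: Optional[float] = None

    def update(self, episode_return: float) -> float:
        """Fold in one episode return and return the new probability."""
        if self.smoothed is None:
            self.smoothed = episode_return
        else:
            self.smoothed = (self.smoothing * self.smoothed
                             + (1.0 - self.smoothing) * episode_return)
        if self.smoothed >= self.target_return:
            self.q = min(self.q + self.step, self.q_max)
        else:
            self.q = max(self.q - self.step, 0.0)
        return self.q
```

Compared with a linear schedule, a rule of this kind backs off when nominal performance drops, at the cost of extra hyperparameters (`target_return`, `step`, `smoothing`).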
4. Empirical Results and Benchmarks
Empirical validation spans simulated continuous-control and real-world robotics environments:
| Method | Failure Reduction | Task Return Gain | Notable Evaluation Domains |
|---|---|---|---|
| Adversarial fine-tuning FARL | Robustness ↑ | Convergence ↑ | Hopper-v2, Ant-v2, HalfCheetah-v2 (Ayabe et al., 15 Oct 2025) |
| World-model+Recovery FARL | –73.1% failures | +11.3% returns | MetaWorld FailureBench, Franka Panda (Li et al., 12 Jan 2026) |
| Human-feedback selection/fine-tune | Disagreement ↓ | Online-score ↑ | MuJoCo locomotion, CityFlow (Li et al., 2023) |
- Adversarial fine-tuning achieves normalized episodic rewards of $55$–$91$ in adversarial conditions (compared with offline-only baselines and with $25$–$60$ for online-from-scratch baselines), with adaptive curricula maintaining nominal rewards while achieving comparable adversarial robustness (Ayabe et al., 15 Oct 2025).
- Integrating a learned recovery policy with a world-model safety critic outperforms both policy-gradient safe RL (e.g., PPO-Lagrangian, P3O, CPO) and ablated baselines (e.g., Q-based Recovery RL, MPPI planning), with up to 73.1% failure reduction and 11.3% return increase in real-robot manipulation (Li et al., 12 Jan 2026).
- Human-in-the-loop approaches rapidly identify the best offline-trained policies over the course of online episodes and significantly reduce human disagreement rates (Li et al., 2023).
5. Benchmarking Scenarios and Metrics
Failure-aware RL algorithms are benchmarked using environments and metrics that explicitly measure intervention-requiring failures and task returns:
- FailureBench: Augmented MetaWorld tasks featuring representative failures (workspace bounds, breakable objects, obstructed pushes). Metrics include the count of failure episodes, the average return over the evaluation steps, and generalization to held-out disturbances.
- Locomotion and manipulation: OpenAI Gym (Hopper, Ant, HalfCheetah) and real-robot tasks (Franka Panda). Rewards normalized to D4RL standard; failures measured as required human interventions or actuator fault tolerance (Ayabe et al., 15 Oct 2025, Li et al., 12 Jan 2026).
- Online-score: Weighted sum of environment reward and disagreement penalty with expert/human supervisor; supports quantitative comparison of safety-aware adaptation (Li et al., 2023).
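The two score computations mentioned above can be sketched as follows. The D4RL-style normalization follows the usual convention of mapping a random policy to 0 and an expert policy to 100; the online-score weights and the sign convention (disagreement subtracted as a penalty) are assumptions of this sketch, as are the function names.

```python
def d4rl_normalized_score(raw_return: float, random_return: float,
                          expert_return: float) -> float:
    """Map a raw return to the 0 (random) to 100 (expert) scale."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def online_score(env_reward: float, disagreement: float,
                 reward_weight: float = 1.0,
                 disagreement_weight: float = 1.0) -> float:
    """Weighted environment reward minus a weighted disagreement penalty."""
    return reward_weight * env_reward - disagreement_weight * disagreement
```

For example, a policy with a raw return halfway between the random and expert references receives a normalized score of 50, and increasing `disagreement_weight` makes the online-score favor policies that the supervisor overrides less often.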
6. Limitations and Future Directions
FARL approaches are subject to various domain and implementation limitations:
- Recovery protocols currently rely on 2D vision and color filtering for perception and are limited to single-arm manipulation, lacking tactile or multi-modal sensors (Li et al., 12 Jan 2026).
- Curriculum and intervention frequencies require environment-specific tuning; linear schedules can over-expose policies to perturbations, while adaptive curricula add hyperparameter complexity.
- Generalization to more complex robots (e.g., mobile base, dual-arm) or unobserved failure modalities remains future work, as does scalable pre-training of world models across diverse tasks and input modalities (Li et al., 12 Jan 2026).
- Human-in-the-loop strategies require access to expert policies and careful selection of override thresholds for practical deployment (Li et al., 2023).
Key directions include integration of richer sensory modalities (depth, tactile, force), extension to multi-arm and mobile platforms, cross-task generalization of learned safety models, and further formal study of performance-safety trade-offs in high-risk on-policy adaptation.