Offline-to-Online RL Benchmarks

Updated 6 May 2026

Offline-to-Online RL benchmarks are standardized evaluation protocols that combine offline pretraining on fixed datasets with online fine-tuning to assess policy robustness, sample efficiency, and safety.
Key benchmark suites such as D4RL, D5RL, ARB, and FailureBench offer diverse environments and metrics, enabling systematic comparisons under varied dynamics, failure conditions, and distribution shifts.
Methodological innovations like adaptive data mixing, unified on/off-policy loss functions, and safety-critical recovery strategies provide practical guidelines for robust real-world deployment.

Offline-to-Online Reinforcement Learning (O2O RL) benchmarks evaluate algorithms that leverage a fixed offline dataset for pretraining and subsequently improve or adapt policy performance via further online interactions with the environment. These benchmarks are central to unifying the empirical study of sample-efficient RL, realistic transfer, generalization under distributional shift, and deployment robustness. O2O RL benchmarks distinguish themselves from purely offline RL or online RL by their treatment of the interface, scheduling, and composition between static and dynamically acquired data, as well as by their relevance to critical applications such as real-world robotics, safety-aware learning, sim-to-real transfer, and resource-constrained deployment.

1. Benchmark Task Suites and Environments

O2O RL benchmarks utilize domains designed to test an algorithm’s ability to exploit both pre-collected data and live environment experience. Key benchmark suites include:

D4RL Benchmarks: Comprising MuJoCo locomotion (HalfCheetah, Hopper, Walker2d), Adroit manipulation (Pen, Hammer, Door, Relocate), AntMaze navigation, and multi-task Kitchen domains. Each provides distinct datasets: “medium,” “medium-replay,” and “medium-expert” for locomotion; “human” and “cloned” for Adroit; various maze configurations for AntMaze; and Kitchen variants with compositional subgoals. Offline data sources follow canonical protocols—SAC-trained or human policies for locomotion and manipulation, with dataset sizes ∼1 M transitions per task (Lei et al., 2023).
Real-World Hardware and Safety Extensions: Evaluation on real robot platforms (e.g., Unitree Go1 quadruped, Franka Kitchen arm) (Lei et al., 2023) or under resource-constrained deployment (RC-D4RL) introduces substantial distribution and observation shifts.
D5RL: Expands O2O coverage to A1 quadruped (state-based interpolation/extrapolation, hiking), Franka Kitchen (image-based, combinatorial and randomized variants), and WidowX sorting (multi-stage manipulation with play data and task-specific sub-datasets). Dataset statistics specify coverage, heterogeneity, and offline-to-online transfer regimes (Rafailov et al., 2024).
FailureBench: Evaluates learning under real-world-relevant failure constraints (MetaWorld Sawyer tasks in which hard failures trigger an intervention), and includes diverse failure conditions, scripted recovery demonstrations, and explicit human-reset counting (Li et al., 12 Jan 2026).
ODRL (Off-Dynamics RL): Targets dynamics-mismatch/transfer, systematically varying friction, gravity, kinematic, and morphological parameters (80 unique source–target transitions in locomotion, navigation, dexterity) with low-budget offline data and controlled online adaptation (Lyu et al., 2024).
Racing Game/Other Domains: OfflineMania (Unity-based TrackMania-like racing) supports image-based state, dense/sparse rewards, mixture datasets, and online fine-tuning (PPO/SAC/IQL hybrid baselines) (Macaluso et al., 2024).

2. Offline-to-Online Experimental Protocols

Standard O2O RL evaluation protocols involve the following stages and metrics:

Phases:

Offline pretraining: The algorithm is trained on $\mathcal{D}_{\mathrm{offline}}$, a set of pre-collected transitions with fixed coverage and variable quality.
Online fine-tuning: The agent is initialized from the offline solution, then interacts with the environment under a constrained budget (ranging from $10^5$ to $10^6$ steps), mixing newly acquired data into the buffer.
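The buffer mixing in the online phase can be illustrated with a minimal sketch. It assumes the simplest possible scheme, a fixed online-to-offline sampling ratio; the function name `sample_mixed_batch` and the parameter `online_fraction` are hypothetical and do not come from any of the cited benchmarks, which each define their own mixing rule.

```python
import random


def sample_mixed_batch(offline_buffer, online_buffer, batch_size, online_fraction, rng):
    """Draw a batch in which about `online_fraction` of the samples are online.

    Illustrative fixed-ratio scheme only; not the rule of any specific method.
    """
    if not 0.0 <= online_fraction <= 1.0:
        raise ValueError("online_fraction must lie in [0, 1]")
    if not offline_buffer:
        raise ValueError("offline_buffer must not be empty")
    # At the start of fine-tuning the online buffer is empty, so the whole
    # batch falls back to offline transitions.
    n_online = round(batch_size * online_fraction) if online_buffer else 0
    n_offline = batch_size - n_online
    batch = [rng.choice(online_buffer) for _ in range(n_online)]
    batch += [rng.choice(offline_buffer) for _ in range(n_offline)]
    rng.shuffle(batch)
    return batch
```

Passing a seeded `random.Random` instance as `rng` keeps the sampling reproducible across runs, which matters when results are compared over seeds.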

Evaluation Workflow (e.g., Uni-O4 (Lei et al., 2023)):
- Fit ensemble of BC clones on offline data.
- Learn value functions via expectile regression; fit dynamics models if model-based offline policy evaluation is used.
- Run multiple policy improvement updates, selecting the best via offline policy evaluation (OPE).
- Initialize online training from the best offline policy; continue RL updates with joint/segregated buffer sampling as per method.
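The expectile regression step in the workflow above can be sketched as follows. This assumes the common asymmetric squared loss, in which positive residuals are weighted by `tau` and negative residuals by `1 - tau`; the function name and the default `tau=0.7` are illustrative choices, not values taken from the cited work.

```python
def expectile_loss(targets, predictions, tau=0.7):
    """Mean asymmetric squared error used for expectile regression.

    Residuals are target minus prediction. With tau > 0.5, under-estimates
    are penalized more than over-estimates, so the fitted value moves toward
    an upper expectile of the targets. tau = 0.5 gives half the usual MSE.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if len(targets) != len(predictions) or not targets:
        raise ValueError("targets and predictions must be non-empty and equal length")
    total = 0.0
    for target, prediction in zip(targets, predictions):
        residual = target - prediction
        weight = tau if residual > 0 else 1.0 - tau
        total += weight * residual * residual
    return total / len(targets)
```

In a value-learning setting the targets would typically be Q-value estimates and the predictions the state-value estimates, but that pairing is an assumption of this sketch.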
Metrics:
- Average return: Mean undiscounted or normalized episodic return over $n$ evaluation trajectories and seeds.
- Sample efficiency: Performance versus environment steps during fine-tuning.
- Stability: Absence of performance drop at offline-to-online handshake; low variance across seeds.
- Asymptotic return: Final converged score within the allowed online budget.
- Safety/Faults: Explicit “number of intervention-requiring failures” in tasks like FailureBench (Li et al., 12 Jan 2026).
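A few of the metrics above can be computed as in the following sketch. The normalization follows the D4RL convention, which rescales returns so that a random policy scores 0 and an expert policy scores 100. The helper names, the `window` parameter, and the definition of the handshake drop (offline score minus the worst early online score, floored at zero) are assumptions made for illustration.

```python
from statistics import mean, stdev


def normalized_score(raw_return, random_return, expert_return):
    """D4RL-style score: 0 matches a random policy, 100 matches an expert."""
    span = expert_return - random_return
    if span == 0:
        raise ValueError("expert_return and random_return must differ")
    return 100.0 * (raw_return - random_return) / span


def handshake_drop(offline_score, online_curve, window=5):
    """Largest early loss relative to the offline score (0.0 if none).

    `online_curve` holds evaluation scores in order of environment steps;
    only the first `window` evaluations are inspected.
    """
    if not online_curve:
        raise ValueError("online_curve must not be empty")
    return max(0.0, offline_score - min(online_curve[:window]))


def summarize_seeds(final_scores):
    """Mean and sample standard deviation of final scores across seeds."""
    spread = stdev(final_scores) if len(final_scores) > 1 else 0.0
    return mean(final_scores), spread
```

Reporting the handshake drop next to the final score makes it harder for a strong asymptotic result to hide an unstable start to fine-tuning.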

3. Comparative Results and Algorithmic Insights

Empirical studies on canonical tasks systematically compare standard and novel approaches for offline-to-online adaptation:

Environment/Task	CQL	TD3+BC	IQL	BPPO	ATAC	Uni-O4
halfcheetah-medium	44.0	48.3	47.4	44.0	54.3	52.6
hopper-medium	58.5	59.3	66.3	93.9	102.8	104.4
walker2d-medium	72.5	83.7	78.3	83.6	91.0	90.2
Adroit (sum)	93.6	52.2	118.1	291.4	180.2	288.6
Kitchen (sum)	144.6	47.5	159.8	211.0	3.0	216.9
Overall	936.7	777.1	970.3	1253.4	1002.1	1322.0

Uni-O4 matches or outperforms all baselines in 14/20 D4RL tasks, with especially rapid, stable online adaptation and no performance drop at phase transition (Lei et al., 2023).

ARB (Song et al., 11 Dec 2025) demonstrates that adaptive, on-policy-aware replay buffer sampling yields up to 50% relative gains on low-quality data and consistent improvements across AntMaze, outperforming both static (fixed-mix) and learning-intensive (BERB) prioritization.

Failure-aware learning—FARL (Li et al., 12 Jan 2026)—incorporates world-model-based safety critics and recovery policies, reducing intervention failures by up to 73.1% and demonstrating robust task performance in real-world robot settings.

In dynamics-transfer benchmarks (ODRL (Lyu et al., 2024)), methods such as H2O, BC_VGDF, CQL_SAC, and mixed replay show complementary strengths depending on shift type, with no universal best: conservative-value baselines excel when source data is expert, but their advantage can disappear under strong or poorly covered shifts.

4. Methodological Innovations in O2O RL Evaluation

Recent O2O RL benchmarks and frameworks integrate the following key methodological advances:

Unified On-policy and Off-policy Objectives: Uni-O4 (Lei et al., 2023) employs a shared multi-step on-policy surrogate loss for both offline and online phases, enabling seamless transition and avoiding over-conservatism.
Behavioral Adaptivity: BAQ (Zu et al., 5 Nov 2025) introduces behavior-adaptive weighting via an implicit BC model, ensuring conservative updates under uncertainty and faster adaptation when the policy is closer to offline support.
Adaptive Data Mixing: ARB (Song et al., 11 Dec 2025) implements trajectory-level “on-policyness” weighting, dynamically prioritizing buffer sampling for early stability and later online performance.
Decoupled Exploration–Exploitation: OOO (Mark et al., 2023) explicitly separates exploration (by an optimistic policy with an intrinsic bonus) from exploitation (by a retrained policy that ignores the exploration bias), yielding up to 26% higher returns and state-of-the-art performance on sparse-reward tasks and incomplete datasets.
Dynamics-Aware and Meta-Adaptive Approaches: H2O⁺ (Niu et al., 2023) and MOORL (Chaudhary et al., 11 Jun 2025) blend offline and online Bellman updates with learned dynamics-discriminator weighting or meta-learning outer loops, providing stability and flexibility under domain shift, with low computational overhead.
Resource-Constrained Deployment: RC-D4RL (Regatti et al., 2021) benchmarks measure performance gaps and transfer learning capability when offline data contains privileged features not available online.
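Trajectory-level weighting of the kind described for adaptive data mixing can be sketched as below. This is not the ARB formulation: it assumes, purely for illustration, that each trajectory's "on-policyness" is summarized by the mean log-probability of its actions under the current policy, and that sampling probabilities come from a temperature-scaled softmax over those scores. All names are hypothetical.

```python
import math
import random


def trajectory_sampling_probs(mean_log_probs, temperature=1.0):
    """Softmax over per-trajectory on-policyness scores.

    Higher mean log-probability under the current policy gives a larger
    sampling probability; a larger temperature flattens the distribution.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not mean_log_probs:
        raise ValueError("mean_log_probs must not be empty")
    scaled = [lp / temperature for lp in mean_log_probs]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory_indices(mean_log_probs, k, temperature, rng):
    """Sample k trajectory indices (with replacement) by on-policyness."""
    probs = trajectory_sampling_probs(mean_log_probs, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=k)
```

Under these assumptions, a high temperature early in fine-tuning keeps sampling close to uniform over the buffer, and lowering it later shifts sampling toward trajectories that resemble the current policy.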

5. Practical Guidelines, Pitfalls, and Comparative Recommendations

Analysis of benchmark protocols and empirical results yields the following design recommendations:

Offline Dataset Diversity: Larger, more diverse replay-style datasets improve robustness of fine-tuning (e.g., Medium-Replay vs. Medium in D4RL; D5RL’s play and randomized settings stress generalization beyond narrow expert trajectories) (Rafailov et al., 2024).
Stability at Offline–Online Interface: Algorithms should report early learning curves, not just final return, to expose initial performance dip or “offline shock” (Uni-O4, ARB) (Lei et al., 2023, Song et al., 11 Dec 2025).
Replay Scheduling: Adaptive or trajectory-level prioritization (e.g., ARB) is preferable to fixed mixing or expensive meta-learned prioritization schemes (Song et al., 11 Dec 2025).
Dynamics Gap Robustness: Evaluate under controlled variations of transition dynamics (H2O⁺, ODRL) and report both adaptation speed and asymptotic scores (Niu et al., 2023, Lyu et al., 2024).
Safety Evaluation: Include evaluation of intervention-requiring failures as a primary metric, not just reward (FailureBench) (Li et al., 12 Jan 2026).
Resource Constraints: Report policy performance under both privileged (offline) and limited (online) features and use transfer–distillation baselines (RC-D4RL) (Regatti et al., 2021).
Hyperparameter and Runtime Reporting: Benchmarks increasingly highlight practical overhead—Uni-O4 runs up to 36x faster in wall-clock than ensemble-based Off2on (Lei et al., 2023); MOORL matches large-ensemble SOTA with far lower gradient and memory cost (Chaudhary et al., 11 Jun 2025).
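The resource-constraint recommendation above can be made concrete with a small sketch. It assumes observations are flat feature vectors and that the features available online are given as a list of indices; both helper names and the definition of the feature gap as a simple score difference are hypothetical, not taken from RC-D4RL.

```python
def restrict_observation(observation, online_indices):
    """Keep only the features that remain available at deployment time."""
    return [observation[i] for i in online_indices]


def feature_gap(privileged_score, limited_score):
    """Score lost when moving from privileged to limited features."""
    return privileged_score - limited_score
```

For example, `restrict_observation([0.1, 0.2, 0.3, 0.4], [0, 2])` keeps the first and third features, and the gap between the scores of the two resulting policies is reported alongside each score.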

6. Directions for Future Benchmark Development

Suggested extensions and paradigm shifts, grounded in recent empirical trends, include:

Increased realism and task diversity: Benchmarks such as D5RL and FailureBench emphasize image-based observation, multi-task composition, and safety-critical failure modes.
Vision and Representation Learning Bottleneck: Many O2O RL methods show minimal gain on image-based manipulation despite robust performance in state-based domains, indicating a need for integrated representation learning protocols (Rafailov et al., 2024).
Systematic stress-testing: Inclusion of incomplete and low-quality datasets to probe algorithmic adaptivity (OOO, RC-D4RL, ODRL) (Mark et al., 2023, Regatti et al., 2021, Lyu et al., 2024).
Multi-phase and adaptive workflows: New methods embrace not only offline→online, but cyclic/decoupled offline–online–offline paradigms, especially for robustly exploiting exploration data (OOO) (Mark et al., 2023).
Open-source, reproducible code: All modern benchmarks provide comprehensive codebases and single-file baselines (e.g., Uni-O4, ODRL), supporting head-to-head algorithmic comparison and rapid methodological extension (Lei et al., 2023, Lyu et al., 2024).

7. Representative Benchmark Comparison Table

Suite / Domain	Offline Data	Online Phase	Key Metrics	Notable Features
D4RL/Uni-O4	μ, μr, μe	1 M steps (sim)	Avg return, sample efficiency, stability	Ensemble BC, multi-step PPO, OPE
D5RL	play, demo	200–500 K steps	Normalized return, “stitching”	Visual, multi-stage, “hiking”
ARB (D4RL)	r, mr, m, me	1 M steps	Norm. return, online ratio	Traj.-level on-policyness
FailureBench	scripted, fail	10⁶ steps (real)	Failures, avg return	World-model safety, recovery
ODRL	exp, med, rnd	10⁵ steps	Norm. score, dyn. gap	Dynamics shift, domain classifier
RC-D4RL	privileged, reduced	—	Norm. return, feature gap	Resource-constrained policies
OfflineMania	mix, exp, med	1 M steps	Laps, reward, completion	Game, vision-based, “stitching”

Dataset codes: μ = m = med = medium, μr = mr = medium-replay, μe = me = medium-expert, r = rnd = random, exp = expert, mix = mixture.

In summary, offline-to-online RL benchmarks have evolved to rigorously test and accelerate algorithmic progress in sample-efficient transfer, stability under shift, safety-aware adaptation, and practical deployment across simulated and real-world settings. Key benchmark characteristics now include trajectory composition, adaptive data mixing, safety constraints, integration of vision and heterogeneous sensing, and reproducibility across a diverse suite of environments and evaluation protocols (Lei et al., 2023, Song et al., 11 Dec 2025, Mark et al., 2023, Rafailov et al., 2024, Li et al., 12 Jan 2026, Niu et al., 2023, Lyu et al., 2024, Regatti et al., 2021, Chaudhary et al., 11 Jun 2025, Zu et al., 5 Nov 2025, Macaluso et al., 2024).