A Natural Extension To Online Algorithms For Hybrid RL With Limited Coverage (2403.09701v2)

Published 7 Mar 2024 in cs.LG and stat.ML

Abstract: Hybrid Reinforcement Learning (RL), leveraging both online and offline data, has garnered recent interest, yet research on its provable benefits remains sparse. Additionally, many existing hybrid RL algorithms (Song et al., 2023; Nakamoto et al., 2023; Amortila et al., 2024) impose coverage assumptions on the offline dataset, but we show that this is unnecessary. A well-designed online algorithm should "fill in the gaps" in the offline dataset, exploring states and actions that the behavior policy did not explore. Unlike previous approaches that focus on estimating the offline data distribution to guide online exploration (Li et al., 2023b), we show that a natural extension to standard optimistic online algorithms -- warm-starting them by including the offline dataset in the experience replay buffer -- achieves similar provable gains from hybrid data even when the offline dataset does not have single-policy concentrability. We accomplish this by partitioning the state-action space into two, bounding the regret on each partition through an offline and an online complexity measure, and showing that the regret of this hybrid RL algorithm can be characterized by the best partition -- despite the algorithm not knowing the partition itself. As an example, we propose DISC-GOLF, a modification of an existing optimistic online algorithm with general function approximation called GOLF used in Jin et al. (2021); Xie et al. (2022a), and show that it demonstrates provable gains over both online-only and offline-only reinforcement learning, with competitive bounds when specialized to the tabular, linear and block MDP cases. Numerical simulations further validate our theory that hybrid data facilitates more efficient exploration, supporting the potential of hybrid RL in various scenarios.
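
The key mechanism described in the abstract, warm-starting an optimistic online algorithm by placing the offline dataset in its experience replay buffer, can be sketched in a few lines. The following is a minimal illustrative sketch, not the paper's DISC-GOLF (which uses general function approximation and GOLF-style confidence sets): the toy chain MDP, tabular Q-values, optimistic initialization, and count-based bonus are all assumptions made purely for illustration.

from collections import defaultdict

N_STATES, N_ACTIONS = 5, 2

def toy_env_step(s, a):
    """Toy chain MDP (illustration only): action 1 moves right, action 0 resets to state 0."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else 0
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return reward, s_next

def warm_start_buffer(offline_transitions):
    """Seed the replay buffer with the offline (s, a, r, s') tuples; online data is appended later."""
    return list(offline_transitions)

def optimistic_sweep(Q, counts, buffer, gamma=0.9, lr=0.1, bonus_scale=1.0):
    """One Q-learning sweep over the hybrid buffer with a count-based exploration bonus,
    standing in for the optimism a GOLF-style algorithm obtains from confidence sets."""
    for (s, a, r, s_next) in buffer:
        counts[(s, a)] += 1
        bonus = bonus_scale / counts[(s, a)] ** 0.5             # shrinks as (s, a) is seen more often
        target = r + bonus + gamma * max(Q[(s_next, b)] for b in range(N_ACTIONS))
        Q[(s, a)] += lr * (target - Q[(s, a)])

def run_hybrid(offline_transitions, episodes=30, horizon=10):
    """Hybrid loop: offline data covers part of the state-action space, and optimistic
    value estimates steer online exploration toward the part the behavior policy missed."""
    Q = defaultdict(lambda: 5.0)                                # optimistic initialization for unseen (s, a)
    counts = defaultdict(int)
    buffer = warm_start_buffer(offline_transitions)             # the warm start: offline data enters the buffer
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            a = max(range(N_ACTIONS), key=lambda b: Q[(s, b)])  # act greedily w.r.t. optimistic Q
            r, s_next = toy_env_step(s, a)
            buffer.append((s, a, r, s_next))                    # online data joins the same buffer
            s = s_next
        optimistic_sweep(Q, counts, buffer)
    return Q

# Offline data from a behavior policy that only took action 0 in the first two states,
# i.e. a dataset without single-policy concentrability for the optimal (always move right) policy.
offline_data = [(s, 0, 0.0, 0) for s in range(2) for _ in range(20)]
Q_hybrid = run_hybrid(offline_data)

In this sketch the offline transitions quickly shrink the bonuses on the state-action pairs the behavior policy covered, so online exploration concentrates on the uncovered part of the space, mirroring the partition-based argument summarized in the abstract.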

References (35)
  1. Harnessing density ratios for online reinforcement learning.
  2. Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422.
  3. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21.
  4. Minimax regret bounds for reinforcement learning.
  5. Robust fitted-Q-evaluation and iteration under sequentially exogenous unobserved confounders.
  6. Adversarially trained actor critic for offline reinforcement learning.
  7. Better exploration with optimistic actor-critic.
  8. pymdptoolbox. https://github.com/sawcordwell/pymdptoolbox.
  9. On oracle-efficient PAC RL with rich observations.
  10. Provably efficient RL with rich observations via latent state decoding.
  11. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms.
  12. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR.
  13. Offline policy evaluation and optimization under confounding.
  14. Conservative Q-learning for offline reinforcement learning.
  15. Is Q-learning minimax optimal? A tight sample complexity analysis.
  16. Reward-agnostic fine-tuning: Provable statistical benefits of hybrid reinforcement learning. arXiv preprint arXiv:2305.10282.
  17. Provably good batch reinforcement learning without great exploration.
  18. Pessimism in the face of confounders: Provably efficient offline reinforcement learning in partially observable Markov decision processes.
  19. Kinematic state abstraction and provably efficient rich-observation reinforcement learning.
  20. Tactical optimism and pessimism for deep reinforcement learning.
  21. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning.
  22. Toward the fundamental limits of imitation learning.
  23. Bridging offline reinforcement learning and imitation learning: A tale of pessimism.
  24. Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity.
  25. Hybrid RL: Using both offline and online data can make RL efficient.
  26. Pessimistic model-based offline reinforcement learning under partial coverage.
  27. Leveraging offline data in online reinforcement learning.
  28. Provably efficient causal reinforcement learning with confounded observational data.
  29. Bellman-consistent pessimism for offline reinforcement learning. Advances in Neural Information Processing Systems, 34:6683–6694.
  30. The role of coverage in online reinforcement learning. arXiv preprint arXiv:2210.04157.
  31. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning.
  32. Zanette, A. (2023). When is realizability sufficient for off-policy reinforcement learning?
  33. Learning near optimal policies with low inherent Bellman error.
  34. Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory, pages 2730–2775. PMLR.
  35. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes.