
Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning (2403.05385v5)

Published 8 Mar 2024 in cs.LG

Abstract: We propose training fitted Q-iteration with log-loss (FQI-log) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving small-cost bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
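The abstract contrasts fitted Q-iteration trained with log-loss (FQI-log) against the usual squared-loss variant. As a rough sketch only, and not the authors' implementation, the snippet below shows a single FQI regression step on a batch of offline transitions in which switching `loss_type` is the only difference between the two variants; the network architecture, batch layout, discount factor, and the sigmoid parameterization that keeps Q-values in [0, 1] are all illustrative assumptions.

```python
# Hedged sketch of one fitted Q-iteration (FQI) step with either squared loss
# or log-loss, following only the abstract's description. The network, batch
# layout, and hyperparameters below are illustrative assumptions.
import torch


def fqi_update(q_net, target_net, batch, optimizer, gamma=0.99, loss_type="log"):
    """One regression step of FQI on a batch of offline transitions.

    batch: dict of tensors
        's'  : (B, state_dim)  states
        'a'  : (B,)            actions (long)
        'c'  : (B,)            per-step costs, assumed scaled so targets lie in [0, 1]
        's2' : (B, state_dim)  next states
    Q-values are squashed to [0, 1] with a sigmoid so the log-loss is well defined.
    """
    with torch.no_grad():
        # Cost-minimization target: cost plus discounted minimum next-state value.
        next_q = torch.sigmoid(target_net(batch['s2'])).min(dim=1).values
        # Clamp is a safeguard for this sketch; the costs are assumed normalized.
        target = (batch['c'] + gamma * next_q).clamp(0.0, 1.0)

    logits = q_net(batch['s']).gather(1, batch['a'].unsqueeze(1)).squeeze(1)
    pred = torch.sigmoid(logits)

    if loss_type == "log":
        # Log-loss (cross-entropy with a soft target), the switch studied in the paper.
        loss = -(target * torch.log(pred + 1e-8)
                 + (1 - target) * torch.log(1 - pred + 1e-8)).mean()
    else:
        # Standard squared-loss FQI baseline.
        loss = ((pred - target) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the regression loss is the only thing that changes between the baseline and FQI-log, mirroring the abstract's point that switching the loss alone is what yields bounds that scale with the optimal policy's accumulated cost.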
