Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning
Abstract: We propose training fitted Q-iteration with log-loss (FQI-log) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving small-cost bounds, i.e., bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-log uses fewer samples than FQI trained with the squared loss on problems where the optimal policy reliably achieves the goal.
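The loss switch described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the paper's: it uses a hypothetical two-state goal MDP, a tabular Q-function squashed through a sigmoid so predictions stay in (0, 1), and plain gradient descent on the log-loss; the function names, hyperparameters, and cost scaling (costs multiplied by 1 - gamma so discounted cost-to-go lies in [0, 1]) are all assumptions made for the sketch. The only change relative to standard FQI is the inner regression loss: log-loss y·log(1/p) + (1-y)·log(1/(1-p)) in place of (p - y)².

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fqi_log(dataset, n_states, n_actions, gamma=0.9, outer=100, inner=500, lr=2.0):
    """Sketch of fitted Q-iteration with log-loss (FQI-log).

    dataset: list of (state, action, cost, next_state) tuples, with costs
    scaled so the discounted cost-to-go lies in [0, 1] (an assumption of
    this sketch, needed for the log-loss to be well defined).
    """
    s, a, c, s2 = (np.array(x) for x in zip(*dataset))
    theta = np.zeros((n_states, n_actions))  # logits; Q = sigmoid(theta)
    for _ in range(outer):
        q = sigmoid(theta)
        # One Bellman backup on costs: regression target for Q(s, a),
        # clipped into [0, 1] so the log-loss below is well defined.
        y = np.clip(c + gamma * q[s2].min(axis=1), 0.0, 1.0)
        for _ in range(inner):
            # Gradient of the log-loss  y*log(1/p) + (1-y)*log(1/(1-p))
            # with respect to the logit is simply (p - y).
            p = sigmoid(theta[s, a])
            g = np.zeros_like(theta)
            np.add.at(g, (s, a), p - y)  # unbuffered accumulation per (s, a)
            theta -= lr * g / len(s)
    return sigmoid(theta)

# Hypothetical goal MDP: state 1 is an absorbing zero-cost goal; in state 0,
# action 0 reaches the goal for free, action 1 loops back at cost (1 - 0.9).
data = [(0, 0, 0.0, 1), (0, 1, 0.1, 0), (1, 0, 0.0, 1), (1, 1, 0.0, 1)]
q = fqi_log(data, n_states=2, n_actions=2)
```

Because the goal is reachable at zero cost, the learned Q should prefer action 0 in state 0 (q[0, 0] < q[0, 1]) and assign near-zero cost-to-go at the goal state, which is exactly the regime where the small-cost bound is vacuously cheap.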