An Invitation to Deep Reinforcement Learning (2312.08365v3)
Abstract: Training a deep neural network to maximize a target objective has become the standard recipe for successful machine learning over the last decade. These networks can be optimized with supervised learning if the target objective is differentiable. For many interesting problems, however, this is not the case. Common objectives such as intersection over union (IoU), the bilingual evaluation understudy (BLEU) score, or rewards cannot be optimized with supervised learning. A common workaround is to define differentiable surrogate losses, which leads to suboptimal solutions with respect to the actual objective. In recent years, reinforcement learning (RL) has emerged as a promising alternative for optimizing deep neural networks to maximize non-differentiable objectives. Examples include aligning LLMs via human feedback, code generation, object detection, and control problems. This makes RL techniques relevant to the larger machine learning audience. The subject is, however, time-intensive to approach because of the large range of methods and the often very theoretical presentation. In this introduction, we take an alternative approach, different from classic reinforcement learning textbooks. Rather than focusing on tabular problems, we introduce reinforcement learning as a generalization of supervised learning, which we first apply to non-differentiable objectives and later to temporal problems. Assuming only basic knowledge of supervised learning, the reader will be able to understand state-of-the-art deep RL algorithms like proximal policy optimization (PPO) after reading this tutorial.
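To make the central idea concrete, the following is a minimal sketch of optimizing a non-differentiable objective with a score-function (REINFORCE-style) gradient estimate. It is an illustration under assumptions, not the tutorial's own code: the 0/1 reward, the four-action categorical policy, the running-average baseline, and all names (`softmax`, `reward`, `train`, `final_probs`) are hypothetical choices made for this example.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(action, target=2):
    # Hypothetical non-differentiable objective: 1 for the correct
    # action, 0 otherwise (comparable to a 0/1 accuracy score).
    return 1.0 if action == target else 0.0


def train(num_actions=4, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * num_actions  # parameters of a categorical policy
    baseline = 0.0                # running average of observed rewards
    for _ in range(steps):
        probs = softmax(logits)
        # Sample an action instead of differentiating through the reward.
        action = rng.choices(range(num_actions), weights=probs)[0]
        r = reward(action)
        advantage = r - baseline
        # For a softmax policy, the gradient of log pi(action) with
        # respect to the logits is onehot(action) - probs.
        for k in range(num_actions):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob
        # Update the baseline after using it, so it does not depend on
        # the current sample.
        baseline += 0.05 * (r - baseline)
    return softmax(logits)


final_probs = train()
print(final_probs)
```

The reward function is only ever evaluated, never differentiated: the gradient flows through the log-probability of the sampled action, which is why this kind of estimator can be applied to objectives such as IoU or BLEU where a supervised loss cannot.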