Closing the Gap between TD Learning and Supervised Learning -- A Generalisation Point of View (2401.11237v2)
Abstract: Some reinforcement learning (RL) algorithms can stitch pieces of experience together to solve a task never seen during training. This oft-sought property is one of the few ways in which RL methods based on dynamic programming differ from RL methods based on supervised learning (SL). Yet, certain RL methods based on off-the-shelf SL algorithms achieve excellent results without an explicit mechanism for stitching; it remains unclear whether those methods forgo this important property. This paper studies this question for the problems of achieving a target goal state and achieving a target return value. Our main result is to show that the stitching property corresponds to a form of combinatorial generalisation: after training on a distribution of (state, goal) pairs, one would like to evaluate on (state, goal) pairs not seen together in the training data. Our analysis shows that this sort of generalisation is different from i.i.d. generalisation. This connection between stitching and generalisation reveals why we should not expect SL-based RL methods to perform stitching, even in the limit of large datasets and models. Based on this analysis, we construct new datasets to explicitly test for this property, revealing that SL-based methods lack the stitching property and hence fail to perform combinatorial generalisation. Nonetheless, the connection between stitching and combinatorial generalisation also suggests a simple remedy for improving generalisation in SL: data augmentation. We propose a temporal data augmentation and demonstrate that adding it to SL-based methods enables them to successfully complete tasks not seen together during training. At a high level, this connection illustrates the importance of combinatorial generalisation for data efficiency in time-series domains beyond RL, such as audio, video, and text.
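To make the stitching-as-augmentation idea concrete, below is a minimal, illustrative sketch rather than the paper's exact algorithm. It assumes the augmentation relabels goals with states drawn from other trajectories that pass within some radius of the current trajectory, so that (state, goal) pairs never observed together become valid training pairs. The function name, the radius threshold, and the trajectory format are hypothetical choices for this sketch.

```python
import numpy as np

def temporal_goal_augmentation(trajectories, radius=0.5, rng=None):
    """Illustrative sketch of a temporal data augmentation for goal-conditioned SL.

    Hedged assumption: stitching is encouraged by relabeling goals with states
    sampled from *other* trajectories that come close to the current one, so
    that (state, goal) combinations absent from the raw data appear in training.

    trajectories: list of np.ndarray, each of shape (T_i, state_dim).
    radius: distance threshold deciding when two trajectories "intersect".
    Returns a list of (state, goal) training pairs.
    """
    rng = rng or np.random.default_rng(0)
    pairs = []
    for i, traj_a in enumerate(trajectories):
        for j, traj_b in enumerate(trajectories):
            if i == j:
                continue
            # Pairwise distances between every state of traj_a and traj_b.
            d = np.linalg.norm(traj_a[:, None, :] - traj_b[None, :, :], axis=-1)
            a_idx, b_idx = np.nonzero(d < radius)
            for t_a, t_b in zip(a_idx, b_idx):
                # Pair a state of traj_a before the "intersection" with a goal
                # (future state) of traj_b after it, mimicking a stitched
                # trajectory: traj_a -> intersection -> traj_b.
                if t_a > 0 and t_b + 1 < len(traj_b):
                    s = traj_a[rng.integers(0, t_a)]
                    g = traj_b[rng.integers(t_b + 1, len(traj_b))]
                    pairs.append((s, g))
    return pairs
```

The key design point this sketch tries to capture is that the augmentation is temporal: goals are relabeled across trajectories only where those trajectories plausibly connect, which is what allows an SL-based method to behave as if it had seen the stitched (state, goal) pair during training.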