Closing the Gap between TD Learning and Supervised Learning -- A Generalisation Point of View (2401.11237v2)

Published 20 Jan 2024 in cs.LG

Abstract: Some reinforcement learning (RL) algorithms can stitch pieces of experience to solve a task never seen before during training. This oft-sought property is one of the few ways in which RL methods based on dynamic programming differ from RL methods based on supervised learning (SL). Yet, certain RL methods based on off-the-shelf SL algorithms achieve excellent results without an explicit mechanism for stitching; it remains unclear whether those methods forgo this important stitching property. This paper studies this question for the problems of achieving a target goal state and achieving a target return value. Our main result is to show that the stitching property corresponds to a form of combinatorial generalization: after training on a distribution of (state, goal) pairs, one would like to evaluate on (state, goal) pairs not seen together in the training data. Our analysis shows that this sort of generalization is different from i.i.d. generalization. This connection between stitching and generalization reveals why we should not expect SL-based RL methods to perform stitching, even in the limit of large datasets and models. Based on this analysis, we construct new datasets to explicitly test for this property, revealing that SL-based methods lack this stitching property and hence fail to perform combinatorial generalization. Nonetheless, the connection between stitching and combinatorial generalization also suggests a simple remedy for improving generalization in SL: data augmentation. We propose a temporal data augmentation and demonstrate that adding it to SL-based methods enables them to successfully complete tasks not seen together during training. At a high level, this connection illustrates the importance of combinatorial generalization for data efficiency in time-series data for tasks beyond RL, such as audio, video, or text.

Summary

  • The paper shows that SL-based RL methods lack an inherent mechanism for combinatorial generalization, which limits their ability to stitch experience.
  • Empirical results reveal that simply scaling data or models does not produce this stitching behavior.
  • Temporal data augmentation is introduced as an effective strategy to boost generalization in both state-based and image-based tasks.

Introduction

Some reinforcement learning (RL) algorithms can stitch together pieces of experience to tackle new problems, handling tasks never explicitly encountered during training; this is arguably a distinguishing feature when comparing RL to supervised learning (SL). RL algorithms based on dynamic programming have long exploited this stitching property, enabling strong data efficiency and off-policy reasoning. However, SL-based methods have blurred the lines: certain outcome-conditioned behavioral cloning (OCBC) methods achieve impressive results on benchmarks without any apparent mechanism for stitching. This paper critically examines the generalization capabilities of such SL-based RL algorithms in the contexts of reaching a target goal state and attaining a specified return value.
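To make the OCBC setup concrete: these methods reduce control to conditional imitation, training a policy by maximum likelihood on logged actions, conditioned on the current state and an outcome relabelled in hindsight from the same trajectory. A common formalization (notation assumed here, not taken verbatim from the paper) is

$$
\max_{\pi} \; \mathbb{E}_{\tau \sim \mathcal{D},\, t}\; \mathbb{E}_{g \sim p(\cdot \mid \tau,\, t)} \big[ \log \pi(a_t \mid s_t, g) \big],
$$

where the outcome $g$ is a future state of the same trajectory $\tau$ (goal-conditioned case) or the return-to-go from step $t$ (return-conditioned case). Because $g$ is always drawn from the trajectory that produced $(s_t, a_t)$, nothing in the objective couples information across trajectories, which is where the stitching question arises.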

Combinatorial Generalization and Stitching

Central to stitching is combinatorial generalization: an algorithm's ability to combine previously learned experiences to handle (state, goal) pairs that were never jointly observed during training. This is akin to a person navigating to a new location by combining knowledge of how to reach a familiar intermediate point, such as a taxi stand, with knowledge of how to travel from there to the final destination. Dynamic-programming-based RL methods exhibit this ability through their inherent structure. In contrast, the paper posits that OCBC methods' reliance on SL principles prevents them from performing combinatorial generalization by default. This deficiency is demonstrated analytically, casting doubt on the ability of SL-based algorithms to match dynamic-programming methods on tasks that require such combinatorial reasoning over temporal sequences.
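One way to state the distinction, with notation assumed here rather than taken verbatim from the paper: OCBC training draws (state, goal) pairs from a joint distribution whose support consists of pairs that co-occur on the same trajectory, whereas the stitching evaluation draws pairs whose combination lies outside that support,

$$
\operatorname{supp}\big(p_{\text{test}}(s, g)\big) \not\subseteq \operatorname{supp}\big(p_{\text{train}}(s, g)\big),
$$

even though each state and each goal may individually be well covered by the training data. Standard i.i.d. generalization assumes $p_{\text{test}} = p_{\text{train}}$, so good performance under this shift is not guaranteed by the usual arguments, no matter how large the dataset or model.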

Empirical Validation

The authors construct datasets that explicitly test combinatorial generalization, and the empirical results show that SL-based strategies such as Decision Transformers (DT) and RvS (RL via supervised learning) fail to exhibit the stitching property. Standard test suites such as D4RL were found inadequate for this purpose because they inadvertently include the evaluation (state, goal) pairs within the training distribution and thus do not require genuine stitching. Experiments in newly created environments confirm the theoretical prediction that merely scaling up the data volume or the model architecture will not endow SL-based methods with combinatorial generalization.
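The kind of split such datasets implement can be sketched in a few lines. The helper below is a hypothetical illustration, not the paper's released code: it enumerates which (start, goal) region combinations ever co-occur on a single training trajectory and holds out the remaining combinations for evaluation.

```python
import itertools

# Hypothetical sketch (function and variable names assumed, not the paper's code):
# build an evaluation set of (start_region, goal_region) combinations that never
# co-occur within any single training trajectory, so solving them requires
# stitching rather than i.i.d. generalization.

def seen_combinations(trajectories, region_of):
    """Region pairs that co-occur (in temporal order) within one trajectory."""
    seen = set()
    for traj in trajectories:
        regions = [region_of(s) for s in traj]
        for i, start in enumerate(regions):
            for goal in regions[i:]:
                seen.add((start, goal))
    return seen

def combinatorial_eval_pairs(trajectories, regions, region_of):
    """(start, goal) region pairs whose combination is absent from training."""
    seen = seen_combinations(trajectories, region_of)
    return sorted(set(itertools.product(regions, repeat=2)) - seen)

# Usage sketch: `region_of` could map a continuous state to a discrete cell
# (e.g. a maze room); evaluation then conditions the trained OCBC policy on
# start states and goals drawn from the held-out combinations.
```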

Temporal Data Augmentation

Given the lack of combinatorial generalization in OCBC algorithms, the authors propose a simple yet effective remedy: temporal data augmentation. The augmentation relabels goals across trajectories: when two trajectories pass through nearby states, a state from one can be paired with a goal reached later along the other, exposing the policy to (state, goal) combinations that never co-occur in the raw data. With this augmentation, OCBC algorithms learn to navigate between unseen (state, goal) pairs, and theoretical insights accompany empirical demonstrations that it substantially improves the generalization of SL-based approaches on both state-based and image-based tasks. A minimal sketch of the sampling procedure is given below.
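The sketch assumes states that can be clustered (here with scikit-learn's k-means); the names, the clustering choice, and the 50% augmentation probability are illustrative assumptions, not the paper's exact recipe.

```python
import random
import numpy as np
from sklearn.cluster import KMeans

# Illustrative sketch of temporal (stitching-style) goal relabelling for OCBC.
# Trajectories are dicts with "states" (array of shape (T+1, d)) and "actions"
# (length T). Clustering choice and probabilities are assumptions.

def fit_state_clusters(trajectories, n_clusters=50, seed=0):
    all_states = np.concatenate([np.asarray(t["states"]) for t in trajectories])
    return KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(all_states)

def sample_training_tuple(trajectories, kmeans, p_augment=0.5, rng=random):
    """Sample (state, action, goal); sometimes stitch the goal from another
    trajectory that passes through the same state cluster as the waypoint."""
    traj = rng.choice(trajectories)
    t = rng.randrange(len(traj["actions"]))
    state, action = traj["states"][t], traj["actions"][t]

    # Standard hindsight relabelling: goal is a future state of the same trajectory.
    w = rng.randrange(t, len(traj["states"]))
    goal = traj["states"][w]

    if rng.random() < p_augment:
        # Stitch: pick another trajectory that visits the waypoint's cluster and
        # relabel the goal with one of *its* future states.
        cluster = int(kmeans.predict(np.asarray(goal).reshape(1, -1))[0])
        other = rng.choice(trajectories)
        hits = np.flatnonzero(kmeans.predict(np.asarray(other["states"])) == cluster)
        if hits.size:
            j = int(rng.choice(list(hits)))
            goal = other["states"][rng.randrange(j, len(other["states"]))]
    return state, action, goal
```

In words: the relabelled goal is reachable from the sampled state via a shared intermediate region, so the resulting (state, goal) pair is one the policy would otherwise never see together.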

Conclusion and Future Work

This paper reframes the question of whether SL-based RL algorithms can stitch experiences and provides a practical way to equip them with this capability. Temporal data augmentation is a concrete step toward combinatorial generalization, but achieving it in SL-based algorithms without explicit augmentation remains an open and intriguing avenue for further research. The connection to combinatorial generalization also suggests potential gains in data efficiency for other time-series domains beyond RL, such as audio, video, and text.
