Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories (2210.06518v3)
Abstract: Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing state, action, and reward triplets at every timestep, and unlabelled trajectories that contain only state and reward information. For this setting, we develop and study a simple meta-algorithmic pipeline that learns an inverse dynamics model on the labelled data to obtain proxy labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful: on several D4RL benchmarks (Fu et al., 2020), certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10% of trajectories, and those labelled trajectories are highly suboptimal. To strengthen our understanding, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets with algorithmic design choices (e.g., choice of inverse dynamics model, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.
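As a concrete illustration of the pipeline the abstract describes, here is a minimal sketch in PyTorch: an inverse dynamics model is trained on labelled (state, action, next-state) transitions, then used to annotate action-free trajectories with proxy actions. This assumes continuous actions and in-memory tensors; the names `InverseDynamicsModel`, `train_idm`, and `proxy_label` are illustrative, not the paper's released code.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Predicts the action taken between consecutive states (s_t, s_{t+1})."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def train_idm(idm, labelled_batches, epochs=100, lr=3e-4):
    """Supervised regression of actions from labelled (s, a, s') transition batches."""
    opt = torch.optim.Adam(idm.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a, s_next in labelled_batches:
            loss = nn.functional.mse_loss(idm(s, s_next), a)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return idm

@torch.no_grad()
def proxy_label(idm, states, rewards):
    """Annotate an action-free (states, rewards) trajectory with predicted actions."""
    actions = idm(states[:-1], states[1:])  # one predicted action per transition
    return states, actions, rewards
```

The truly labelled and proxy-labelled trajectories are then pooled and handed to any off-the-shelf offline RL algorithm (e.g., TD3+BC, CQL, IQL, or a Decision Transformer, all cited below), which is what makes the pipeline meta-algorithmic.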
- Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34:29304–29320, 2021.
- A framework for behavioural cloning. In Machine Intelligence 15, pp. 103–129, 1995.
- Video PreTraining (VPT): Learning to act by watching unlabeled online videos, 2022. URL https://arxiv.org/abs/2206.11795.
- Bellman, R. A Markovian decision process. Indiana Univ. Math. J., 1957.
- Humanoid robot learning and game playing using PC-based vision. In IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 3, pp. 2449–2454. IEEE, 2002.
- Large-scale study of curiosity-driven learning. In ICLR, 2019.
- Offline reinforcement learning at multiple frequencies. arXiv preprint arXiv:2207.13082, 2022.
- Semi-supervised learning. Cambridge, Massachusetts: The MIT Press, 2006.
- Decision transformer: Reinforcement learning via sequence modeling. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=a7APmM4B9d.
- RvS: What is essential for offline RL via supervised learning? arXiv preprint arXiv:2112.10751, 2021.
- Fralick, S. Learning to recognize patterns without a teacher. IEEE Transactions on Information Theory, 13(1):57–64, 1967.
- D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
- A minimalist approach to offline reinforcement learning. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Q32U7dzWXpc.
- Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. PMLR, 2019.
- EMaQ: Expected-max Q-learning operator for simple yet effective offline and online RL. In International Conference on Machine Learning, pp. 3682–3691. PMLR, 2021.
- Offline RL policies should be trained to be adaptive. In International Conference on Machine Learning, pp. 7513–7530. PMLR, 2022.
- Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv preprint arXiv:1703.02949, 2017.
- Exploration via elliptical episodic bonuses. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Xg-yZos9qJQ.
- Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29, 2016.
- Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), volume 2, pp. 1398–1403. IEEE, 2002.
- Offline reinforcement learning as one big sequence modeling problem. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=wgeK563QgSw.
- Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
- MobILE: Model-based imitation learning from observation alone. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=_Rtm4rYnIIL.
- Domain adaptive imitation learning. In International Conference on Machine Learning, pp. 5286–5295. PMLR, 2020.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Probabilistic graphical models: principles and techniques. MIT press, 2009.
- Offline reinforcement learning with Fisher divergence critic regularization. In International Conference on Machine Learning, pp. 5774–5783. PMLR, 2021a.
- Offline reinforcement learning with implicit Q-learning, 2021b.
- Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
- Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
- Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.
- Multi-game decision transformers. arXiv preprint arXiv:2205.15241, 2022.
- Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1118–1125. IEEE, 2018.
- Improving zero-shot generalization in offline reinforcement learning using generalized similarity functions. arXiv preprint arXiv:2111.14629, 2021.
- AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
- An overview of deep semi-supervised learning. arXiv preprint arXiv:2006.05278, 2020.
- Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pp. 2778–2787. PMLR, 2017.
- Offline reinforcement learning from images with latent space models. In Learning for Dynamics and Control, pp. 1154–1168. PMLR, 2021.
- A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
- Reinforcement learning with videos: Combining offline observations with interaction. arXiv preprint arXiv:2011.06507, 2020a.
- Learning predictive models from observation and interaction. In European Conference on Computer Vision, pp. 708–725. Springer, 2020b.
- Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141. IEEE, 2018.
- Third-person visual imitation learning via decoupled hierarchical controller. In NeurIPS, 2019.
- Third-person imitation learning. CoRR, abs/1703.01703, 2017. URL http://arxiv.org/abs/1703.01703.
- Behavioral cloning from observation. CoRR, abs/1805.01954, 2018a. URL http://arxiv.org/abs/1805.01954.
- Generative adversarial imitation from observation. CoRR, abs/1807.06158, 2018b. URL http://arxiv.org/abs/1807.06158.
- Recent advances in imitation learning from observation. arXiv preprint arXiv:1905.13566, 2019.
- A survey on semi-supervised learning. Machine Learning, 109(2):373–440, 2020.
- Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
- Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
- How to leverage unlabeled data in offline reinforcement learning. arXiv preprint arXiv:2202.01741, 2022.
- Online decision transformer. arXiv preprint arXiv:2202.05607, 2022.
- Zhu, X. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.