Learning Universal Predictors (2401.14953v1)
Abstract: Meta-learning has emerged as a powerful approach to train neural networks to learn new tasks quickly from limited data. Broad exposure to different tasks leads to versatile representations that enable general problem solving. But what are the limits of meta-learning? In this work, we explore the potential of amortizing the most powerful universal predictor, namely Solomonoff Induction (SI), into neural networks by pushing meta-learning to its limits. We use Universal Turing Machines (UTMs) to generate training data that exposes networks to a broad range of patterns. We provide a theoretical analysis of the UTM data generation processes and meta-training protocols. We conduct comprehensive experiments with neural architectures (e.g. LSTMs, Transformers) and algorithmic data generators of varying complexity and universality. Our results suggest that UTM data is a valuable resource for meta-learning, and that it can be used to train neural networks capable of learning universal prediction strategies.
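The UTM data generation described in the abstract can be illustrated with a minimal sketch in Python. This is not the authors' implementation: the choice of a Brainfuck-style machine as the UTM, the uniform program sampler, the step budget, and the handling of input are illustrative assumptions only. It shows the general idea of sampling random programs, executing them under a resource limit, and keeping their outputs as training sequences for a meta-learner.

```python
# Minimal sketch (assumptions, not the paper's code): sample random programs for a
# simple universal machine (a Brainfuck-style interpreter), run each with a step
# budget, and keep the printed bytes as a sequence for meta-training.
import random

OPS = "+-<>[].,"  # Brainfuck instruction set


def random_program(length: int) -> str:
    """Sample a uniformly random program string (many will print nothing)."""
    return "".join(random.choice(OPS) for _ in range(length))


def run_bf(program: str, max_steps: int = 1000, tape_len: int = 256) -> bytes:
    """Run a Brainfuck program under a step limit; return whatever it printed."""
    # Pre-compute matching brackets; reject unbalanced programs with empty output.
    stack, jumps = [], {}
    for i, c in enumerate(program):
        if c == "[":
            stack.append(i)
        elif c == "]":
            if not stack:
                return b""
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    if stack:
        return b""

    tape = [0] * tape_len
    ptr = pc = steps = 0
    out = bytearray()
    while pc < len(program) and steps < max_steps:
        c = program[pc]
        if c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ">":
            ptr = (ptr + 1) % tape_len
        elif c == "<":
            ptr = (ptr - 1) % tape_len
        elif c == ".":
            out.append(tape[ptr])
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]
        # "," (input) is treated as a no-op in this sketch.
        pc += 1
        steps += 1
    return bytes(out)


# Usage: build a small batch of output sequences a sequence model could be meta-trained on.
sequences = [run_bf(random_program(50)) for _ in range(32)]
sequences = [s for s in sequences if s]  # discard programs that printed nothing
```

The step budget plays the role of a resource bound on the universal machine; the actual machine, alphabet, and sampling distribution used in the paper may differ.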