Large Language Models (2307.05782v2)

Published 11 Jul 2023 in cs.CL, hep-th, math.HO, and physics.comp-ph

Abstract: Artificial intelligence is making spectacular progress, and one of the best examples is the development of LLMs such as OpenAI's GPT series. In these lectures, written for readers with a background in mathematics or physics, we give a brief history and survey of the state of the art, and describe the underlying transformer architecture in detail. We then explore some current ideas on how LLMs work and how models trained to predict the next word in a text are able to perform other tasks displaying intelligence.

Understanding LLMs: A Comprehensive Overview

Introduction to LLMs

LLMs like GPT-4 represent significant advancements in the field of artificial intelligence, particularly in natural language processing. Their ability to generate human-like text, understand context, and solve complex problems marks a major leap forward. This overview explores the intricacies of LLMs, focusing on their architecture, training procedures, current capabilities, and theoretical underpinnings.

Transformer Architecture

At the heart of the most advanced LLMs is the transformer architecture. This model eschews traditional sequential processing in favor of parallelizable attention mechanisms, allowing LLMs to efficiently handle long-range dependencies in text. A transformer model alternates between layers of multi-head self-attention and position-wise fully connected feed-forward networks. The incorporation of positional embeddings enables the model to maintain the order of words, a key aspect of understanding language. This architecture is pivotal for the scalability and effectiveness of LLMs.
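To make this structure concrete, here is a minimal sketch of a single transformer block in PyTorch. It assumes a pre-norm layout and illustrative dimensions (model width, number of heads, feed-forward width); it is not the implementation of any particular GPT model. Token embeddings plus learned positional embeddings supply word identity and order, and a causal mask restricts attention to earlier positions.

# Minimal pre-norm transformer block: multi-head self-attention followed by a
# position-wise feed-forward network, each wrapped in a residual connection.
# Dimensions are illustrative, not those of any particular GPT model.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask=None):
        # Self-attention sublayer: queries, keys, and values all come from x.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out
        # Position-wise feed-forward sublayer.
        x = x + self.ff(self.norm2(x))
        return x

# Token and positional embeddings give the model word identity and word order.
vocab_size, seq_len, d_model = 1000, 16, 512
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(seq_len, d_model)
tokens = torch.randint(0, vocab_size, (2, seq_len))          # batch of 2 sequences
x = tok_emb(tokens) + pos_emb(torch.arange(seq_len))
# Boolean mask: True above the diagonal forbids attending to future positions.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
y = TransformerBlock()(x, causal_mask=mask)                  # shape (2, 16, 512)

A full model stacks many such blocks and ends with a projection from the final hidden states back to vocabulary logits; the parallel attention computation over all positions is what makes training on long sequences tractable.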

Training Process and Hyperparameters

LLMs undergo extensive training on vast text corpora, such as large crawls of the public web. Training uses a generative pre-training objective, in which the model learns to predict the next token in a sequence given the preceding tokens. Hyperparameters for state-of-the-art models such as GPT-3 include the embedding dimension, the number of layers, the context window size, and several others specified for GPT-3's architecture. Learning amounts to minimizing a cross-entropy loss by stochastic gradient descent, with care taken over regularization and learning-rate schedules to prevent overfitting and keep training efficient.
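The next-token objective itself fits in a few lines. The sketch below uses a toy stand-in model and illustrative optimizer settings (not GPT-3's actual hyperparameters) to show the shift-by-one target construction and the cross-entropy loss over the vocabulary.

# Sketch of the generative pre-training objective: given tokens t_1..t_{i-1},
# predict t_i, averaging cross-entropy over all positions. The tiny model and
# settings here are illustrative stand-ins, not real training hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len = 1000, 128, 32
model = nn.Sequential(                     # stand-in for a full transformer stack
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),        # maps hidden states to next-token logits
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

batch = torch.randint(0, vocab_size, (8, seq_len + 1))   # toy token ids
inputs, targets = batch[:, :-1], batch[:, 1:]            # targets are inputs shifted by one

logits = model(inputs)                                    # (8, seq_len, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                           # one step of stochastic gradient descent
optimizer.step()
optimizer.zero_grad()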

Capabilities and Limitations

LLMs demonstrate remarkable linguistic capabilities, including text generation, question-answering, translation, and even coding. However, they are not without limitations. These models often struggle with tasks requiring deep logical reasoning, planning, or a comprehensive understanding of the world. Furthermore, LLMs can "hallucinate" or generate inaccurate information, posing challenges for reliability and trustworthiness.

Theoretical Insights and Understanding LLMs

Understanding why LLMs work so well is an ongoing effort within the research community. Investigations into the internal mechanics of LLMs reveal that they may learn and internally represent complex linguistic structures, such as parse trees, through their embeddings and attention mechanisms. Moreover, studying LLMs through the lens of computational complexity theory provides insights into the types of problems LLMs can efficiently solve.
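A common tool in this line of work is the probing classifier: a deliberately simple model trained on frozen hidden states to test whether some linguistic property is easily decodable from them. The sketch below uses synthetic data and a hypothetical per-token tag set purely for illustration; it is a generic probe, not the setup of any specific study.

# Generic probing-classifier sketch: freeze a model's hidden states and train a
# linear probe to predict a linguistic label (here, a hypothetical part-of-speech
# tag per token). High probe accuracy suggests the information is linearly
# encoded in the representations. All data below is synthetic.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_tags, n_tokens = 512, 17, 4096
hidden_states = torch.randn(n_tokens, d_model)   # would come from a frozen LLM
tags = torch.randint(0, n_tags, (n_tokens,))     # gold linguistic annotations

probe = nn.Linear(d_model, n_tags)               # the probe is kept deliberately simple
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(100):
    loss = F.cross_entropy(probe(hidden_states), tags)
    loss.backward()
    opt.step()
    opt.zero_grad()

with torch.no_grad():
    accuracy = (probe(hidden_states).argmax(dim=-1) == tags).float().mean()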

Future Directions and Speculations

The field of LLMs is ripe with open questions and potential developments. Addressing limitations in planning, calibrated confidence in outputs, and reflection (the model's ability to understand and reason about its own processing and outputs) is a key area of future research. Enhancing LLMs' understanding and generation capabilities may involve integrating mechanisms for more explicit logical reasoning and world modeling, possibly drawing on advances in other areas of AI and cognitive science.

Conclusion

LLMs represent a significant advance in artificial intelligence, with the potential to transform how machines understand and generate human language. While their current capabilities are impressive, understanding their inner workings and addressing their limitations remain crucial areas of ongoing research. Exploring their core functionality, theoretical foundations, and strategies for future enhancement opens up exciting avenues for progress in natural language understanding and beyond.

Authors (1)
  1. Michael R. Douglas