GPT-4 Technical Report (2303.08774v6)
Abstract: We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
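The abstract's claim about predicting performance from much smaller runs refers to scaling-law extrapolation. Below is a minimal sketch of that idea, assuming the power-law-with-irreducible-loss form L(C) = a·C^b + c that the report describes for final-loss prediction; the compute values, loss values, and fitted constants are invented purely for illustration and are not GPT-4's data.

```python
# Illustrative sketch of scaling-law extrapolation: fit L(C) = a * C**b + c
# to final losses from small training runs, then extrapolate to a compute
# budget ~1,000x larger. All numbers here are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    """Power law in training compute with an irreducible-loss floor c."""
    return a * np.power(compute, b) + c

# Hypothetical (compute in FLOPs, final loss) pairs from small runs.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20, 3e20])
loss = np.array([3.29, 3.12, 2.96, 2.83, 2.70, 2.60])

# Fit the three free parameters (a, b, c); b should come out negative.
(a, b, c), _ = curve_fit(scaling_law, compute, loss, p0=(100.0, -0.1, 1.5))

# Extrapolate roughly 1,000x beyond the largest fitted run.
target_compute = 3e23
print(f"fit: a={a:.3g}, b={b:.3g}, c={c:.3g}")
print(f"predicted loss at {target_compute:.0e} FLOPs: "
      f"{scaling_law(target_compute, a, b, c):.3f}")
```

The key design point is the irreducible-loss term c: without it, a pure power law fitted at small scale tends to overstate how much loss keeps falling at large scale.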