GPT-4 Technical Report (arXiv:2303.08774v6)
Abstract: We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
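The abstract's claim about predicting GPT-4's performance from models trained with no more than 1/1,000th the compute rests on fitting a scaling law to small-run results and extrapolating. As a concrete illustration, here is a minimal sketch of that workflow, assuming the power-law-with-irreducible-loss form L(C) = aC^b + c; all data points, constants, and names below are synthetic placeholders invented for illustration, not GPT-4 measurements.

```python
# Minimal sketch of loss-vs-compute extrapolation in the spirit of the
# report's predictable-scaling claim. Assumes the functional form
# L(C) = a * C^b + c (power law plus an irreducible-loss floor).
# All numbers are synthetic placeholders, not GPT-4 data.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    """Power law in training compute with an irreducible loss floor."""
    return a * np.power(compute, b) + c

# Synthetic (compute, final loss) pairs standing in for small training runs,
# spanning several orders of magnitude of normalized compute.
compute = np.array([1e-7, 1e-6, 1e-5, 1e-4, 1e-3])
loss = scaling_law(compute, a=2.0, b=-0.08, c=1.7)
loss += np.random.default_rng(0).normal(0.0, 0.005, size=compute.shape)

# Fit the three parameters using only the small runs.
(a, b, c), _ = curve_fit(scaling_law, compute, loss, p0=(1.0, -0.1, 1.0))

# Extrapolate three orders of magnitude past the largest fitted run,
# mirroring the 1/1,000th-compute prediction described in the abstract.
target_compute = 1.0
print(f"predicted loss at 1000x the largest run: "
      f"{scaling_law(target_compute, a, b, c):.3f}")
```

The design point the sketch makes explicit: the fit is performed only on the small runs, so the extrapolated value is a genuine prediction that can later be checked against the full-scale run.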