GPT-4 Technical Report Overview
The "GPT-4 Technical Report" by OpenAI describes the capabilities, limitations, and safety measures of GPT-4, a large-scale multimodal model that accepts image and text inputs and produces text outputs.
Model Architecture and Training
GPT-4 is distinguished by its multimodal capabilities: it accepts image and text inputs together, a significant advance over text-only predecessors such as GPT-3.5. The model remains a Transformer-based architecture pre-trained to predict the next token, and a central theme of the report is predictable scaling: the training infrastructure and optimization methods were built to behave consistently across scales, so aspects of GPT-4's final performance could be predicted accurately from models trained with a small fraction of its compute.
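The report describes predicting GPT-4's final loss by fitting a scaling law with an irreducible-loss term, L(C) = aC^b + c, to smaller runs trained with up to 10,000x less compute. The snippet below is a minimal sketch of that general idea only; the functional form follows the report, but the data points, normalization, and fitting code are illustrative assumptions.

```python
# Illustrative sketch (not OpenAI's code): fit a power law with an irreducible-loss
# term, L(C) = a * C**b + c, to the final losses of small training runs, then
# extrapolate to the full compute budget. All numbers below are made up.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    """Final loss as a function of training compute (power law plus irreducible loss)."""
    return a * np.power(compute, b) + c

# Hypothetical (compute, final loss) pairs, with compute normalized so the target
# run sits at 1.0 (i.e. the small runs used 1e-4x to 1e-2x of the full budget).
compute = np.array([1e-4, 3e-4, 1e-3, 3e-3, 1e-2])
loss = np.array([3.2, 2.95, 2.7, 2.5, 2.3])

params, _ = curve_fit(scaling_law, compute, loss, p0=(1.0, -0.1, 1.0), maxfev=10000)
a, b, c = params
predicted = scaling_law(1.0, a, b, c)  # extrapolate to the full-compute run
print(f"fit: a={a:.3f}, b={b:.3f}, c={c:.3f}; predicted final loss: {predicted:.2f}")
```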
Performance Benchmarks
GPT-4's performance was assessed using a diverse set of benchmarks including professional exams, traditional NLP tasks, and multilingual tests. Notably:
- Professional and Academic Exams: GPT-4 demonstrated impressive capabilities, outscoring GPT-3.5 across multiple exams. For instance, it scored in the top 10% of test takers on a simulated Uniform Bar Examination and reached roughly the 99th percentile on the Graduate Record Examination (GRE) Verbal section.
- NLP Benchmarks: On the MMLU benchmark, GPT-4 outperformed the previous state of the art, achieving 86.4% accuracy compared to GPT-3.5's 70.0%. On tasks such as HellaSwag and ARC, it likewise surpassed the best previously reported results for language models.
- HumanEval: GPT-4 achieved a 67.0% pass rate on the HumanEval dataset, which measures the ability to synthesize correct Python functions, a substantial improvement over GPT-3.5 (a minimal pass@k evaluation sketch follows this list).
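HumanEval-style evaluation samples candidate programs from the model and checks them against unit tests. The GPT-4 report only quotes the resulting pass rates; the unbiased pass@k estimator below comes from the original HumanEval (Codex) paper and is included purely as a sketch, with made-up per-problem counts.

```python
# Sketch of the unbiased pass@k estimator used with HumanEval-style evaluations.
# For each problem, n samples are drawn and c of them pass the unit tests;
# pass@k = 1 - C(n-c, k) / C(n, k), computed in a numerically stable form.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled solutions passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical per-problem results: (samples drawn, samples that passed).
results = [(20, 14), (20, 0), (20, 5), (20, 20)]
scores = [pass_at_k(n, c, k=1) for n, c in results]
print(f"estimated pass@1 over {len(scores)} problems: {np.mean(scores):.3f}")
```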
Multilingual Capabilities
GPT-4 also performed strongly across languages. On an MMLU test set machine-translated into a range of languages, it outperformed prior models, including Chinchilla and PaLM, in most of the languages evaluated, among them low-resource languages such as Latvian and Welsh, suggesting meaningful gains for multilingual NLP.
Safety and Alignment
GPT-4 was fine-tuned with Reinforcement Learning from Human Feedback (RLHF) to improve its adherence to user intent and its safety behavior. This effort involved adversarial testing by over 50 domain experts and a model-assisted safety pipeline. Together, these measures substantially improved the model's ability to refuse inappropriate requests while reducing the occurrence of toxic outputs.
- Mitigation Strategies: GPT-4 was adversarially tested by experts in fields such as cybersecurity and biorisk to identify and mitigate potential safety risks. Mitigations include rule-based reward models (RBRMs), which grade candidate responses during RLHF to encourage appropriate refusals and discourage undesired behaviors (see the sketch after this list).
- Safety Metrics: Compared with GPT-3.5, GPT-4 responds to requests for disallowed content 82% less often, and on the RealToxicityPrompts dataset it produces toxic generations far less frequently (0.73% of the time versus 6.48% for GPT-3.5).
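The report describes RBRMs as zero-shot GPT-4 classifiers that grade a response into categories such as a refusal in the desired style, a refusal in an undesired style, disallowed content, or a safe non-refusal, with that grading used as an additional reward signal during RLHF. The sketch below illustrates the shape of that idea only; the category names follow the report, but the numeric rewards, control flow, and classifier stand-in are assumptions.

```python
# Minimal sketch of the idea behind a rule-based reward model (RBRM): a classifier
# assigns a candidate response to a category, and the category maps to a scalar
# reward used during RLHF. The rewards below are illustrative, not OpenAI's values.
from enum import Enum, auto

class ResponseCategory(Enum):
    REFUSAL_DESIRED_STYLE = auto()    # refuses, in the desired style
    REFUSAL_UNDESIRED_STYLE = auto()  # refuses, but e.g. evasive or preachy
    DISALLOWED_CONTENT = auto()       # response contains disallowed content
    SAFE_NON_REFUSAL = auto()         # answers the request safely

def rbrm_reward(prompt_is_disallowed: bool, category: ResponseCategory) -> float:
    """Hypothetical reward shaping: reward refusing disallowed prompts in the desired
    style and answering allowed prompts directly; penalize everything else."""
    if prompt_is_disallowed:
        return {
            ResponseCategory.REFUSAL_DESIRED_STYLE: 1.0,
            ResponseCategory.REFUSAL_UNDESIRED_STYLE: 0.2,
            ResponseCategory.DISALLOWED_CONTENT: -1.0,
            ResponseCategory.SAFE_NON_REFUSAL: -1.0,
        }[category]
    # For an ordinary, allowed prompt, a direct answer is the desired behavior.
    return 1.0 if category is ResponseCategory.SAFE_NON_REFUSAL else -0.5

# Example: a disallowed request that the model refused in the desired style.
print(rbrm_reward(True, ResponseCategory.REFUSAL_DESIRED_STYLE))  # -> 1.0
```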
Vision Capabilities
GPT-4’s ability to process visual inputs was highlighted with various examples, including interpreting diagrams, providing explanations for memes, and solving exam questions with visual components. Preliminary results indicate that the model retains its robust language processing skills while adeptly handling visual inputs.
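The report evaluates visual inputs through qualitative examples rather than an API specification. For readers who want to experiment, the snippet below shows one common way to pass an image alongside text using the OpenAI Python SDK's chat completions interface; the model name, prompt, and image URL are placeholders and not taken from the report.

```python
# Illustrative sketch only: sending an image plus a text prompt through the OpenAI
# Python SDK's chat completions interface. Model name and URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable chat model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Explain what is unusual about this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```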
Limitations
Despite substantial advancements, GPT-4 retains some limitations:
- Hallucination Issues: It may still "hallucinate" facts or make reasoning errors, so its outputs require careful validation in high-stakes contexts.
- Knowledge Cutoff: The model's training data mostly ends in September 2021, which limits its knowledge of subsequent events.
- Context Window: Like its predecessors, GPT-4 has a bounded context window, which limits how much text it can attend to at once and therefore its ability to handle very long inputs in a single pass (a simple token-based chunking sketch follows this list).
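When a document exceeds the context window, a common workaround is to split it into chunks that each fit within a fixed token budget and process them separately. The sketch below uses the tiktoken tokenizer for illustration; the 8,000-token budget and encoding name are assumptions standing in for whatever limits a deployed model actually has.

```python
# Illustrative sketch: split a long document into pieces that fit a token budget.
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 8000, encoding_name: str = "cl100k_base"):
    """Yield substrings of `text` whose token counts do not exceed `max_tokens`."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    for start in range(0, len(tokens), max_tokens):
        yield enc.decode(tokens[start:start + max_tokens])

long_document = "lorem ipsum " * 50000  # stand-in for text longer than the context window
chunks = list(chunk_by_tokens(long_document))
print(f"split into {len(chunks)} chunks")
```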
Future Directions
Future research and development will likely focus on improving the model's safety and reliability and on extending its knowledge to more recent data. The report also emphasizes the importance of continued collaboration with external researchers to assess and mitigate potential risks as model capabilities expand.
Conclusion
GPT-4 represents a significant step in AI development, showcasing enhanced capabilities in both language and multimodal tasks. Its performance across a broad spectrum of benchmarks underscores its potential for diverse applications, while ongoing efforts to improve its safety and reliability mark a crucial direction for future advancements. The report provides a comprehensive view of GPT-4's architecture, performance, and safety considerations, offering valuable insights for the continued evolution of AI technologies.