GPT-NeoX-20B: An Open-Source Autoregressive Language Model (2204.06745v1)

Published 14 Apr 2022 in cs.CL

Abstract: We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive LLM trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B's architecture and training and evaluate its performance on a range of language-understanding, mathematics, and knowledge-based tasks. We find that GPT-NeoX-20B is a particularly powerful few-shot reasoner and gains far more in performance when evaluated five-shot than similarly sized GPT-3 and FairSeq models. We open-source the training and evaluation code, as well as the model weights, at https://github.com/EleutherAI/gpt-neox.

GPT-NeoX-20B: An Open-Source Autoregressive LLM

The paper "GPT-NeoX-20B: An Open-Source Autoregressive LLM" by Sid Black et al., presents GPT-NeoX-20B, a 20 billion parameter Transformer-based autoregressive LLM trained on the Pile corpus. The model's weights are made publicly available under a permissive open-source license, endorsing the principles of accessibility and transparency in AI research.

Architecture and Training

GPT-NeoX-20B is built upon a Transformer decoder architecture, closely following the GPT-3 model design but incorporating several key deviations to optimize performance. The architectural details include:

  • 44 layers with a hidden dimension of 6144 and 64 attention heads.
  • Rotary positional embeddings in place of learned positional embeddings, applied to the first 25% of the embedding dimensions.
  • Parallel computation of the attention and feed-forward layers, which reduces communication overhead and improves training efficiency (both choices are sketched in the code after this list).
  • Custom initialization methods tailored to maintain stability in deep and wide architectures.
  • Exclusive use of dense layers, opting out of hybrid sparse-dense layers for simplicity.
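
The following is a minimal PyTorch sketch (not the authors' released code) of two of these choices: rotary embeddings applied to only the first 25% of each attention head's dimensions, and the parallel residual layout x + Attn(LN1(x)) + FFN(LN2(x)). The dimensions and the 4x feed-forward expansion are illustrative assumptions.

```python
import torch
import torch.nn as nn


def apply_rotary(q, k, rotary_frac=0.25, base=10000):
    """Apply rotary position embeddings to the first `rotary_frac` of head dims."""
    head_dim = q.shape[-1]
    rot_dim = int(head_dim * rotary_frac)
    q_rot, q_pass = q[..., :rot_dim], q[..., rot_dim:]
    k_rot, k_pass = k[..., :rot_dim], k[..., rot_dim:]

    seq_len = q.shape[-2]
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, rot_dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)                       # (seq, rot_dim)
    cos, sin = emb.cos(), emb.sin()

    def rotate_half(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    q_out = torch.cat((q_rot * cos + rotate_half(q_rot) * sin, q_pass), dim=-1)
    k_out = torch.cat((k_rot * cos + rotate_half(k_rot) * sin, k_pass), dim=-1)
    return q_out, k_out


class ParallelBlock(nn.Module):
    """Transformer block whose attention and MLP branches read the same input."""

    def __init__(self, hidden=6144, heads=64):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden)
        self.ln_mlp = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )

    def forward(self, x):
        # Parallel residual: x + Attn(LN1(x)) + FFN(LN2(x)), computed in one pass.
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(self.ln_mlp(x))
```

Note that `nn.MultiheadAttention` is used here only to keep the sketch short; the released GPT-NeoX code implements its own attention and applies rotary embeddings inside each head.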

Training was performed on twelve Supermicro servers, each equipped with eight NVIDIA A100 GPUs and AMD EPYC CPUs, utilizing effective data and model parallelism strategies to maintain high computational efficiency.
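
As a rough sanity check on the scale involved, the dimensions above approximately reproduce the 20B parameter count, and the implied memory footprint shows why model parallelism is needed. The 4x feed-forward expansion, mixed-precision Adam, and the omission of biases, LayerNorms, and activation memory are assumptions of this sketch, not figures from the paper.

```python
# Back-of-the-envelope parameter and memory count from the dimensions above.
d, layers, vocab = 6144, 44, 50257

attn_params = 4 * d * d                 # Q, K, V, and output projections
mlp_params = 2 * d * (4 * d)            # up- and down-projections (assumed 4x expansion)
per_layer = attn_params + mlp_params    # ~0.45B parameters per layer
total = layers * per_layer + vocab * d  # plus the token embedding matrix
print(f"total parameters: {total / 1e9:.1f}B")  # ~20.2B

weights_gb = total * 2 / 1e9    # fp16 weights
optimizer_gb = total * 12 / 1e9  # fp32 master weights + Adam moments (assumed)
print(f"fp16 weights: {weights_gb:.0f} GB, optimizer state: ~{optimizer_gb:.0f} GB")
# Both figures far exceed a single 40 GB A100, hence model parallelism within
# each node combined with data parallelism across nodes.
```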

Training Data and Tokenization

GPT-NeoX-20B was trained on the Pile, a highly diverse dataset consisting of 22 different data sources, encompassing academic writing, web scrapes, prose, dialogue, and miscellaneous categories. This selection aims to ensure broad applicability and robustness of the model across various domains.

The model employs a BPE-based tokenizer with a vocabulary size of 50257, trained specifically on the Pile to better handle the dataset's diversity. Key features of this tokenizer include:

  • Consistent space delimitation across all tokens.
  • Inclusion of tokens for repeated spaces to handle code and prose more efficiently.
  • General improvements over the GPT-2 tokenizer, reducing the number of tokens required to represent a given text (compared in the sketch below).
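
A minimal comparison using the Hugging Face `transformers` tokenizers for both models; the example text is arbitrary, and whitespace-heavy code is where the repeated-space tokens matter most.

```python
from transformers import AutoTokenizer

neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

code = "def add(x, y):\n        return x + y\n"
print("GPT-NeoX-20B tokens:", len(neox_tok.encode(code)))
print("GPT-2 tokens:       ", len(gpt2_tok.encode(code)))
```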

Evaluation and Performance

The model's performance was evaluated on a comprehensive set of benchmarks, categorized into natural language tasks, mathematical tasks, and advanced knowledge-based tasks. Key results include:

  • Natural Language Tasks: GPT-NeoX-20B exhibits competitive performance relative to similarly sized models such as FairSeq 13B and various GPT-3 models. It notably excels in specific tasks like LAMBADA, PIQA, and ARC.
  • Mathematical Tasks: The model demonstrates robust arithmetic capabilities, outperforming GPT-3 and FairSeq models on several numerical benchmarks.
  • Advanced Knowledge-Based Tasks: GPT-NeoX-20B shows significant improvements in few-shot settings, particularly in knowledge-intensive domains. It leverages few-shot examples far more effectively than similarly sized GPT-3 and FairSeq models, outperforming them by substantial margins when evaluated five-shot (the few-shot setup is illustrated below).
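
"Few-shot" here simply means that a handful of worked examples are prepended to the prompt before the test query. The sketch below shows a five-shot prompt with the Hugging Face `transformers` API; the task, prompt format, and decoding settings are illustrative only, and the paper's reported numbers come from the authors' open-sourced evaluation code rather than from this snippet.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"  # ~40 GB of fp16 weights; needs multiple GPUs or offloading
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype="auto"
)

# Five labelled examples ("shots") followed by the query the model must answer.
shots = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
    ("What is the capital of Canada?", "Ottawa"),
    ("What is the capital of Egypt?", "Cairo"),
    ("What is the capital of Brazil?", "Brasilia"),
]
query = "What is the capital of Kenya?"

prompt = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in shots) + f"Q: {query}\nA:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tok.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```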

Broader Impacts and Limitations

The authors emphasize the importance of releasing large-scale models to the public for advancing AI ethics and alignment research. Public access to GPT-NeoX-20B is expected to empower independent researchers and small organizations, mitigating the centralization of AI capabilities.

The paper acknowledges potential limitations, including:

  • The hyperparameters were selected through smaller-scale experiments and may not be optimal at the final 20B scale.
  • The training data was not deduplicated, a practice common in recent models, which could affect performance and generalization.
  • No evaluations were run on coding-specific benchmarks, despite design choices aimed at improving performance on code.

Environmental Impact

Training and developing GPT-NeoX-20B entailed significant energy consumption, emitting approximately 31.73 metric tons of CO2. This figure highlights the environmental cost associated with training LLMs, calling for broader discussions on sustainable AI practices.

Conclusion

GPT-NeoX-20B represents a substantial contribution to the research community, balancing cutting-edge performance with the principles of open access and transparency. By releasing the code and weights, the authors facilitate further research in AI safety, interpretability, and several other critical areas. As LLMs continue to evolve, GPT-NeoX-20B sets a precedent for responsible and inclusive AI research, encouraging a collaborative approach to overcoming the challenges and leveraging the potential of advanced AI systems. Future research could focus on optimizing training efficiency, improving generalization through data deduplication, and expanding benchmarks to include coding tasks.

Authors (17)
  1. Sid Black (4 papers)
  2. Stella Biderman (55 papers)
  3. Eric Hallahan (3 papers)
  4. Quentin Anthony (25 papers)
  5. Leo Gao (16 papers)
  6. Laurence Golding (2 papers)
  7. Horace He (12 papers)
  8. Connor Leahy (3 papers)
  9. Kyle McDonell (6 papers)
  10. Jason Phang (40 papers)
  11. Michael Pieler (10 papers)
  12. USVSN Sai Prashanth (4 papers)
  13. Shivanshu Purohit (4 papers)
  14. Laria Reynolds (6 papers)
  15. Jonathan Tow (7 papers)
  16. Ben Wang (42 papers)
  17. Samuel Weinbach (11 papers)
Citations (726)