The paper introduces LLaMA, a collection of foundation large language models (LLMs) ranging from 7B to 65B parameters. The authors demonstrate that it is possible to train SOTA models using exclusively publicly available datasets. The LLaMA-13B model outperforms GPT-3 (175B) on most benchmarks, and the LLaMA-65B model is competitive with Chinchilla-70B and PaLM-540B. All models are released to the research community.
The authors' approach to training is similar to previous work and inspired by the Chinchilla scaling laws. They trained large transformers on a large quantity of textual data using a standard optimizer.
The pre-training dataset is a mixture of several sources, covering a diverse set of domains. The data sources largely overlap with those used to train other LLMs, restricted to data that is publicly available and compatible with open sourcing. The training set consists of the following sources (a sketch of sampling by these proportions follows the list):
- English CommonCrawl (67%) which is preprocessed with the CCNet pipeline, performing deduplication, language identification, and low-quality content filtering.
- C4 (15%) which contains deduplication and language identification steps; the main difference with CCNet is the quality filtering, which relies on heuristics such as the presence of punctuation marks or the number of words and sentences in a webpage.
- GitHub (4.5%) which uses the public GitHub dataset available on Google BigQuery, filtered for license type and low-quality files, and deduplicated at the file level.
- Wikipedia (4.5%) which includes dumps from June-August 2022 covering 20 languages using either the Latin or Cyrillic scripts, processed to remove hyperlinks, comments, and formatting boilerplate.
- Gutenberg and Books3 (4.5%) which includes two book corpora in the training dataset: the Gutenberg Project and the Books3 section of ThePile, deduplicated at the book level.
- ArXiv (2.5%) which processes arXiv LaTeX files, removing everything before the first section, as well as the bibliography, comments, and inline-expanded definitions and macros.
- Stack Exchange (2%) which uses a dump of Stack Exchange, removing the HTML tags from text and sorting the answers by score.
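Taken together, these percentages define a categorical distribution over sources. Below is a minimal sketch of sampling training documents according to such a mixture; the source names and weights mirror the list above, but the code is only illustrative and is not the paper's data loader.

```python
import random

# Approximate sampling proportions of the pre-training mixture (see the list above).
MIXTURE = {
    "commoncrawl": 0.670,
    "c4": 0.150,
    "github": 0.045,
    "wikipedia": 0.045,
    "books": 0.045,
    "arxiv": 0.025,
    "stackexchange": 0.020,
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its mixture weight."""
    sources = list(MIXTURE)
    weights = [MIXTURE[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print(sample_source(rng))  # e.g. "commoncrawl"
```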
The data is tokenized with the byte-pair encoding (BPE) algorithm, using the implementation from SentencePiece. The entire training dataset contains roughly 1.4T tokens after tokenization. Each token is used only once during training, with the exception of the Wikipedia and Books domains, over which approximately two epochs are performed.
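A minimal sketch of BPE tokenization with SentencePiece follows, assuming the `sentencepiece` Python package and a plain-text corpus file; the file names and training corpus are placeholders, not the paper's actual configuration.

```python
import sentencepiece as spm

# Train a BPE model on a plain-text corpus (one sentence per line).
# "corpus.txt" is a placeholder; the vocabulary size of 32k matches LLaMA's tokenizer.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    model_type="bpe",
    vocab_size=32000,
    byte_fallback=True,   # fall back to bytes for characters not in the vocabulary
)

# Load the trained model and tokenize text.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
ids = sp.encode("Large language models are trained on trillions of tokens.")
print(ids)
print(sp.decode(ids))
```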
The network is based on the transformer architecture. The differences with the original architecture are:
- Pre-normalization: the input of each transformer sub-layer is normalized, instead of normalizing the output, using the RMSNorm normalizing function.
- SwiGLU activation function: the ReLU non-linearity is replaced by the SwiGLU activation function, using a dimension of $\frac{2}{3}4d$ instead of $4d$ (see the sketch after this list).
- Rotary Embeddings: the absolute positional embeddings are removed, and instead, rotary positional embeddings (RoPE) are added at each layer of the network.
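A minimal PyTorch sketch of the first two modifications, RMSNorm pre-normalization and a SwiGLU feed-forward block with hidden dimension $\frac{2}{3}\cdot 4d$; this is a simplified illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of the features, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU gating and a 2/3 * 4d hidden dimension."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) gated elementwise with x W_up, then projected back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Pre-normalization: normalize the sub-layer input, then add the residual.
x = torch.randn(2, 16, 512)
ffn, norm = SwiGLUFeedForward(512), RMSNorm(512)
y = x + ffn(norm(x))
```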
The models are trained using the AdamW optimizer, with $\beta_1 = 0.9$ and $\beta_2 = 0.95$. A cosine learning rate schedule is used, such that the final learning rate is equal to 10% of the maximal learning rate, with a weight decay of $0.1$ and gradient clipping of $1.0$. $2,000$ warmup steps are used, and the learning rate and batch size are varied with the size of the model.
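A minimal PyTorch sketch of this optimizer setup and schedule, with linear warmup over 2,000 steps and cosine decay down to 10% of the peak learning rate; the peak learning rate, total step count, and stand-in model are placeholders, since the actual values vary with model size.

```python
import math
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the transformer
total_steps, warmup_steps, peak_lr = 100_000, 2_000, 3e-4  # illustrative values

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay from 1.0 down to 0.1 of the peak LR.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(10):  # toy training loop
    optimizer.zero_grad()
    loss = model(torch.randn(4, 512)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping of 1.0
    optimizer.step()
    scheduler.step()
```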
Several optimizations were made to improve the training speed of the models. First, an efficient implementation of the causal multi-head attention is used to reduce memory usage and runtime. Second, the amount of activations recomputed during the backward pass is reduced by checkpointing the activations that are expensive to compute. When training a 65B-parameter model, the code processes around 380 tokens/sec/GPU on 2048 A100 GPUs with 80GB of RAM, meaning that training over the dataset containing 1.4T tokens takes approximately 21 days.
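As a rough sanity check of that throughput figure: $380 \text{ tokens/s/GPU} \times 2048 \text{ GPUs} \times 86{,}400 \text{ s/day} \approx 6.7 \times 10^{10}$ tokens per day, and $1.4 \times 10^{12} / 6.7 \times 10^{10} \approx 21$ days.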
The authors consider zero-shot and few-shot tasks, reporting results on a total of 20 benchmarks and comparing LLaMA with other foundation models, namely GPT-3, Gopher, Chinchilla, and PaLM, as well as the open-sourced OPT, GPT-J, and GPT-Neo models. LLaMA is also briefly compared with instruction-tuned models such as OPT-IML and Flan-PaLM.
In the multiple-choice tasks, the objective is to select the most appropriate completion among a set of given options, based on a provided context. The completion with the highest likelihood given the provided context is selected. The likelihood is normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which the completion is selected based on the likelihood normalized by the likelihood of the completion given "Answer:" as context: $P(\text{completion} \mid \text{context}) \,/\, P(\text{completion} \mid \text{"Answer:"})$, where $P(\text{completion} \mid \text{context})$ is the probability of the completion given the context and $P(\text{completion} \mid \text{"Answer:"})$ is the probability of the completion given the string "Answer:".
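A sketch of this selection rule in Python, using a hypothetical `log_likelihood(context, completion)` helper that returns the model's total log-probability of the completion given the context; the helper and function names are illustrative, not the paper's code.

```python
def log_likelihood(context: str, completion: str) -> float:
    """Hypothetical helper: total log-probability the model assigns to `completion` given `context`."""
    raise NotImplementedError("model-specific")

def select_completion(context: str, completions: list[str],
                      normalize_with_answer_prefix: bool = False) -> str:
    """Pick the completion the model prefers under the scoring rules described above."""
    best, best_score = None, float("-inf")
    for completion in completions:
        if normalize_with_answer_prefix:
            # OpenBookQA / BoolQ variant: divide by the likelihood of the completion
            # given "Answer:" as context (a subtraction in log space).
            score = log_likelihood(context, completion) - log_likelihood("Answer:", completion)
        else:
            # Default variant: normalize the log-likelihood by the completion length in characters.
            score = log_likelihood(context, completion) / len(completion)
        if score > best_score:
            best, best_score = completion, score
    return best
```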
For common sense reasoning benchmarks, LLaMA-65B outperforms Chinchilla-70B on all reported benchmarks but BoolQ, and surpasses PaLM-540B everywhere but on BoolQ and WinoGrande. The LLaMA-13B model also outperforms GPT-3 on most benchmarks despite being 10× smaller.
On closed-book question answering benchmarks, LLaMA-65B achieves SOTA performance in the zero-shot and few-shot settings. LLaMA-13B is also competitive with GPT-3 and Chinchilla on these benchmarks, despite being 5-10× smaller.
On the RACE reading comprehension benchmark, LLaMA-65B is competitive with PaLM-540B, and LLaMA-13B outperforms GPT-3 by a few percent.
On GSM8k, LLaMA-65B outperforms Minerva-62B, although it has not been fine-tuned on mathematical data.
For code generation, LLaMA outperforms other general-purpose models such as LaMDA and PaLM. LLaMA models with 13B parameters and more outperform LaMDA-137B on both HumanEval and MBPP. LLaMA-65B also outperforms PaLM-62B, even when it is trained longer.
On MMLU, LLaMA-65B is behind both Chinchilla-70B and PaLM-540B by a few percent on average, and across most domains.
The authors show that briefly finetuning on instruction data rapidly leads to improvements on MMLU. The non-finetuned LLaMA-65B is already able to follow basic instructions, and a very small amount of finetuning improves performance on MMLU and further improves the model's ability to follow instructions. The results of their instruct model, LLaMA-I, on MMLU are compared with existing instruction-finetuned models of moderate size, namely OPT-IML and the Flan-PaLM series. Despite the simplicity of the instruction finetuning approach used here, LLaMA-I reaches 68.9% on MMLU.
The potential harm of LLaMA-65B is evaluated on different benchmarks that measure toxic content production and stereotype detection. The authors find that these evaluations are not sufficient to fully understand the risks associated with these models.
For each of the $100$k prompts from the RealToxicityPrompts benchmark, a completion is generated and its toxicity score is measured. The score per prompt ranges from 0 (non-toxic) to 1 (toxic). Toxicity is observed to increase with the size of the model, especially for "Respectful" prompts.
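A minimal sketch of such an evaluation loop, assuming a hypothetical `generate(prompt)` call for the model and a hypothetical `toxicity_score(text)` scorer returning a value in $[0, 1]$; both are placeholders standing in for whatever generation and scoring setup is actually used.

```python
def generate(prompt: str) -> str:
    """Hypothetical model call: generate a continuation for the prompt."""
    raise NotImplementedError("model-specific")

def toxicity_score(text: str) -> float:
    """Hypothetical scorer: return a toxicity score in [0, 1] for the text."""
    raise NotImplementedError("scorer-specific")

def mean_toxicity(prompts: list[str]) -> float:
    """Average toxicity of the model's continuations over a list of prompts."""
    scores = [toxicity_score(generate(p)) for p in prompts]
    return sum(scores) / len(scores)
```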
The biases in the model are evaluated on CrowS-Pairs, which measures biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. LLaMA compares slightly favorably to both GPT-3 and OPT-175B on average. The model is particularly biased in the religion category (+10% compared to OPT-175B), followed by age and gender.
To further investigate the biases of the model on the gender category, the WinoGender benchmark is used. The model is significantly better at performing co-reference resolution for the "their/them/someone" pronouns than for the "her/her/she" and "his/him/he" pronouns, which is likely indicative of gender bias. LLaMA-65B makes more errors on the "gotcha" examples, clearly showing that it captures societal biases related to gender and occupation. The drop in performance exists for both "her/her/she" and "his/him/he" pronouns, which is indicative of biases regardless of gender.
On the TruthfulQA benchmark, compared to GPT-3, the model scores higher in both categories, but the rate of correct answers is still low, showing that the model is likely to hallucinate incorrect answers.
The training of the models has consumed a massive quantity of energy, responsible for the emission of carbon dioxide. The total energy consumption and the resulting carbon footprint are broken down, with the estimate that developing these models cost around 2,638 MWh of energy and a total emission of about 1,015 tCO2eq.
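As a rough consistency check, assuming a carbon intensity of about 0.385 tCO2eq per MWh (an assumption about the conversion factor, not stated above): $2{,}638 \text{ MWh} \times 0.385 \approx 1{,}015$ tCO2eq, which matches the reported emission figure.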