- The paper introduces a family of dense Transformer models ranging from 8B to 405B parameters with compute-optimal training and efficient 4D parallelism.
- The methodology leverages scaling laws, aggressive de-duplication, and targeted data annealing to optimize performance for diverse tasks including reasoning, code, and safety.
- The paper reports performance competitive with state-of-the-art closed models, complemented by compositional multimodal extensions, system-level safety measures, and integrated tool use.
The Llama 3 Herd of Models: Technical Summary and Implications
Model Architecture and Scaling
Llama 3 introduces a family of dense Transformer-based LLMs with parameter counts of 8B, 70B, and 405B, supporting context windows of up to 128K tokens. The architecture remains close to Llama 2, with minor modifications such as grouped query attention (GQA) for improved inference efficiency, a 128K-token vocabulary for enhanced multilingual support, and an increased RoPE base frequency (500,000) to facilitate long-context modeling. The flagship 405B model is trained with 3.8×10²⁵ FLOPs, nearly 50× the compute of Llama 2's largest variant, and is positioned as approximately compute-optimal per the scaling-law analysis.
Figure 1: Illustration of the overall architecture and training of Llama 3. Llama 3 is a Transformer LLM trained to predict the next token of a textual sequence.
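The long-context change is small in code terms. The sketch below computes rotary-embedding (RoPE) angles using the raised base frequency of 500,000 reported for Llama 3; the helper name, tensor shapes, and example dimensions are illustrative, not taken from the released implementation.

```python
import torch

def rope_frequencies(head_dim: int, max_seq_len: int, base: float = 500_000.0):
    """Minimal sketch of rotary position embedding (RoPE) angle computation.

    Llama 3 raises the RoPE base frequency (500,000 vs. the commonly used
    10,000) so that rotation periods are long enough to distinguish
    positions in very long contexts.
    """
    # Inverse frequencies for each pair of dimensions within a head.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    # Outer product: one rotation angle per (position, frequency) pair.
    angles = torch.outer(positions, inv_freq)   # (max_seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

# Example call with hypothetical dimensions.
cos, sin = rope_frequencies(head_dim=128, max_seq_len=8192)
```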
Scaling law experiments were conducted across compute budgets from 6×10¹⁸ to 10²² FLOPs, with model sizes ranging from 40M to 16B parameters. The scaling law methodology correlates negative log-likelihood on downstream tasks with training FLOPs and then maps log-likelihood to task accuracy, enabling accurate performance forecasting for large-scale models.

Figure 2: Scaling law IsoFLOPs curves between 6×10¹⁸ and 10²² FLOPs, used to identify compute-optimal model sizes.
Figure 3: Scaling law forecast for ARC Challenge, showing the relationship between pre-training FLOPs, negative log-likelihood, and benchmark accuracy.
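The two-step forecasting methodology can be illustrated with a small curve-fitting sketch: a power law from compute to downstream negative log-likelihood, then a sigmoidal map from log-likelihood to accuracy. All measurements below are fabricated placeholders for illustration, not values from the paper; only the 3.8×10²⁵ FLOPs extrapolation point comes from the reported training budget.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import expit

# Fabricated placeholder measurements from hypothetical small-scale runs:
# compute budget (FLOPs), normalized NLL on a benchmark, and benchmark accuracy.
flops = np.array([6e18, 3e19, 1e20, 3e21, 1e22])
nll   = np.array([1.10, 0.95, 0.82, 0.66, 0.58])
acc   = np.array([0.31, 0.38, 0.47, 0.61, 0.68])

# Step 1: assume a power law NLL(C) = a * C^(-b); fit it in log space.
slope, intercept = np.polyfit(np.log(flops), np.log(nll), 1)
a, b = np.exp(intercept), -slope

# Step 2: assume a sigmoidal map from NLL to task accuracy and fit its parameters.
def acc_of_nll(L, w, c):
    return expit(-w * (L - c))

(w, c), _ = curve_fit(acc_of_nll, nll, acc, p0=(5.0, 0.8))

# Extrapolate to the flagship compute budget reported in the paper.
pred_nll = a * 3.8e25 ** (-b)
pred_acc = acc_of_nll(pred_nll, w, c)
print(f"forecast NLL = {pred_nll:.3f}, forecast accuracy = {pred_acc:.3f}")
```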
Data Curation and Pre-Training
Llama 3's pre-training corpus comprises 15T multilingual tokens, with a data mix of 50% general knowledge, 25% math/reasoning, 17% code, and 8% multilingual content. The data pipeline includes aggressive de-duplication at URL, document, and line levels, heuristic and model-based quality filtering, and domain-specific extraction for code and reasoning data. Annealing strategies are employed to upsample high-quality code and math data in later training stages, yielding significant performance gains for smaller models but negligible improvements for the 405B model, indicating strong in-context learning at scale.
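As a toy illustration of one layer of this pipeline, the sketch below performs line-level de-duplication by dropping lines that recur too often across a bucket of documents. The threshold and bucket granularity are illustrative assumptions, not the paper's exact settings, and URL- and document-level de-duplication (e.g., MinHash) are omitted.

```python
from collections import Counter

def line_dedup(documents, max_repeats=6):
    """Toy sketch of line-level de-duplication: count how often each line
    appears across a bucket of documents and drop lines that recur more
    than `max_repeats` times (typically boilerplate such as navigation
    menus or cookie banners)."""
    counts = Counter(line for doc in documents for line in doc.splitlines())
    cleaned = []
    for doc in documents:
        kept = [line for line in doc.splitlines() if counts[line] <= max_repeats]
        cleaned.append("\n".join(kept))
    return cleaned
```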
Infrastructure and Parallelism
Training the 405B model required up to 16K H100 GPUs on Meta's production clusters, leveraging a three-layer Clos RoCE network topology with 400 Gbps interconnects. The model employs 4D parallelism—tensor, pipeline, context, and fully sharded data parallelism (FSDP)—to efficiently shard computation and memory across the cluster.
Figure 4: Illustration of 4D parallelism, showing the organization of GPUs into tensor, context, pipeline, and data parallel groups.
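A common way to reason about such a layout is to map each flat GPU rank to its coordinates along the four parallel dimensions. The group sizes below (8 × 2 × 16 × 64 = 16,384 GPUs) are made-up placeholders that happen to match the cluster scale, not the configuration used for Llama 3, and the ordering (TP innermost, DP outermost) is just one common convention.

```python
def parallel_coords(rank, tp=8, cp=2, pp=16, dp=64):
    """Illustrative mapping from a flat GPU rank to its (tensor, context,
    pipeline, data) parallel coordinates in a 4D layout. The data-parallel
    dimension corresponds to FSDP sharding."""
    assert rank < tp * cp * pp * dp
    tp_rank = rank % tp
    cp_rank = (rank // tp) % cp
    pp_rank = (rank // (tp * cp)) % pp
    dp_rank = rank // (tp * cp * pp)
    return tp_rank, cp_rank, pp_rank, dp_rank

# Example: locate GPU 12345 within the hypothetical 16K-GPU mesh.
print(parallel_coords(12345))
```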
Pipeline parallelism is enhanced with tunable micro-batching and interleaved scheduling to address memory and computation imbalances, enabling efficient training without activation checkpointing for 8K-token sequences.
Figure 5: Illustration of pipeline parallelism in Llama 3, with tunable micro-batch scheduling for improved efficiency.
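The motivation for making the micro-batch count tunable can be seen from the standard pipeline "bubble" formula for a simple synchronous schedule, (P − 1) / (N + P − 1): idle time shrinks as the number of micro-batches N grows relative to the number of stages P. This is a textbook approximation, not a model of Llama 3's interleaved schedule.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle ("bubble") fraction of a simple synchronous pipeline schedule:
    (P - 1) / (N + P - 1). More micro-batches per stage keep the pipeline
    fuller, which is why the schedule exposes N as a tunable knob."""
    p, n = num_stages, num_microbatches
    return (p - 1) / (n + p - 1)

for n in (4, 16, 64, 256):
    print(f"{n:>3} micro-batches -> bubble {pipeline_bubble_fraction(16, n):.3f}")
```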
Post-Training and Alignment
Post-training consists of supervised finetuning (SFT), rejection sampling (RS), and direct preference optimization (DPO), iterated over six rounds. The process is supported by a reward model trained on human-annotated preference data, with additional data curation for code, multilinguality, reasoning, long context, tool use, and steerability. DPO is preferred over PPO for stability and efficiency at scale, with modifications such as masking formatting tokens and regularization with NLL loss.
Figure 6: Illustration of the overall post-training approach for Llama 3, involving rejection sampling, supervised finetuning, and direct preference optimization.
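A minimal sketch of the modified DPO objective is shown below: the standard preference term plus an NLL regularizer on the chosen response, with a mask that excludes formatting tokens. The coefficient values, masking scheme, and function signature are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_nll(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      chosen_token_logps, chosen_mask,
                      beta=0.1, nll_coeff=0.2):
    """Sketch of a DPO objective with an additional NLL regularizer.

    *_logps are summed log-probabilities of whole responses under the
    policy / frozen reference models; chosen_token_logps are per-token
    log-probabilities of the chosen response; chosen_mask zeroes out
    special formatting tokens so they contribute to neither term.
    """
    # Standard DPO term: prefer chosen over rejected, relative to the reference.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    dpo = -F.logsigmoid(logits).mean()

    # NLL regularizer on the chosen response, with formatting tokens masked out.
    nll = -(chosen_token_logps * chosen_mask).sum() / chosen_mask.sum().clamp(min=1)

    return dpo + nll_coeff * nll
```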
Multimodal Extensions
Llama 3 explores compositional integration of image, video, and speech modalities via adapters and pre-trained encoders. The vision module uses a ViT-H/14 encoder with cross-attention adapters, while the video module aggregates temporal features with perceiver resamplers. The speech interface employs a Conformer encoder and adapter, supporting ASR, AST, and spoken dialogue in 34 languages. These multimodal models are not yet publicly released but demonstrate competitive performance on image, video, and speech benchmarks.
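The adapter idea can be sketched as a gated cross-attention layer in which frozen language-model hidden states attend to projected vision-encoder features. Dimensions, gating, and placement below are illustrative (ViT-H width 1280 is assumed for the encoder), not Llama 3's exact design.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Minimal sketch of a cross-attention adapter for bolting a vision
    encoder onto a frozen language model: text hidden states attend to
    image features through a gated residual branch."""
    def __init__(self, d_model=4096, d_vision=1280, n_heads=32):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_model)   # map ViT features into LM width
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # start as identity (no visual signal)

    def forward(self, text_hidden, image_features):
        kv = self.proj(image_features)             # (B, num_patches, d_model)
        attn_out, _ = self.attn(text_hidden, kv, kv)  # queries come from the text stream
        return text_hidden + torch.tanh(self.gate) * attn_out
```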
Tool Use
Llama 3 natively supports tool use, including search engines, code interpreters, and mathematical engines, with both single-step and multi-step planning and execution.
Figure 7: Multi-step tool usage. Example of Llama 3 performing multi-step planning, reasoning, and tool calling to solve a task.
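A multi-step tool loop of this kind can be driven by a simple controller like the one sketched below. The tool names mirror the capabilities described above, but the JSON-style dispatch protocol, `model_step` callable, and registry are hypothetical illustrations, not Llama 3's actual tool-call format.

```python
# Hypothetical tool registry; real systems would call a search API, a sandboxed
# Python interpreter, and a math engine instead of returning stub strings.
TOOLS = {
    "search": lambda query: f"[top results for: {query}]",
    "python": lambda code: f"[stdout of executing: {code!r}]",
    "math":   lambda expr: f"[evaluation of: {expr}]",
}

def run_with_tools(model_step, user_message, max_steps=5):
    """Drive a multi-step plan: at each step the model either emits a tool
    call (a dict with "tool" and "arguments") or a final answer; tool
    outputs are appended to the conversation and fed back in.
    `model_step` is a hypothetical callable wrapping the LLM."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = model_step(messages)               # assumed to return a dict
        if reply.get("tool") in TOOLS:
            result = TOOLS[reply["tool"]](reply["arguments"])
            messages.append({"role": "tool", "content": result})
        else:
            return reply["content"]                # final answer
    return "[stopped: step budget exhausted]"
```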
Evaluation and Benchmarks
Llama 3 405B performs on par with GPT-4 and Claude 3.5 Sonnet on a wide range of benchmarks, including MMLU, MATH, HumanEval, and ARC Challenge. The 8B and 70B models are best-in-class for their size. Robustness analyses show minimal sensitivity to prompt format, label order, and answer permutations, especially for the 405B model. Contamination analysis indicates that, while some benchmarks are affected by training-data overlap, the impact is generally limited.
Safety and System-Level Guardrails
Safety is addressed at both model and system levels. Pre-training data is filtered for PII and unsafe content, and post-training includes adversarial and borderline prompts to balance violation and false refusal rates. Llama Guard 3, an 8B-parameter classifier, is released for input/output filtering across 13 harm categories, with quantized variants for efficient deployment. Red teaming and uplift studies for cybersecurity and CBRNE risks show no significant increase in risk compared to web search.
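At the system level, the classifier is typically used to gate both the user prompt and the model reply. The sketch below shows one plausible way to do this with the Transformers library; the model identifier, chat-template behaviour, and "safe"/"unsafe" verdict parsing are assumptions, so consult the Llama Guard 3 model card for the exact prompt format and harm taxonomy.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model ID; verify against the official release.
MODEL_ID = "meta-llama/Llama-Guard-3-8B"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
guard = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def is_safe(conversation) -> bool:
    """conversation: list of {"role": ..., "content": ...} dicts.
    Returns True if the classifier's verdict starts with "safe"."""
    inputs = tok.apply_chat_template(conversation, return_tensors="pt").to(guard.device)
    out = guard.generate(inputs, max_new_tokens=20)
    verdict = tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().lower().startswith("safe")

# Gate the user prompt (input filtering) and, later, the model reply (output filtering).
```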
Inference Optimization
Inference for the 405B model is enabled via pipeline parallelism across 16 GPUs and FP8 quantization for feedforward layers, yielding up to 50% throughput improvement with negligible impact on output quality. Micro-batching further enhances throughput-latency trade-offs.
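The core of FP8 quantization is computing a dynamic scale so that weights or activations fit the narrow float8 range. The sketch below shows row-wise scale computation and the cast (requires a PyTorch version with float8 dtypes); real deployments add details such as scale clamping and fused FP8 GEMM kernels, which are omitted here.

```python
import torch

FP8_MAX = 448.0  # max representable magnitude of the float8 e4m3 format

def quantize_fp8(x: torch.Tensor):
    """Minimal sketch of row-wise dynamic FP8 (e4m3) quantization of the
    kind applied to feed-forward weights/activations at inference time."""
    # One scale per row so a single outlier does not crush the whole tensor.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = FP8_MAX / amax
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale            # keep the scale to rescale after the matmul

# Example with a hypothetical feed-forward weight shape.
weight = torch.randn(4096, 14336)
w_fp8, w_scale = quantize_fp8(weight)
```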
Implications and Future Directions
Llama 3 demonstrates that scaling dense Transformer models with high-quality, diverse data and robust infrastructure yields models competitive with the best closed-source systems. The compositional approach to multimodality allows for modular extension without degrading text-only performance. The open release of pre-trained and post-trained models, including safety classifiers, provides a strong foundation for research and deployment in both academic and industrial settings.
Theoretical implications include the validation of scaling laws for both loss and downstream task accuracy, the effectiveness of compositional multimodal integration, and the diminishing returns of domain-specific annealing at extreme scale. Practically, Llama 3 sets a new standard for open models in multilinguality, code, reasoning, tool use, and safety, with robust infrastructure and deployment strategies.
Future work will likely focus on further scaling, more efficient architectures (potentially revisiting MoE or sparse models), improved multimodal integration, and enhanced system-level safety. The compositional adapter paradigm may become the default for extending LLMs to new modalities and capabilities. The open release strategy is expected to accelerate progress toward more general and robust AI systems.
Conclusion
Llama 3 establishes a new benchmark for open foundation models, combining scale, data quality, and engineering simplicity to deliver state-of-the-art performance across language, code, reasoning, and emerging multimodal tasks. The model's robust safety mitigations, infrastructure optimizations, and open release position it as a central resource for the research community and industry, with significant implications for the development and deployment of future AI systems.