The Llama 3 Herd of Models (2407.21783v1)

Published 31 Jul 2024 in cs.AI, cs.CL, and cs.CV

Abstract: Modern AI systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of LLMs that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading LLMs such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter LLM and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

This paper introduces the Llama 3 family of foundation LLMs, developed by Meta AI. The family comprises dense Transformer models at 8B and 70B parameters plus a flagship 405B-parameter model, all designed to be multilingual and proficient in coding, reasoning, and tool usage, with context windows of up to 128K tokens. The paper details the development process, focusing on three key levers: data, scale, and managing complexity.

Development Strategy:

  • Data: Llama 3 was pre-trained on a significantly larger and higher-quality dataset compared to Llama 2, totaling approximately 15.6 trillion multilingual tokens (vs. 1.8T for Llama 2). This involved rigorous pre-processing, PII/safety filtering, deduplication (URL, document, line-level), heuristic filtering, and model-based quality filtering using custom classifiers. Specific pipelines were developed for code and math data. The data mix (roughly 50% general knowledge, 25% math/reasoning, 17% code, 8% multilingual) was determined using scaling law experiments. Data annealing with high-quality sources was used in the final pre-training stages.
  • Scale: The 405B model represents a massive scale-up, trained using 3.8 × 10^25 FLOPs (nearly 50x the compute used for Llama 2 70B; see the back-of-the-envelope check after this list). While the 405B model size is near compute-optimal for this budget, the smaller 8B and 70B models were trained significantly longer than compute-optimal to enhance performance at their respective inference budgets.
  • Managing Complexity: A standard dense Transformer architecture was chosen over Mixture-of-Experts (MoE) for better training stability. Post-training relied on simpler, scalable methods like Supervised Finetuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO) instead of more complex RL algorithms.
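
The compute figure above can be sanity-checked with the common 6·N·D approximation for dense-Transformer training FLOPs (N parameters, D training tokens). This is only a back-of-the-envelope estimate under that approximation, not the paper's exact accounting:

```python
# Rough training-FLOPs estimate for the 405B model using the common
# 6 * N * D approximation (N = parameters, D = training tokens).
n_params = 405e9      # 405B parameters
n_tokens = 15.6e12    # ~15.6T pre-training tokens

train_flops = 6 * n_params * n_tokens
print(f"~{train_flops:.1e} FLOPs")  # ~3.8e+25, consistent with the reported budget
```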

Model Architecture & Pre-training:

  • Architecture: Llama 3 uses a standard Transformer architecture with a few modifications relative to Llama 2: Grouped Query Attention (GQA) for faster inference, an attention mask preventing cross-document attention (important for long context), a larger 128K-token vocabulary (tiktoken base plus multilingual additions for better compression), and an increased RoPE base frequency (500,000) for longer context support (illustrated in the sketch after this list).
  • Scaling Laws: A two-stage methodology was used to predict downstream task performance from training FLOPs: first mapping compute to negative log-likelihood (NLL) on downstream tasks, then mapping NLL to task accuracy. This informed the selection of the 405B parameter count and the 15.6T-token training dataset size.
  • Infrastructure: Training utilized Meta's production clusters with up to 16,000 H100 GPUs on RoCE or InfiniBand networks. Custom optimizations were made for network topology awareness, load balancing (E-ECMP), and congestion control. Storage relied on the Tectonic distributed file system.
  • Parallelism: 4D parallelism (Tensor, Pipeline, Context, and Fully Sharded Data Parallelism - FSDP) was employed. Significant improvements were made to pipeline parallelism for flexibility, memory/computation balancing, and reducing bubbles. Context parallelism used an all-gather approach for sequence dimension partitioning, enabling 128K context. Network-aware configuration and FP32 gradient accumulation ensured stability. Customizations to NCCL (NCCLX) improved collective communication performance. High reliability (>90% effective training time) was achieved despite frequent hardware issues common at this scale.
  • Training Recipe: Pre-training involved three stages: 1) Initial pre-training with AdamW, cosine schedule, and scaled batch sizes/sequence lengths. 2) Long-context pre-training, gradually increasing context length up to 128K over ~800B tokens. 3) Annealing over the final 40M tokens with learning rate decay, upsampling high-quality data, and Polyak averaging.
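
To make the RoPE change above concrete, the sketch below computes rotary-embedding angles for a given base frequency. Raising the base from the conventional 10,000 to 500,000 slows the lowest-frequency rotations, so distant positions remain distinguishable across a 128K-token context. The head dimension and function name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rope_angles(positions, head_dim=128, base=500_000.0):
    """Rotary position embedding angles for each (position, channel pair)."""
    # One inverse frequency per pair of channels in the head dimension.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)  # shape: (len(positions), head_dim // 2)

positions = np.arange(0, 131_072, 16_384)               # sample positions up to 128K
angles_llama2 = rope_angles(positions, base=10_000.0)   # Llama 2-style base
angles_llama3 = rope_angles(positions, base=500_000.0)  # Llama 3 base
# With base=500,000 the slowest channel rotates through a small fraction of a turn
# over 128K positions, versus more than two full turns at base=10,000.
```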

Post-training (Alignment & Capabilities):

  • Process: Iterative rounds of SFT and DPO were applied using human and synthetic data, guided by a reward model (RM). A new multi-message chat protocol was introduced for complex interactions like tool use.
  • Modeling: A reward model (RM) was trained on human preference data. SFT used rejection-sampled and synthetic data. DPO aligned models with preferences, with modifications such as masking formatting tokens in the loss and adding NLL regularization for stability (sketched after this list). Model averaging was used across stages.
  • Data: Human preference data (82% English, 7% Code, 5% Multilingual, 6% Reasoning/Tools) collected with 4-level preference strength ratings and optional editing. SFT data included rejection-sampled outputs (improved with PagedAttention), synthetic data for specific capabilities, and some human data. Data cleaning and pruning used rule-based filtering and model-based techniques (topic/quality/difficulty scoring, semantic deduplication).
  • Specific Capabilities:
    • Code: Trained code expert, generated synthetic data using execution feedback, language translation, and backtranslation. Used system prompts for quality and execution-based filtering.
    • Multilinguality: Trained multilingual expert, sourced data from human annotations, NLP tasks, rejection sampling, and translated reasoning data.
    • Math/Reasoning: Sourced/generated prompts, added step-by-step traces, filtered incorrect reasoning using RMs/MCTS, interleaved code/text reasoning, used error correction feedback.
    • Long Context: Generated synthetic SFT data (QA, summarization, code reasoning), mixed 0.1% long-context data into SFT. Used standard short-context DPO.
    • Tool Use: Trained to use search, Python interpreter, Wolfram Alpha. Used human annotations with message-level feedback, bootstrapped with synthetic data. Improved zero-shot function calling using mined/synthetic data.
    • Factuality: Developed knowledge probing technique to generate data encouraging refusal when unsure, used limited labeled data for sensitive topics.
    • Steerability: Collected preference data for system prompt adherence (length, format, tone, persona), used data in alignment stages.
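
The DPO modification mentioned in the Modeling bullet can be written compactly: the standard DPO objective plus a negative log-likelihood term on the chosen responses. The sketch below is a minimal illustration of that combination; the coefficient values, function name, and per-sequence log-probability inputs are assumptions, not the paper's exact recipe (which also masks special formatting tokens from the loss):

```python
import torch.nn.functional as F

def dpo_with_nll_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      beta=0.1, nll_weight=0.2):
    """DPO loss with an extra NLL regularizer on the chosen responses.

    Inputs are per-sequence log-probabilities under the trained policy and the
    frozen reference model; `beta` and `nll_weight` are illustrative values.
    """
    # Standard DPO: prefer chosen over rejected responses relative to the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # NLL term on the chosen responses, keeping the policy close to good completions.
    nll_loss = -policy_chosen_logps.mean()

    return dpo_loss + nll_weight * nll_loss
```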

Evaluation Results:

  • Pre-trained: Llama 3 8B/70B outperform similar-sized models across standard benchmarks. 405B model is competitive with other frontier models and significantly better than previous open models. Models show robustness to MCQ setup variations (labels, order, prompt format). Adversarial benchmark performance varies by task. Contamination analysis shows variable impact depending on the benchmark.
  • Post-trained: Outperform competitors in respective size classes on general knowledge, instruction following, coding, multilingual tasks, math/reasoning, long context, and tool use benchmarks. 405B model is competitive with or surpasses GPT-4, GPT-4o, and Claude 3.5 Sonnet on many benchmarks.
  • Human Evaluations: Llama 3 405B performs comparably to GPT-4 (0125), with mixed results against GPT-4o and Claude 3.5 Sonnet across different capabilities, confirming its competitiveness.

Safety:

  • Approach: Safety was integrated throughout development via pre-training data filtering, safety-focused finetuning (SFT/DPO) that balances violation rate (VR) against false refusal rate (FRR) (see the sketch after this list), and iterative red teaming. Custom benchmarks based on the ML Commons taxonomy were used.
  • Mitigation: Used high-quality human and synthetic safety data, including borderline examples. Tailored safety data mix for different model sizes. Developed specific mitigations for multilingual, long-context (many-shot jailbreaking), and tool use risks.
  • Risk Assessment: CyberSecEval showed no significant susceptibilities in malicious code generation but some vulnerability to prompt injection and code interpreter abuse. Uplift studies for cyber attacks and chemical/biological weapons showed Llama 3 usage did not significantly increase capabilities compared to existing resources (web search).
  • Red Teaming: Identified risks through expert red teaming across capabilities (multi-turn refusal suppression, hypotheticals, personas, gradual escalation, multilingual mixing, unsafe tool chaining). Assessed child safety risks.
  • System-Level: Released Llama Guard 3 (8B safety classifier for inputs/outputs, supporting multilingual/tool use), Prompt Guard (attack detection), and Code Shield (insecure code detection). Llama Guard significantly reduces VR at the cost of increased FRR, offering configurable safety.
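
The VR/FRR trade-off above reduces to two simple proportions over a labeled evaluation set. The sketch below shows one way such metrics could be computed; the field names and data layout are hypothetical, not the paper's benchmark format:

```python
def safety_metrics(results):
    """Compute violation rate (VR) and false refusal rate (FRR).

    `results` is a list of dicts with hypothetical fields:
      - "prompt_type": "unsafe" (should be refused) or "benign" (should be answered)
      - "violated":    the response violated the safety policy
      - "refused":     the model declined to answer
    """
    unsafe = [r for r in results if r["prompt_type"] == "unsafe"]
    benign = [r for r in results if r["prompt_type"] == "benign"]

    # VR: fraction of unsafe prompts that elicited a policy-violating response.
    vr = sum(r["violated"] for r in unsafe) / max(len(unsafe), 1)
    # FRR: fraction of benign prompts the model refused to answer.
    frr = sum(r["refused"] for r in benign) / max(len(benign), 1)
    return vr, frr
```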

Inference:

  • Pipeline Parallelism: BF16 inference for the 405B model uses 16 GPUs across two nodes, with tensor parallelism within each node and pipeline parallelism across nodes. Micro-batching improves throughput.
  • FP8 Quantization: Quantized FFN layers using FP8 with dynamic scaling factors and mitigations (skipping end layers, bounding scales, row-wise quantization) to maintain quality. Achieved up to 50% pre-fill throughput gain and better decode throughput-latency trade-off compared to BF16. Implementation released.
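
A minimal sketch of the row-wise FP8 quantization idea described above, applied here to a single weight matrix for simplicity: each row gets a dynamic scaling factor, capped so that rows with near-zero magnitudes cannot produce huge multipliers. The cap value, shapes, and function name are illustrative assumptions, not the released kernel:

```python
import torch

FP8_MAX = 448.0  # largest representable magnitude in the float8_e4m3fn format

def quantize_rowwise_fp8(weight: torch.Tensor, scale_ub: float = 1200.0):
    """Quantize a 2-D tensor to FP8 with one dynamic scaling factor per row.

    The upper bound on the scaling factor (`scale_ub`, an illustrative value)
    keeps near-zero rows from blowing up, mirroring the mitigations above.
    """
    row_absmax = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = (FP8_MAX / row_absmax).clamp(max=scale_ub)  # per-row scaling factor
    q = (weight.float() * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale  # dequantize with q.float() / scale

# Example (requires a PyTorch build with float8 support):
# q, s = quantize_rowwise_fp8(torch.randn(4096, 14336, dtype=torch.bfloat16))
```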

Multimodal Experiments (Vision & Speech):

  • Approach: Compositional integration of pre-trained encoders via adapters, keeping the LLM core unchanged for text tasks. Models still under development, not released.
  • Vision: Used a ViT-H-based image encoder and a cross-attention image adapter trained on billions of image-text pairs (a simplified adapter layer is sketched after this list). Video capabilities added via a temporal aggregator and video cross-attention layers. Addressed scaling challenges (heterogeneity, numerical stability). Post-training involved SFT, DPO, an RM, rejection sampling, and quality tuning. Achieved competitive results on image (MMMU, VQAv2, DocVQA, etc.) and video (PerceptionTest, NExT-QA, etc.) benchmarks.
  • Speech: Integrated a 1B-parameter Conformer encoder and an adapter feeding speech embeddings directly into the LLM. Supports ASR, AST, and spoken dialogue in 34 languages. The encoder was pre-trained with self-supervision, then the encoder and adapter were finetuned while the LLM stayed frozen. Achieved state-of-the-art results on ASR/AST benchmarks and demonstrated zero-shot multi-turn and code-switching dialogue. A streaming TTS system leveraging Llama 3 embeddings was developed for improved text normalization and prosody, enhancing naturalness.
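
To make the compositional adapter approach concrete, the sketch below shows the shape of a cross-attention adapter block that lets frozen LLM hidden states attend to image-encoder outputs. The dimensions, zero-initialized gate, and class name are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Illustrative adapter block: frozen-LLM hidden states attend to vision tokens.

    In a compositional setup like the one described above, blocks of this shape are
    interleaved with the frozen layers of the language model and are the main newly
    trained parameters, leaving the text-only path unchanged.
    """
    def __init__(self, d_model: int = 4096, n_heads: int = 32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Zero-initialized gate so the adapter starts as an identity mapping and
        # does not disturb pre-trained text behaviour at the start of training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, vision_tokens):
        # text_hidden:   (batch, text_len, d_model) from the frozen LLM
        # vision_tokens: (batch, num_patches, d_model) from the image encoder + projection
        attended, _ = self.cross_attn(self.norm(text_hidden), vision_tokens, vision_tokens)
        return text_hidden + torch.tanh(self.gate) * attended
```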

Conclusion:

Llama 3 models represent a significant step forward, achieving performance comparable to leading closed models. Success was driven by focusing on data quality, scale, and simplicity. The open release of the 405B model aims to spur innovation and responsible AI development. Preliminary multimodal results are promising. Organizational factors were also key to success.

Authors (531)
  1. Abhimanyu Dubey (35 papers)
  2. Abhinav Jauhri (4 papers)
  3. Abhinav Pandey (9 papers)
  4. Abhishek Kadian (9 papers)
  5. Ahmad Al-Dahle (2 papers)
  6. Aiesha Letman (1 paper)
  7. Akhil Mathur (21 papers)
  8. Alan Schelten (5 papers)
  9. Amy Yang (3 papers)
  10. Angela Fan (49 papers)
  11. Anirudh Goyal (93 papers)
  12. Anthony Hartshorn (6 papers)
  13. Aobo Yang (8 papers)
  14. Archi Mitra (1 paper)
  15. Archie Sravankumar (2 papers)
  16. Artem Korenev (2 papers)
  17. Arthur Hinsvark (3 papers)
  18. Arun Rao (2 papers)
  19. Aston Zhang (48 papers)
  20. Aurelien Rodriguez (5 papers)
Citations (1,756)