
LLaMA-2 & LLaMA-3: Evolution and Scaling

Updated 27 September 2025
  • LLaMA-2 and LLaMA-3 are open-source, large-scale autoregressive Transformer models spanning dense and sparsely activated architectures, with parameter counts ranging from 7B to over 400B.
  • They integrate advanced scaling techniques, such as extended context windows up to 128K tokens, RoPE adjustments, and innovations like LlamaPro and MSG for depth/width improvements.
  • Domain adaptation and fine-tuning methods—including RLHF, LoRA, and specialized tokenization—enable these models to excel in tasks like code generation, multimodality, and safety reinforcement.

The LLaMA-2 and LLaMA-3 families comprise a series of open-source, large-scale autoregressive LLMs introduced by Meta and subsequently advanced by the broader research community. These families include dense and sparsely activated Transformer architectures, span a wide range of parameter counts (from 7B to over 400B), and have been trained with varying optimization strategies, data curation protocols, and multilingual as well as domain-specific applications. Their design emphasizes openness, extensibility, efficient scaling, and strong empirical performance across general, code, multimodal, and specialized scientific tasks.

1. Architectural Evolution and Model Scaling

The LLaMA-2 models utilize a standard dense Transformer decoder architecture, with key architectural features including pre-norm Transformer blocks, rotary positional embeddings, SwiGLU activations, and Grouped-Query Attention (GQA) in large variants. LLaMA-2 extends context window length from 2048 to 4096 tokens and optimizes inference efficiency by employing GQA in the 34B and 70B models, reducing memory demands without significant performance loss.
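A minimal PyTorch sketch of grouped-query attention is given below, assuming illustrative head counts and weight shapes (the function name and dimensions are not taken from the official releases); it shows how several query heads share each key/value head, shrinking the KV cache roughly by the ratio of query to key/value heads.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads=32, n_kv_heads=8):
    """Sketch of GQA: groups of query heads attend over shared K/V heads.
    Assumed shapes: x (B, T, D); wq (D, D); wk, wv (D, n_kv_heads * head_dim)."""
    B, T, D = x.shape
    head_dim = D // n_q_heads
    q = (x @ wq).view(B, T, n_q_heads, head_dim).transpose(1, 2)    # (B, Hq, T, d)
    k = (x @ wk).view(B, T, n_kv_heads, head_dim).transpose(1, 2)   # (B, Hkv, T, d)
    v = (x @ wv).view(B, T, n_kv_heads, head_dim).transpose(1, 2)
    # Repeat each K/V head so consecutive groups of query heads share it.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    attn = (q @ k.transpose(-2, -1)) / head_dim**0.5
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)      # mask future tokens
    attn = attn.masked_fill(causal, float("-inf"))
    out = F.softmax(attn, dim=-1) @ v
    return out.transpose(1, 2).reshape(B, T, D)
```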

LLaMA-3 continues the dense Transformer paradigm but substantially expands scaling, offering models at 8B, 70B, 102B (Motif), and a flagship 405B parameters. Innovations include even longer context windows (up to 128K tokens via RoPE base adjustment to 500,000), aggressive multilingual data integration, more refined document-level attention masking, and architecture-level optimizations for batch and inference throughput. Scaling techniques such as LlamaPro (for depth expansion) and Masked Structure Growth (MSG, for width expansion) allow progressive, stable scaling without catastrophic forgetting (Lim et al., 4 Sep 2025).
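The long-context behavior described above hinges on the RoPE base adjustment. The following sketch (illustrative function names, not Meta's implementation) shows where the base enters the computation: raising it from the original 10,000 toward 500,000 slows each dimension's rotation, so positions far beyond the original training window remain distinguishable.

```python
import torch

def rope_angles(head_dim: int, max_pos: int, base: float = 500_000.0):
    """Precompute RoPE rotation angles; a larger base stretches the
    rotation periods, which supports longer context windows."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float()
    angles = torch.outer(positions, inv_freq)        # (max_pos, head_dim / 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    """Rotate query/key features pairwise in 2-D planes (x: ..., seq, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```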

A summary of selected scale and architecture parameters is given below:

  Model Family   Parameter Range   Context Window   Attention     Expansion Methods
  LLaMA-2        7B–70B            4096 tokens      GQA (large)   Core transformer
  Code Llama     7B–70B            up to 100K       GQA, FIM      Infilling tweaks
  LLaMA-3        8B–405B+          up to 128K       GQA, RoPE     Depth/width scaling

Extensions such as LLaMA-Pro expand pre-trained LLaMA-2 via block interleaving. Mixture-of-Experts configurations (LLaMA-MoE) partition FFN weights, enabling sparse activation and decoupling model capacity from inference cost (Zhu et al., 24 Jun 2024).
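As a simplified sketch of the FFN-partitioning idea (not the LLaMA-MoE implementation itself; class and parameter names are illustrative), a dense feed-forward block can be split into smaller experts with a learned top-k router, so only a fraction of the parameters is active per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseFFN(nn.Module):
    """Sketch of an MoE layer in the spirit of LLaMA-MoE: the dense FFN's
    neurons are divided into n_experts smaller FFNs; a router selects the
    top-k experts per token, decoupling capacity from per-token cost."""
    def __init__(self, d_model=4096, d_ff=11008, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff // n_experts), nn.SiLU(),
                          nn.Linear(d_ff // n_experts, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                  # loop form for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```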

2. Fine-Tuning, Post-Training, and Domain Adaptation

LLaMA-2 and its descendants are typically released as both base and fine-tuned “chat” or “instruct” models. LLaMA-2-Chat is produced using a two-stage procedure: supervised fine-tuning (SFT) on dialogue-style data, followed by reinforcement learning from human feedback (RLHF), where reward models for helpfulness and safety are optimized via pairwise ranking loss:

\mathcal{L}_{\mathrm{ranking}} = -\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r) - m(r)\big)

and PPO-style policy optimization with KL regularization:

R(g \mid p) = \tilde{R}_c(g \mid p) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta(g \mid p) \,\|\, \pi_0(g \mid p)\big)
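Both objectives translate directly into code. The sketch below uses illustrative function names; reward scores and log-probabilities are assumed to be computed elsewhere by the reward, policy, and reference models.

```python
import torch
import torch.nn.functional as F

def ranking_loss(r_chosen, r_rejected, margin):
    """Pairwise reward-model loss -log sigma(r(x, y_c) - r(x, y_r) - m(r)),
    where the margin m(r) grows with how strongly raters preferred y_c."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

def kl_regularized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """PPO-style reward R(g|p) = R_c(g|p) - beta * KL(pi_theta || pi_0),
    with the KL term estimated from sampled log-probabilities."""
    return reward - beta * (logp_policy - logp_ref)
```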

LLaMA-3 post-training includes additional steps for tool use, code, and reasoning, as well as precise domain adaptation pipelines (e.g., radiology and oncology), often utilizing parameter-efficient fine-tuning (LoRA, QLoRA) for local or resource-limited environments (Hou et al., 20 Aug 2024, Shi et al., 13 Aug 2024). Domain-specific LLaMA variants have been introduced for code (Code Llama), Tamil (Tamil-Llama), Korean (Motif), and specialized scientific domains (protein alignment, chemistry, malware detection) (Lim et al., 4 Sep 2025, Shu et al., 8 Nov 2024, Sun et al., 16 Mar 2025, O et al., 5 Nov 2024).
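As a rough illustration of the parameter-efficient fine-tuning idea, the following is a minimal LoRA adapter written directly in PyTorch rather than against any particular library's API; only the low-rank matrices A and B are trained while the pre-trained weight stays frozen.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update
    W + (alpha / r) * B @ A, so only r * (d_in + d_out) parameters are tuned."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # keep pre-trained weight frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```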

3. Empirical Benchmarks and Performance

LLaMA-2 and its code-specialized or augmented derivatives demonstrate strong results across academic benchmarks:

  • General Language: MMLU, Big Bench Hard, ARC, HellaSwag, TriviaQA, GSM8K.
  • Reasoning and Math: MATH, GSM8K, Program-of-Thought.
  • Code Generation: HumanEval, MBPP, MultiPL-E (Code Llama achieves up to 67% on HumanEval; the 7B variant outperforms LLaMA-2 70B on Python) (Rozière et al., 2023).
  • Multilingual and Non-English: Extended vocabulary and balanced data sampling in Tamil-Llama and Llama-3-Motif yield superior performance on Indic and Korean-specific benchmarks (Balachandran, 2023, Lim et al., 4 Sep 2025).
  • Domain-Specific: Fine-tuned LLaMA-3 models for radiology generate clinically relevant impressions with BERTScore F1 ≈ 0.88 and ROUGE-L ≈ 0.29, outperforming generic LLMs on domain-specific assessment (Shi et al., 13 Aug 2024).
  • Cybersecurity: SFT-fine-tuned LLaMA-3 (8B) attains 94% accuracy with a 4% false positive rate on DGA domain detection, outperforming conventional LSTM+attention models (O et al., 5 Nov 2024).
  • Chemistry: SynLlama (LLaMA-3 based) efficiently generates retrosynthetic pathways and analogs, reconstructing unseen molecules with high fingerprint similarity using 10–100× less data than prior methods (Sun et al., 16 Mar 2025).

4. Specialization, Adaptation, and Multimodality

LLaMA-family models serve as adaptable backbones for specialized tasks. Methods include:

  • Vocabulary expansion and domain-centric tokenization (e.g., 16,000 new tokens for Tamil) for efficiency, reduced token count, and fidelity (Balachandran, 2023).
  • Label-supervised adaptation, which projects output token representations into a low-cardinality label space, improves classification performance over instruction-tuned LLMs; removing the causal mask further enables state-of-the-art token classification (NER) (Li et al., 2023). A minimal sketch follows this list.
  • Integration of image, video, and speech modalities is achieved compositionally (via adapters for ViT and audio encoders), preserving text-only performance while supporting vision–language and speech tasks (Grattafiori et al., 31 Jul 2024).
  • LLaMA-MoE constructs MoE models by partitioning FFN weights, using neuron sharing and continual pre-training, allowing efficient scaling and expert specialization for particular data sources (Zhu et al., 24 Jun 2024).
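The sketch referenced in the label-supervised adaptation item above might look as follows, assuming a Hugging Face-style decoder backbone that returns last_hidden_state; the class name and the last-token pooling choice are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class LabelSupervisedHead(nn.Module):
    """Sketch of label-supervised adaptation: project the decoder's final
    hidden state onto a small label space and train with cross-entropy,
    rather than generating label words autoregressively."""
    def __init__(self, backbone, hidden_size=4096, n_labels=4):
        super().__init__()
        self.backbone = backbone                 # e.g. a LLaMA-2 decoder stack
        self.classifier = nn.Linear(hidden_size, n_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Pool the representation of the final non-padding token per sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.classifier(pooled)           # (batch, n_labels) logits
```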

Protein-focused multimodal LLMs combine LLaMA-3 text encoders with geometric deep learning models (GearNet, ScanNet) using contrastive learning losses:

\mathcal{L}_{\mathrm{total}} = \frac{1}{B} \sum_{i=1}^{B} -\log \left( \frac{e^{(\mathrm{sim}(g_i, t_i)+1)/(2\tau)}}{e^{(\mathrm{sim}(g_i, t_i)+1)/(2\tau)} + \sum_{j\neq i} e^{(\mathrm{sim}(g_i, t_j)+1)/(2\tau)}} \right)

where sim(g, t) is the normalized cosine similarity. Alignment is strengthened by large embedding dimensions, multi-layer projection heads, and LLM fine-tuning on protein text (Shu et al., 8 Nov 2024).
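A direct translation of this loss into code is sketched below (function and variable names are illustrative); in-batch negatives come from mismatched structure–text pairs, and the positive for row i sits on the diagonal.

```python
import torch
import torch.nn.functional as F

def protein_text_contrastive_loss(g_emb, t_emb, tau=0.07):
    """Shifted-cosine contrastive loss from the formula above: similarities
    are mapped from [-1, 1] to [0, 1], scaled by 2*tau, then contrasted
    against in-batch negatives via a softmax cross-entropy."""
    g = F.normalize(g_emb, dim=-1)               # structure-encoder embeddings
    t = F.normalize(t_emb, dim=-1)               # LLaMA-3 text embeddings
    logits = (g @ t.T + 1.0) / (2.0 * tau)       # (B, B) shifted, scaled similarities
    targets = torch.arange(g.size(0), device=g.device)
    return F.cross_entropy(logits, targets)
```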

5. Safety, Alignment, and Evaluation

Safety is prioritized in both pre- and post-training, with rigorous data filtering, adversarial prompt evaluation, and human preference feedback. LLaMA-2 and LLaMA-3 employ reward models for helpfulness and safety, Likert-style annotation on adversarial prompts, and open release of evaluation protocols (Touvron et al., 2023, Grattafiori et al., 31 Jul 2024). The Llama Guard 3 classifier acts as a deployable input/output filter, targeting categories like hate, defamation, and dangerous advice.

Studies of model internals (e.g., "Forbidden Facts") reveal that suppression of forbidden information is distributed among dozens of components—primarily attention heads and MLPs—with patchy, heuristic-based mechanisms vulnerable to adversarial triggers (e.g., the "California Attack") (Wang et al., 2023). This highlights both the difficulty of interpreting how large models balance safety and truthfulness and the brittleness of current alignment strategies.

6. Openness, Community, and Impact

Both LLaMA-2 and LLaMA-3 families emphasize open science: release of model weights, code, reproducible training/fine-tuning recipes, and responsible usage guidelines. Licenses allow for both academic and commercial usage (notably for Code Llama and LLaMA-3). Community-driven efforts have resulted in language-specialized (Tamil-Llama, Llama-3-Motif), domain-adapted (MGH Radiology Llama, SynLlama), and methodologically innovative variants (LLaMA-MoE, LLaMA-Pro).

The result is a foundation for broad research in model scaling, efficient adaptation (via LoRA, QLoRA, block expansion), robust domain adaptation, and multimodal AI. By reducing the technical and licensing barriers for large model experimentation, the LLaMA families have catalyzed the proliferation of open, reproducible research trajectories in foundation models, their fine-tuning, and specialized deployment.

7. Open Questions and Future Directions

The LLaMA-2 and LLaMA-3 series provide a basis for scalable and specialized LLM research, but several areas remain active topics of investigation:

  • Interpretability: Components responsible for safety and alignment are heuristically distributed and vulnerable, complicating reliable understanding and robust control (Wang et al., 2023).
  • Multimodal Expansion: While preliminary adapters for vision and speech exist, fully integrated, robustly evaluated multimodal foundation models are still under development (Grattafiori et al., 31 Jul 2024).
  • Data and Scaling: Future models will likely leverage even larger, higher-quality, and more balanced datasets, with systematic advances in efficient scaling (progressive depth/width, sparse Mixture-of-Experts).
  • Application Domains: As demonstrated by medical, cybersecurity, and scientific LLaMA variants, low-barrier adaptation for high-stakes or domain-constrained use cases will drive the next frontier.
  • Safety and Evaluation: Advances in adversarial evaluation, jailbreak detection, and input/output guardrails (Llama Guard 3) must keep pace with evolving model capacity and open deployment.

A plausible implication is that the continued synthesis of scaling methods, domain-specific adaptation, efficient fine-tuning, and robust safety engineering will define the progression of LLaMA-family models and their impact on both fundamental research and practical applications.
