Llama-2-chat Model Overview

Updated 7 April 2026

Llama-2-chat is an open-source conversational LLM that uses a decoder-only transformer architecture fine-tuned via supervised learning and RLHF for enhanced multi-turn dialogue.
It employs a multi-stage fine-tuning pipeline—including supervised fine-tuning, reward modeling, and reinforcement learning from human feedback—to optimize performance and safety.
The model integrates advanced behavioral steering techniques like Contrastive Activation Addition to adapt its responses and specialize in various domains.

Llama-2-chat is an open-source conversational LLM family developed by Meta, representing a series of decoder-only transformer architectures fine-tuned to optimize for multi-turn dialogue, instruction following, and safety. Building on the pre-trained Llama 2 base models, Llama-2-chat employs both supervised fine-tuning and reinforcement learning from human feedback (RLHF), resulting in a chat-optimized LLM that approaches or surpasses the conversational quality of closed-source systems on multiple benchmarks. The model suite—spanning scales from 7 billion to 70 billion parameters—serves as both a foundation for further research and a platform for practical deployment in academic and industrial settings (Touvron et al., 2023).

1. Model Architecture and Pretraining

Llama-2-chat adopts a pre-normalized, auto-regressive transformer backbone, using RMSNorm, SwiGLU activations, rotary positional embeddings (RoPE), and, in its largest variants, grouped-query attention (GQA). Model sizes and architectural details are as follows:

Model Size	Layers	Hidden Dim	Attention Heads	Context Window
7B	32	4096	32	4096
13B	40	5120	40	4096
70B	80	8192	64 (GQA)	4096
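As a rough illustration of what the table implies, the sketch below derives the per-head dimension and an approximate KV-cache footprint per token for each size. The layer, hidden-dimension, and head counts come from the table; the number of key/value heads for the 70B model (8, reflecting GQA) and the 2-byte (fp16) storage width are not in the table and should be read as assumptions.

```python
# Figures for layers/hidden/heads are taken from the table above.
# "kv_heads" is an assumption: full multi-head attention for 7B/13B,
# and 8 shared key/value heads for the 70B model under GQA.
CONFIGS = {
    "7B": {"layers": 32, "hidden": 4096, "heads": 32, "kv_heads": 32},
    "13B": {"layers": 40, "hidden": 5120, "heads": 40, "kv_heads": 40},
    "70B": {"layers": 80, "hidden": 8192, "heads": 64, "kv_heads": 8},
}


def head_dim(cfg):
    """Dimension of a single attention head."""
    return cfg["hidden"] // cfg["heads"]


def kv_cache_bytes_per_token(cfg, bytes_per_value=2):
    """Keys plus values (factor 2), one head_dim vector per KV head per layer."""
    return 2 * cfg["layers"] * cfg["kv_heads"] * head_dim(cfg) * bytes_per_value


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, head_dim(cfg), kv_cache_bytes_per_token(cfg))
```

Under these assumptions every size uses 128-dimensional heads, and the 70B model needs less cache per token than the 7B model because GQA shares key/value heads across query heads.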

Pretraining is performed on public corpora (web, code, books, Wikipedia) totaling approximately 2T tokens. Tokenization employs a 32K subword SentencePiece/BPE vocabulary. The next-token prediction objective is the standard autoregressive loss:

$\mathcal{L}_{LM} = -\sum_{t=1}^T \log p_\theta(x_t|x_{<t})$

Training is conducted with AdamW, a cosine learning rate schedule, and upsampling of high-factuality data to reduce hallucination (Touvron et al., 2023).
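The objective and schedule described here can be sketched in a few lines. This is a minimal illustration, not the training code: `next_token_nll` takes the probabilities the model assigned to the observed tokens, and `cosine_lr` is a generic cosine decay whose `warmup_steps` and `min_ratio` parameters are illustrative assumptions rather than values stated in this document.

```python
import math


def next_token_nll(token_probs):
    """Sum over t of -log p(x_t | x_<t), given the probability of each observed token."""
    return -sum(math.log(p) for p in token_probs)


def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_ratio=0.1):
    """Linear warmup followed by cosine decay to min_ratio * peak_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


if __name__ == "__main__":
    # Three tokens predicted with probabilities 1/2, 1/4, 1/8.
    print(next_token_nll([0.5, 0.25, 0.125]))
    print(cosine_lr(0, 100, 1.0), cosine_lr(100, 100, 1.0))
```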

2. Fine-Tuning, Alignment, and Safety Mechanisms

Llama-2-chat is distinguished by a multi-stage fine-tuning pipeline:

Supervised Fine-Tuning (SFT): The base model is first trained on 27,540 human-written dialogue pairs. The objective is masked so that only assistant responses, not user prompts, contribute to the loss (Touvron et al., 2023).
Reward Modeling: Human annotators provide ∼2.9M binary comparisons along "helpfulness" and "safety" axes, used to train specialized reward models via margin-augmented pairwise ranking.
Reinforcement Learning from Human Feedback (RLHF): The PPO algorithm is employed with a composite reward:

$R(g|p) = \mathrm{logit}(R_c(g|p)) - \beta D_{KL}(\pi_\theta(\cdot|p) \Vert \pi_0(\cdot|p))$

where $R_c$ is the contextually-adapted reward (helpfulness vs. safety), and $\beta$ regularizes policy divergence. Both rejection sampling and PPO stages are used (Touvron et al., 2023).
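The reward-model loss and the composite reward can be sketched as follows. This is a hedged illustration under stated assumptions: the margin-augmented ranking loss is written as $-\log \sigma(r_{chosen} - r_{rejected} - m)$, the KL term is approximated per sample by the log-probability difference between the policy and the reference model, and the rule in `combined_score` (prefer the safety score for safety prompts or when it falls below a threshold of 0.15) is an assumption about how the helpfulness and safety scores are combined, not something stated in this section.

```python
import math


def pairwise_ranking_loss(r_chosen, r_rejected, margin=0.0):
    """-log(sigmoid(r_chosen - r_rejected - margin)), written stably as log(1 + e^-z)."""
    z = r_chosen - r_rejected - margin
    return math.log1p(math.exp(-z))


def logit(p):
    """Inverse sigmoid; p must lie strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def combined_score(safety_score, helpfulness_score, is_safety_prompt, threshold=0.15):
    """Assumed selection rule for R_c: fall back to the safety score when safety is at stake."""
    if is_safety_prompt or safety_score < threshold:
        return safety_score
    return helpfulness_score


def ppo_reward(r_c, logp_policy, logp_ref, beta):
    """R = logit(R_c) - beta * KL, with KL estimated from one sampled response."""
    kl_estimate = logp_policy - logp_ref
    return logit(r_c) - beta * kl_estimate


if __name__ == "__main__":
    print(pairwise_ranking_loss(1.0, 0.0), pairwise_ranking_loss(1.0, 0.0, margin=1.0))
    print(ppo_reward(combined_score(0.8, 0.9, False), -12.0, -12.5, beta=0.01))
```

A larger margin raises the loss for the same score gap, which pushes the reward model to separate responses that annotators rated as clearly different.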

System Message Consistency (Ghost Attention): System prompts (e.g., an instruction to answer in haiku or in a particular language) are prepended to every turn but their loss is masked out, enhancing multi-turn adherence.
Safety and Red Teaming: Fine-tuning incorporates adversarial ("red team") prompts, safety RLHF with dedicated reward models, and context distillation via safety pre-prompts (Touvron et al., 2023).
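Both the SFT objective and Ghost Attention rely on masking parts of the sequence out of the loss. The sketch below shows one simple way to build such a mask and apply it; the `(role, n_tokens)` turn representation and the function names are hypothetical, chosen only for illustration.

```python
def build_loss_mask(turns, trainable_roles=("assistant",)):
    """turns is a list of (role, n_tokens); tokens of trainable roles get mask 1, others 0."""
    mask = []
    for role, n_tokens in turns:
        mask.extend([1 if role in trainable_roles else 0] * n_tokens)
    return mask


def masked_nll(token_logprobs, loss_mask):
    """Mean negative log-likelihood over the positions where the mask is 1."""
    total = -sum(lp for lp, m in zip(token_logprobs, loss_mask) if m)
    count = sum(loss_mask)
    return total / count if count else 0.0


if __name__ == "__main__":
    turns = [("system", 3), ("user", 2), ("assistant", 4)]
    mask = build_loss_mask(turns)
    logprobs = [-5.0] * 5 + [-1.0, -2.0, -1.0, -2.0]
    print(mask, masked_nll(logprobs, mask))
```

In this toy example the system and user tokens have poor log-probabilities, but they do not affect the loss because their mask entries are zero.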

3. Behavioral Steering by Contrastive Activation Addition (CAA)

Contrastive Activation Addition (CAA) provides a model-agnostic mechanism to steer Llama-2-chat toward or away from complex behaviors at inference. The method is as follows (Panickssery et al., 2023):

Steering Vector Computation: Define paired prompt sets $P$ (positive) and $N$ (negative) that differ only in the answer token, so that the difference isolates a target behavior (e.g., refusal, sycophancy). Compute the mean-difference steering vector at layer $L$:

$\Delta_L = \frac{1}{|P|} \sum_{x\in P} r_L(x) - \frac{1}{|N|} \sum_{x\in N} r_L(x)$

where $r_L(x)$ is the residual stream activation.

Inference-Time Intervention: At every token position $t$ after the user prompt, inject:

$r_L^{(t)} \leftarrow r_L^{(t)} + \alpha \, \Delta_L$

where $r_L^{(t)}$ is the residual stream activation at layer $L$ and position $t$, with the scalar multiplier $\alpha$ controlling steering direction and magnitude.

Empirical Results: CAA robustly alters category-specific behaviors (e.g., hallucination, refusal, sycophancy) with minimal capability loss (<2% perplexity change or ≤1% accuracy shift on MMLU). Steering is most effective at mid-model layers (13 for 7B, 14–15 for 13B). CAA is orthogonal and additive to system prompting and supervised fine-tuning (Panickssery et al., 2023).
Interpretability: PCA reveals that contrastive activation directions become linearly separable by behavior in mid-layers. Cosine similarity between residual-token representations and the steering vector $\Delta_L$ correlates with semantic content. Cross-layer and cross-model analysis shows steering vectors are robust and transferable.
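The CAA steps described in this section can be sketched with plain vector arithmetic. This is a minimal sketch, not the authors' implementation: activations are represented as Python lists, extracting them from a real model is out of scope, and the multiplier name `alpha` is an illustrative choice.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(pos_acts, neg_acts):
    """Mean-difference vector: mean over positive prompts minus mean over negative prompts."""
    mu_p, mu_n = mean_vector(pos_acts), mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_p, mu_n)]


def steer(residual, delta, alpha):
    """Add alpha * delta to one residual-stream activation; negative alpha steers away."""
    return [r + alpha * d for r, d in zip(residual, delta)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is the zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


if __name__ == "__main__":
    delta = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [2.0, 1.0]])
    print(delta, steer([0.0, 0.0], delta, 2.0), steer([0.0, 0.0], delta, -1.0))
```

In practice the same addition would be applied at the chosen layer for every generated token position, and `cosine_similarity` gives the per-token alignment with the steering direction used in the interpretability analysis.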

4. Bias, Refusal, and Societal Impact

Llama-2-chat, even after RLHF, demonstrates persistent and nuanced social biases, notably in gender, race, and religion. Activation steering, especially when combining bias and refusal vectors, exposes and manipulates these latent behaviors (Lu et al., 2024). Steering vectors for different biases are defined analogously to those for behaviors: $\Delta_L^{\text{bias}} = \mu_L^{+} - \mu_L^{-}$, where $\mu_L^{+}$ and $\mu_L^{-}$ are mean activations for stereotype/anti-stereotype sets at layer $L$.

Key observations include:

Baseline Llama-2-chat 7B refuses nearly all prompts designed to elicit racial/religious stereotypes; gender bias remains observable. Applying a pure bias vector alone increases refusal rather than surface-level bias content.
Simultaneous subtraction of a refusal vector reveals explicit bias; notably, RLHF collapses previously distinct bias axes in activation space, as measured by high cosine similarity across late layers.
Practical mitigation requires tuning the steering multiplier $\alpha$ (the scalar applied to each steering vector), careful dataset curation, and norm-preserving interventions.

Activation steering thus enables effective red-teaming (probing unsafe model outputs) and specific mitigation strategies, but highlights challenges in decoupling bias representations post-RLHF (Lu et al., 2024).
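One way to picture the combined intervention is the sketch below, which adds a bias vector, subtracts a refusal vector, and optionally rescales the result to the original activation norm. Rescaling to the original norm is only one plausible reading of "norm-preserving"; the function and parameter names are hypothetical and the vectors are toy values.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def combine_and_steer(residual, bias_vec, refusal_vec,
                      alpha_bias, alpha_refusal, preserve_norm=True):
    """Add alpha_bias * bias_vec, subtract alpha_refusal * refusal_vec, then optionally rescale."""
    steered = [r + alpha_bias * b - alpha_refusal * f
               for r, b, f in zip(residual, bias_vec, refusal_vec)]
    if preserve_norm:
        n_before, n_after = norm(residual), norm(steered)
        if n_after > 0:
            steered = [x * n_before / n_after for x in steered]
    return steered


if __name__ == "__main__":
    residual = [3.0, 4.0]
    out = combine_and_steer(residual, [1.0, 0.0], [0.0, 1.0], 2.0, 1.0)
    print(out, norm(out))
```

Keeping the norm fixed changes only the direction of the activation, which is one way to limit the side effects of large multipliers.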

5. Symbolic Reasoning and Scaling Trends

Llama-2-chat models exhibit emergent symbolic reasoning as size increases, assessed via synthetic tasks (ListOps, arithmetic with modular operations) (Petruzzellis et al., 2024).

Aggregate exact-match accuracies for 7B/13B/70B chat models are 50%/60%/67% (ListOps) and 20%/25%/30% (Arithmetic).
Fine-tuned models (e.g., MetaMath 70B) can further boost ListOps accuracy to 84%, but remain challenged by deeply nested formulas, where even 70B models drop below 50%.
Failure to generalize on deeply compositional structure suggests a limit to current pretraining and instruction tuning paradigms; future methods may require explicit decomposition or neuro-symbolic integrations (Petruzzellis et al., 2024).
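To make the notion of nesting depth concrete, the sketch below evaluates a ListOps-style expression and reports its depth, then groups exact-match accuracy by depth. It assumes the classic ListOps operator set (MAX, MIN, MED, and SM for sum modulo 10) and uses the lower median for even-length lists; the benchmark variant used in the cited study may differ, and `predict` is a hypothetical stand-in for a model call.

```python
import statistics

# Assumed operator set; MED uses the lower median to stay in integers.
OPS = {
    "MAX": max,
    "MIN": min,
    "MED": statistics.median_low,
    "SM": lambda xs: sum(xs) % 10,
}


def _parse(tokens, pos):
    """Return (value, nesting_depth, next_position) for the sub-expression at pos."""
    tok = tokens[pos]
    if tok != "[":
        return int(tok), 0, pos + 1
    op = OPS[tokens[pos + 1]]
    pos += 2
    args, max_child_depth = [], 0
    while tokens[pos] != "]":
        value, depth, pos = _parse(tokens, pos)
        args.append(value)
        max_child_depth = max(max_child_depth, depth)
    return op(args), max_child_depth + 1, pos + 1


def evaluate(expr):
    """Evaluate an expression such as '[MAX 2 9 [MIN 4 7 ] 0 ]' and return (value, depth)."""
    tokens = expr.replace("[", " [ ").replace("]", " ] ").split()
    value, depth, _ = _parse(tokens, 0)
    return value, depth


def accuracy_by_depth(expressions, predict):
    """Exact-match accuracy grouped by nesting depth; predict maps an expression to an int."""
    stats = {}
    for expr in expressions:
        gold, depth = evaluate(expr)
        correct, total = stats.get(depth, (0, 0))
        stats[depth] = (correct + int(predict(expr) == gold), total + 1)
    return {d: c / t for d, (c, t) in stats.items()}


if __name__ == "__main__":
    print(evaluate("[MAX 2 9 [MIN 4 7 ] 0 ]"))
    print(evaluate("[SM 5 [MED 1 3 8 ] 7 ]"))
```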

6. Domain Specialization and Downstream Adaptation

Domain-specialized Llama-2-chat variants (e.g., AstroLLaMA-Chat) are produced by continual pretraining of the base model on curated corpora (e.g., astrophysics papers) followed by supervised instruction tuning on domain-specific Q&A. Such models demonstrate (Perkowski et al., 2024):

Reduced perplexity and increased exact-match accuracy on the target domain relative to the base model.
Improved recall of highly specialized concepts, although hallucination and multi-turn limitations persist.
Simple recipes—domain-specific pretraining followed by SFT on style-appropriate dialogues—are sample-efficient, competitive for low-to-mid-scale LLMs, and do not require RLHF or ranking loss.
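The two metrics named above can be computed as in the following sketch, which compares a base and a domain-adapted model on the same text. The log-probabilities and answers are made-up placeholder values for illustration only, not results from the cited work, and the case-insensitive string normalization in `exact_match` is an assumption.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def exact_match(predictions, references):
    """Fraction of predictions equal to the reference after trimming and lowercasing."""
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Placeholder per-token log-probabilities for the same domain text.
    base_logprobs = [-3.0, -2.5, -3.5]
    adapted_logprobs = [-2.0, -1.5, -2.5]
    print(perplexity(base_logprobs), perplexity(adapted_logprobs))
    print(exact_match(["Redshift ", "dark matter"], ["redshift", "dark energy"]))
```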

7. Inference-Time and In-Context Alignment

In-context alignment methods demonstrate that a vanilla Llama-2 model (one that has not been fine-tuned) can achieve near-Llama-2-chat quality by prepending a handful of ChatGPT-distilled prompt–response pairs, retrieved for topical and stylistic relevance (Han, 2023).

This approach yields a 7× win-rate improvement versus zero-shot Llama-2-vanilla and approaches fine-tuned Llama-2-chat on standard chat benchmarks, without any modification to model weights.
The method is efficient, interpretable, and flexible; alignment can be reconfigured in real-time by swapping demo pools.
However, it is limited in multi-turn and long-context chat, and is only as robust as the demo set.
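The retrieve-then-prepend idea can be sketched as follows. This is a simplified stand-in rather than the method of Han (2023): demonstrations are ranked by bag-of-words cosine similarity instead of a learned retriever, and the `User:`/`Assistant:` template and the demo-pool format are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, demo_pool, k=3):
    """Return the k demonstrations whose prompts are most similar to the query."""
    q = bow(query)
    ranked = sorted(demo_pool, key=lambda d: cosine(q, bow(d["prompt"])), reverse=True)
    return ranked[:k]


def build_prompt(query, demos):
    """Prepend the demonstrations to the query; the base model continues after 'Assistant:'."""
    parts = [f"User: {d['prompt']}\nAssistant: {d['response']}" for d in demos]
    parts.append(f"User: {query}\nAssistant:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    pool = [
        {"prompt": "How do stars form in galaxies?", "response": "Stars form when gas clouds collapse."},
        {"prompt": "What is a good pasta recipe?", "response": "Try a simple tomato sauce."},
    ]
    query = "How do stars form?"
    print(build_prompt(query, retrieve(query, pool, k=1)))
```

Swapping `pool` for a different set of demonstrations changes the induced style without touching model weights, which is the reconfigurability noted above.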

A plausible implication is that substantial aspects of conversational alignment can be induced at inference, separating "style transfer" from the need for costly RLHF or large-scale fine-tuning.

In summary, Llama-2-chat is a family of scaled, open-source transformer models optimized for chat via a combination of SFT, RLHF, safety data, and system-prompting mechanisms. Recent research has elucidated mechanisms for precise behavioral steering (CAA, activation steering), analyzed the persistence and manipulability of societal biases, charted model scaling trends in symbolic reasoning, and demonstrated efficient adaptation to new domains and operating regimes. Continued advances in interpretability, steering, and alignment methodologies remain central challenges as the Llama-2-chat suite is integrated into safety-critical and knowledge-intensive applications (Touvron et al., 2023; Panickssery et al., 2023; Petruzzellis et al., 2024; Lu et al., 2024; Perkowski et al., 2024; Han, 2023).