
Llama 2 Chat Drafter 115M Overview

Updated 7 April 2026
  • Llama 2 Chat Drafter 115M is a compact Transformer model with just 1.64% of the teacher model’s parameters, designed for fast and memory-efficient speculative decoding.
  • It employs a three-stage training pipeline—pretraining, synthetic distillation, and fine-tuning with knowledge distillation—to ensure alignment with Llama 2 Chat 7B.
  • Empirical results demonstrate improved block efficiency and token throughput, achieving up to 2.4× speed-up in diverse real-world tasks.

Llama 2 Chat Drafter 115M is a compact, highly parameter-efficient Transformer-based draft model designed to accelerate chat-capable LLM inference—most notably for Llama 2 Chat 7B and larger models—through speculative decoding. Unlike full-sized target models, it contains only approximately 1.64% of the parameter count, facilitating fast, memory-efficient speculative proposals that can be selectively verified by the teacher model. This paradigm shifts the compute and memory bottleneck during autoregressive decoding, directly enabling scalable LLM deployment in both server and edge-device settings (Goel et al., 2024, Zafrir et al., 2024, Ramakrishnan et al., 3 Jul 2025).

1. Architecture and Model Specification

Llama 2 Chat Drafter 115M is architected as a Transformer-based student model tailored to match the tokenization, vocabulary, and context window of the teacher model (Llama 2 Chat 7B). The architectural details, as per (Goel et al., 2024), are as follows:

Model             Layers  Hidden Size  MLP Size  Attention Heads  Parameters
Llama 2 Chat 7B   32      4096         11,008    32               ~7B
Drafter-115M      4       1024         2,816     8                ~115M (1.64%)

Both utilize the SiLU activation and share tokenization scheme and max position embeddings, ensuring backward compatibility for speculative decoding. Alternative pipelines such as FastDraft recommend similar constraints: 8 layers, hidden sizes 512–768, and 8–12 attention heads, also yielding ~115M parameter counts with full vocabulary compatibility (Zafrir et al., 2024). For practical deployment and on-device acceleration, quantized variants (int8/int4) further reduce resource requirements to well below 200MB total RAM with negligible accuracy loss (Ramakrishnan et al., 3 Jul 2025).
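The parameter counts in the table can be sanity-checked with a back-of-the-envelope estimate. The sketch below is illustrative, not from the cited papers: it assumes a Llama-style decoder with a SwiGLU MLP (three weight matrices per layer), untied input/output embeddings, the 32,000-token Llama 2 vocabulary, and ignores norms and biases.

```python
# Rough parameter-count estimate for a Llama-style decoder.
# Assumptions (not from the cited papers): SwiGLU MLP with three weight
# matrices, untied input/output embeddings, vocab of 32,000, norms/biases ignored.

def llama_param_estimate(layers, hidden, mlp, vocab=32_000):
    attn = 4 * hidden * hidden       # Q, K, V, and output projections
    ffn = 3 * hidden * mlp           # gate, up, and down projections (SwiGLU)
    embeddings = 2 * vocab * hidden  # input embedding + LM head (untied)
    return layers * (attn + ffn) + embeddings

drafter = llama_param_estimate(layers=4, hidden=1024, mlp=2816)
print(f"Drafter estimate: {drafter / 1e6:.0f}M parameters")  # ~117M, close to 115M
```

The estimate lands within a few percent of the stated 115M, suggesting the table's hyperparameters are mutually consistent.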

2. Training Methodologies

The canonical training pipeline involves three sequential stages: pretraining, distillation dataset creation, and fine-tuning with knowledge distillation (Goel et al., 2024). A detailed breakdown:

  • Pretraining: The drafter is trained from scratch on 600B tokens of de-duplicated, public English text (excluding Llama 2 proprietary data) using next-token prediction loss:

$$L_\mathrm{pre}(\theta) = -\,\mathbb{E}_{(x_1,\dotsc,x_T)\sim D} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Training uses the AdamW optimizer with a linear warmup/decay schedule and DeepSpeed for distributed training.
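The pretraining objective above is ordinary next-token cross-entropy over the sequence; a minimal NumPy sketch (shapes and names illustrative) is:

```python
import numpy as np

# Minimal sketch of the next-token prediction (cross-entropy) loss above,
# computed from raw logits for one sequence. Shapes and names are illustrative.

def next_token_loss(logits, targets):
    """logits: (T, V) pre-softmax scores; targets: (T,) next-token ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative mean log-likelihood of the observed next tokens
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
loss = next_token_loss(rng.normal(size=(5, 10)), np.array([1, 4, 2, 0, 7]))
print(round(float(loss), 3))
```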

  • Distillation Data Generation: Generation of instruction–response pairs by prompting the target Llama 2 Chat-7B over public datasets such as OIG-small-chip2, OpenAssistant, Alpaca, and others. Diverse decoding settings (temperatures 0.0, 0.3, 0.7, 1.0; top-p=0.95) are employed to cover plausible output distributions. Data is optionally filtered for length, perplexity, and deduplicated (Zafrir et al., 2024).
  • Fine-Tuning (Knowledge Distillation): Finetuning uses a 1:9 mixture of synthetic distillation samples and pretraining corpus per batch, aligned with either white-box or sequence-level knowledge distillation. The process is monitored via block efficiency, and typically concludes after a few thousand steps (Goel et al., 2024, Zafrir et al., 2024).
Stage         Data                         Loss           Remarks
Pretrain      Public corpus (600B tokens)  Cross-entropy  Excludes proprietary Llama 2 data
Distillation  Teacher-generated pairs      —              Diverse temp/top-p, filtering
Distill FT    Mixture (1:9)                TVD++/CE       White-box or token-level divergence
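The 1:9 fine-tuning mixture can be sketched as a simple per-batch sampler. The `mixed_batch` helper and the tagged datasets below are hypothetical, purely to illustrate the ratio:

```python
import random

# Sketch of the 1:9 fine-tuning batch mixture: roughly one synthetic
# distillation sample for every nine pretraining samples per batch.
# The helper and datasets are illustrative, not from the cited papers.

def mixed_batch(distill_data, pretrain_data, batch_size=32, distill_frac=0.1):
    n_distill = max(1, round(batch_size * distill_frac))
    batch = random.sample(distill_data, n_distill)
    batch += random.sample(pretrain_data, batch_size - n_distill)
    random.shuffle(batch)
    return batch

batch = mixed_batch([("distill", i) for i in range(100)],
                    [("pretrain", i) for i in range(1000)], batch_size=30)
print(sum(1 for tag, _ in batch if tag == "distill"), "distillation samples of", len(batch))
```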

3. Loss Functions and Alignment Objectives

Alignment of the drafter model to the Llama 2 Chat teacher exploits both conventional and newly introduced losses, selected for their effect on speculative decoding acceptance rates:

  • Total Variation Distance (TVD):

$$\mathrm{TVD}(p_\theta, q) = \frac{1}{2} \sum_y \left| q(y \mid x) - p_\theta(y \mid x) \right|$$

  • TVD++ Loss: TVD++ augments standard TVD with variance reduction (advantage normalization):

$$\nabla_\theta \mathrm{TVD}^{++}(p_\theta, q) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log p_\theta(y_i \mid x)\, \frac{r(y_i) - \mu}{\sigma}$$

where $r(y) = \mathbf{1}\{q(y \mid x) > p_\theta(y \mid x)\}$, with $\mu$ and $\sigma$ the batch mean and standard deviation of $r$. This formulation is inspired by policy-gradient methods and stabilizes the gradient estimates (Goel et al., 2024).
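A small numerical sketch of TVD and the normalized TVD++ reward over a toy vocabulary (distributions and names illustrative):

```python
import numpy as np

# Sketch of the TVD between teacher q and drafter p over a small vocabulary,
# plus the normalized reward used in the TVD++ gradient above.
# Distributions and names are illustrative, not from the cited papers.

def tvd(p, q):
    return 0.5 * np.abs(np.asarray(q) - np.asarray(p)).sum()

def normalized_rewards(p, q, samples):
    # r(y) = 1 where the teacher puts more mass on y than the drafter does
    r = np.array([1.0 if q[y] > p[y] else 0.0 for y in samples])
    return (r - r.mean()) / (r.std() + 1e-8)  # advantage normalization

p = [0.5, 0.3, 0.2]  # drafter distribution over a 3-token vocabulary
q = [0.4, 0.4, 0.2]  # teacher distribution
print(round(float(tvd(p, q)), 6))  # 0.1: distributions disagree on 10% of mass
```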

  • Alternative Losses: Token-level Kullback–Leibler divergence (KL) and sequence-level cross-entropy are also employed in certain pipelines, sometimes in a hybrid (KL/NLL) configuration, especially for online or adaptive settings (Zafrir et al., 2024, Ramakrishnan et al., 3 Jul 2025).

Ablation studies confirm that TVD++ yields superior empirical acceptance/block efficiency during speculative decoding relative to both TVD and KL losses (Goel et al., 2024).

4. Speculative Decoding Integration

Speculative decoding leverages the drafter to propose γ\gamma tokens per block, with the teacher model verifying and accepting as many as possible. The decoding proceeds as follows (Goel et al., 2024):

  1. The drafter proposes $\gamma$ tokens $\hat{y}_{1:\gamma} \sim p_\theta(\cdot \mid x)$.
  2. For each drafted token, the teacher computes the ratio $\min\!\left(1, \frac{q(\hat{y}_i \mid x, \hat{y}_{<i})}{p_\theta(\hat{y}_i \mid x, \hat{y}_{<i})}\right)$ and samples $u \sim U[0,1]$; the token is accepted if $u$ does not exceed this ratio.
  3. Accepted tokens are appended to the output, and the context updated. The loop repeats until generation ends.
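The draft–verify steps above can be sketched as a toy accept loop; `p_draft` and `q_teacher` are stand-ins for the two models' conditional distributions, which real systems compute from logits at each position:

```python
import random

# Toy sketch of one draft-verify cycle over a tiny vocabulary.
# p_draft and q_teacher are stand-ins for the drafter's and teacher's
# conditional distributions; real systems compute these from logits.

def verify_block(draft_tokens, p_draft, q_teacher):
    """Return the prefix of drafted tokens accepted by the teacher."""
    accepted = []
    for tok in draft_tokens:
        ratio = min(1.0, q_teacher[tok] / p_draft[tok])
        if random.random() <= ratio:  # accept with probability min(1, q/p)
            accepted.append(tok)
        else:
            break                     # first rejection ends the block
    return accepted

p_draft = {"a": 0.7, "b": 0.3}
q_teacher = {"a": 0.9, "b": 0.1}
print(verify_block(["a", "a", "a"], p_draft, q_teacher))  # all "a"s accepted
```

Because the teacher assigns "a" more mass than the drafter does, its accept ratio is 1 and the whole block is kept; a token the drafter over-proposes (here "b") would be rejected with probability $1 - q/p$.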

This speculative workflow is compatible with both pure teacher–drafter pairs and more advanced, cross-vocabulary, or adaptive settings (e.g., OmniDraft), provided embedding alignment and token mapping strategies are in place (Ramakrishnan et al., 3 Jul 2025).

5. Performance Metrics and Empirical Results

Critical speculative decoding metrics as defined and measured by (Goel et al., 2024, Zafrir et al., 2024):

  • Acceptance Rate $\alpha$: Fraction of drafter tokens accepted by the teacher per $\gamma$-token draft.
  • Block Efficiency $\tau$: Mean tokens advanced per draft–verify cycle (maximum possible $\gamma + 1$).
  • Memory-Bound Speed-Up (MBSU): $\mathrm{MBSU} = \frac{\tau}{c\gamma + 1}$, with $c$ the ratio of drafter to target parameter counts. Quantifies memory bandwidth–limited acceleration versus pure autoregressive decoding.
  • Token Rate / Wall-clock Speed-Up: Empirically measured tokens/sec improvement.
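Under the simplifying assumption, common in the speculative-decoding literature but not stated in this article, that each drafted token is accepted independently with probability $\alpha$, the expected block efficiency has a closed form:

```python
# Expected block efficiency assuming each drafted token is accepted
# independently with probability alpha (a standard simplification in the
# speculative-decoding literature, not a claim from this article).

def expected_block_efficiency(alpha, gamma):
    """Expected tokens advanced per draft-verify cycle (max gamma + 1)."""
    if alpha == 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

for alpha in (0.35, 0.40):
    print(alpha, round(expected_block_efficiency(alpha, gamma=3), 2))
```

Note that this independence assumption is pessimistic relative to the measured block efficiencies below, since real acceptance events are correlated within a block.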

Empirically, Llama 2 Chat Drafter 115M can achieve:

Task           γ  MBSU (×)  Max Speed-Up (×)
Dolly          3  ~2.3      2.4
CNN/DailyMail  3  ~2.4      2.2
XSum           3  ~2.1      2.0

Block sizes γ = 3–5 are standard, with acceptance rates of approximately 35–40%, yielding 1.5–2.4× real-world speed-up with no observed degradation in chat/task performance on in-distribution evaluation (Goel et al., 2024, Zafrir et al., 2024, Ramakrishnan et al., 3 Jul 2025).

6. Variants, Extensions, and Deployment Considerations

Llama 2 Chat Drafter 115M serves as the prototypical compact drafter for Llama 2 speculative decoding, but alternative workflows and extension frameworks enable broader deployment:

  • FastDraft: Provides a generalizable recipe to train draft models for any LLM with minimal hardware, emphasizing exact vocabulary/embedding compatibility, optional continued pretraining for code/text domains, and practical design for mixed-precision inference on AI-PC or edge devices (Zafrir et al., 2024).
  • OmniDraft: Addresses cross-tokenizer/vocabulary-mismatch challenges, introducing an n-gram cache for mapping drafter to teacher tokens, hybrid on-policy distillation (combining KL and NLL), and adaptive drafting for dynamic block-sizing. On-device updating enables continuous personalization and further speedup (1.5–2×), with memory requirements staying under 200MB (quantized) (Ramakrishnan et al., 3 Jul 2025).
  • In-Context Alignment: A vanilla Llama 2 115M model can serve as a chat-style drafter without fine-tuning by leveraging retrieval-augmented in-context learning from a database of a few thousand aligned prompt–response pairs (K ≈ 6–9 retrieved per prompt), yielding up to a 7× win-rate improvement over direct prompting in evaluations against text-davinci-003 for larger models (Han, 2023).

7. Limitations and Future Research Directions

Several limitations are present in the current Llama 2 Chat Drafter 115M paradigm:

  • Out-of-Distribution (OOD) Generalization: Significant performance degradation is observed on tasks such as WMT18 (translation) when synthetic distillation does not cover the task domain, indicating the necessity of targeted synthetic or human-labeled data.
  • Static Block Sizing: Fixed speculative block size γ limits optimality; adaptive drafting (as in OmniDraft) promises further efficiency.
  • Synthetic Distillation Data Dependency: Over-reliance on teacher-generated synthetic data can limit generalization; supplementing with human annotation or diverse real-world samples is an open research direction.
  • Variance Reduction and Loss Functions: Continued development of improved loss functions—potentially incorporating reward modeling or control-variates beyond advantage normalization—remains an area of active investigation.

Emerging research also suggests employing multi-drafter cascades and further compression/quantization techniques to deploy Llama 2 Chat Drafter 115M–style models on increasingly constrained hardware (Goel et al., 2024, Zafrir et al., 2024, Ramakrishnan et al., 3 Jul 2025).
