Phi-3.5 Mini Instruct: Compact On-Device AI

Updated 7 April 2026
  • Phi-3.5 Mini Instruct is a compact instruction-following language model built on a decoder-only Transformer architecture, emphasizing high-quality data filtering and alignment.
  • The model uses a two-phase data curation process that combines automated LLM quality heuristics with synthetic multi-step reasoning examples to improve accuracy and safety.
  • Optimized for on-device use, Phi-3.5 Mini Instruct relies on quantization and resource-efficient techniques, achieving competitive benchmark scores with minimal compute overhead.

Phi-3.5 Mini Instruct is a 3.8-billion-parameter generative LLM in the Phi-3.5 series, developed as an open-source, instruction-following model with a strong emphasis on data quality, alignment, and on-device deployment. It implements the decoder-only Transformer architecture and is designed to deliver capabilities approaching or exceeding those of larger models within a highly efficient parameter and compute budget. The model is a direct successor to prior Phi releases and is constructed and evaluated according to rigorous scaling, filtering, and alignment methodologies (Abdin et al., 2024).

1. Architectural Overview

Phi-3.5 Mini utilizes a 32-layer decoder-only Transformer with 32 self-attention heads per layer and a hidden dimension of 3072. Tokenization is based on Llama-2’s vocabulary, encompassing 32,064 unique tokens. Positional encoding uses Rotary Position Embeddings (RoPE). The maximum context window starts at 4,096 tokens and is extended to 128,000 tokens during training via the LongRoPE approach. Training is performed in bfloat16 precision over 3.3 trillion tokens, matching the architecture of Phi-3 Mini (Abdin et al., 2024).
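The architecture described above can be condensed into a small configuration sketch. The field names below are illustrative, not the model's official config keys:

```python
# Illustrative configuration sketch of Phi-3.5 Mini as described in the text.
# Field names are hypothetical; they do not match any official config schema.
PHI35_MINI_CONFIG = {
    "architecture": "decoder-only Transformer",
    "num_layers": 32,
    "num_attention_heads": 32,
    "hidden_size": 3072,
    "vocab_size": 32064,                # Llama-2-based tokenizer
    "positional_encoding": "RoPE",
    "base_context_window": 4096,
    "extended_context_window": 128000,  # extended via LongRoPE during training
    "training_precision": "bfloat16",
    "pretraining_tokens": 3.3e12,
}

def head_dim(cfg: dict) -> int:
    """Per-head dimension implied by the hidden size and head count."""
    return cfg["hidden_size"] // cfg["num_attention_heads"]
```

With these values, each attention head operates on a 96-dimensional slice (3072 / 32).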

2. Pretraining Corpus and Data Filtering

The pretraining process proceeds in two sequential phases. Phase 1 leverages a broad, publicly available web crawl subjected to automated scoring and "educational level" filtering via LLM heuristics to remove low-quality or excessively topical data (e.g., daily sports scores). Phase 2 uses a tighter subset of the web crawl and incorporates LLM-generated synthetic data focused on multi-step reasoning, code synthesis, and mathematics. Filters preferentially preserve conceptual and reasoning-focused material. Synthetic data is generated by larger LLMs to explicitly teach chains of thought in logic, math, and code (Abdin et al., 2024).
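The Phase-1 filtering idea can be sketched as a simple threshold on an LLM-assigned quality score. The scorer below is a toy stand-in for the paper's (unpublished) LLM heuristic, and `score_educational_value` is a hypothetical name:

```python
def score_educational_value(doc: str) -> float:
    """Toy stand-in for the LLM-based 'educational level' scorer.
    Rewards reasoning-oriented vocabulary; the real filter is an LLM judge."""
    reasoning_terms = ("therefore", "proof", "theorem", "algorithm", "derive")
    words = doc.lower().split()
    if not words:
        return 0.0
    hits = sum(w.strip(".,") in reasoning_terms for w in words)
    return min(1.0, hits / len(words) * 20)

def filter_corpus(docs, threshold=0.5):
    """Phase-1-style filter: keep only documents scored above threshold."""
    return [d for d in docs if score_educational_value(d) >= threshold]

docs = [
    "Final score: Lions 24, Bears 17.",                      # topical, low value
    "We derive the bound and therefore the proof follows.",  # reasoning-focused
]
kept = filter_corpus(docs)  # keeps only the reasoning-focused document
```

The key design point is that the filter preferentially discards transient, topical text (like sports scores) while preserving conceptual and reasoning-dense material.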

3. Instruction Tuning and Alignment Pipeline

Post-training, Phi-3.5 Mini is converted to an instruction-following model through a two-stage process:

  1. Supervised Fine Tuning (SFT): The model is trained using curated, multi-domain instruction–response pairs spanning math, coding, logical reasoning, conversational dialogue, model identification, and safety prompts (restricted to English). The prompt format introduces dedicated <|user|> and <|assistant|> tokens and adheres to a chat-style protocol. The objective is next-token cross-entropy:

\mathcal{L}_{CE} = -\sum_{t}\log p_\theta(y_t \mid y_{<t}, x)

  2. Direct Preference Optimization (DPO): The model is further trained on paired responses (“preferred” vs. “rejected”), encompassing safety-related and out-of-distribution prompts. The pairwise logistic objective of DPO (Rafailov et al., 2023) is minimized to increase the likelihood of producing preferred completions.

Alignment datasets are constructed from public human preference benchmarks (including Bai et al. 2022 and Ji et al. 2023) and an internal library of Responsible AI (RAI) scenarios. Safety alignment is enhanced through iterative adversarial red-teaming, where multi-turn adversarial prompts yield new training examples for continued DPO-fine-tuning. Automated RAI benchmarks indicate significant reductions in harmful outputs. The final model format removes unused tokens, standardizes assistant persona guidelines, and is robust to ambiguous and adversarial prompts (Abdin et al., 2024).
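For reference, the pairwise logistic objective used in DPO can be written as follows (notation following Rafailov et al., 2023, not taken from the source above: $y_w$ is the preferred response, $y_l$ the rejected one, $\pi_{\mathrm{ref}}$ the frozen SFT reference policy, and $\beta$ a temperature hyperparameter):

\mathcal{L}_{DPO}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Minimizing this loss pushes the policy to assign relatively higher likelihood to preferred completions than the reference model does, without a separate reward model.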

4. Quantitative Evaluation and Performance

Phi-3.5 Mini demonstrates competitive accuracy across academic and open-source benchmarks at a parameter scale suitable for on-device use. Notable results (reported for the 128K-context model unless otherwise specified) include:

  • MMLU (5-shot): 69.0%
  • MMLU-Pro (0-shot + CoT): 47.5%
  • HellaSwag: 69.4%
  • BoolQ (2-shot): 78.0%
  • ARC-Challenge (10-shot): 84.6%
  • GSM-8K (8-shot CoT): 86.2%
  • HumanEval (0-shot code gen): 61.5%
  • MBPP (3-shot code gen): 68.6%
  • TruthfulQA (10-shot): 64.0%
  • RepoQA (128K ctx): 77.0%
  • RULER (128K ctx): 84.1%
  • Overall average (30+ tasks): 61.1%

Comparable models such as Llama 3.1-8B, Mixtral-8x7B, Mistral-7B, and Gemma-2-9B, as well as Gemini-1.5-Flash and GPT-4o-mini, are referenced for contextual comparison. Phi-3.5 Mini matches or exceeds the performance of models up to twice its size on multiple major evaluation metrics (Abdin et al., 2024).

5. Scaling Experiments and Data Efficiency

Comprehensive scaling experiments are conducted across the Phi family—Phi-1.5 (1.5B), Phi-2 (2.7B), Phi-3-Mini (3.8B), Phi-3-Small (7B), and Phi-3-Medium (14B)—demonstrating that careful interactive filtering combined with synthetic data generation accelerates error reduction with respect to model size. The scaling law observed follows:

L(N, D) \simeq A N^{\alpha} D^{\beta}

where $N$ is model size and $D$ is dataset size. Phi-series models are observed to lie near the empirical “compute-optimal frontier” established by Kaplan et al. (2020) and Hoffmann et al. (2022) when using high-quality data. Notably, increasing model size from 2.7B to 3.8B yields substantial accuracy improvements, but scaling from 7B to 14B yields diminishing returns on some metrics, suggesting further training pipeline refinements or data curation may be required for larger regimes (Abdin et al., 2024).
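Given observed (model size, dataset size, loss) triples, the exponents of a power law of this form can be estimated by ordinary least squares in log space. The data points below are synthetic and purely illustrative (generated from assumed exponents), not measurements from the paper:

```python
import numpy as np

# Synthetic illustrative points (model size N, dataset size D, loss L),
# generated from alpha = -0.07, beta = -0.05, A = 20 -- NOT real Phi results.
N = np.array([1.5e9, 2.7e9, 3.8e9, 7e9, 14e9])
D = np.array([0.3e12, 1.4e12, 3.3e12, 4.8e12, 4.8e12])
L = 20.0 * N**-0.07 * D**-0.05

# Fit log L = log A + alpha*log N + beta*log D by least squares.
X = np.column_stack([np.ones_like(N), np.log(N), np.log(D)])
coef, *_ = np.linalg.lstsq(X, np.log(L), rcond=None)
log_A, alpha, beta = coef  # recovers the generating parameters
```

Because the synthetic losses follow the power law exactly, the regression recovers the assumed exponents; on real benchmark data the fit would of course carry residual error.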

6. Deployment and Resource Optimization

Phi-3.5 Mini is engineered for practical on-device deployment. 4-bit quantization reduces the model checkpoint size to approximately 1.8 GB. This compact footprint enables fully offline inference on consumer-grade hardware such as the iPhone 14 (A16 Bionic), where throughput exceeds 12 tokens/sec natively. No structured pruning is applied (as in SlimMoE or related methods). Instead, Phi-3.5 Mini relies on quantization and block-sparse attention to achieve both memory and compute efficiency. The model can be further fine-tuned or aligned on personal or academic hardware with minimal overhead (Abdin et al., 2024).
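The ~1.8 GB figure is consistent with simple back-of-the-envelope arithmetic for 4-bit weights; the sketch below ignores quantization metadata (per-group scales and zero-points), which adds a small overhead on top:

```python
def quantized_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate checkpoint size in decimal GB for uniformly quantized
    weights, ignoring scale/zero-point metadata."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9

# 3.8B parameters at 4 bits each -> 1.9e9 bytes = 1.9 GB (about 1.77 GiB),
# in line with the ~1.8 GB checkpoint size reported above.
size = quantized_size_gb(3.8e9, 4)
```

The same arithmetic shows why 4-bit quantization is the enabling step for phone-class deployment: the bfloat16 checkpoint would be four times larger (~7.6 GB).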

7. Position within the Phi-3.5 Model Family

Phi-3.5 Mini is the smallest variant in the Phi-3.5 lineup, which includes Phi-3.5-MoE (a 16-expert mixture-of-experts architecture, 41.9B total/6.6B active params) and Phi-3.5-Vision (a 4.2B parameter multimodal model). While Phi-3.5-MoE and its SlimMoE-compressed derivatives (e.g., Phi-mini-MoE) employ advanced pruning and staged distillation to fit resource constraints (Li et al., 23 Jun 2025), Phi-3.5 Mini achieves compactness through architectural and filter-based design from the outset. All Phi-3.5 models emphasize high leverage of filtered web and synthetic data and extensive alignment, but only the MoE-derived SlimMoE models employ expert slimming, staged knowledge distillation, and aggressive parameter reduction for ultra-compact deployment (Abdin et al., 2024, Li et al., 23 Jun 2025).

