Phi-1.5: Synthetic Data LLM

Updated 2 October 2025
  • Phi-1.5 is a 1.3B parameter Transformer-based model that leverages textbook-quality synthetic data to enhance reasoning, mathematics, and basic coding.
  • It employs 24 Transformer layers with 32 attention heads, rotary embeddings, and flash attention to efficiently process sequences of up to 2048 tokens.
  • Benchmark evaluations indicate that phi-1.5 matches or exceeds many larger models on common sense reasoning, arithmetic, and code generation while producing less toxic content.

Phi-1.5 refers to a 1.3 billion parameter Transformer-based LLM developed as an extension of the phi-1 series, designed to demonstrate the efficacy of high-quality synthetic “textbook-like” data for enhancing small model capabilities. The model achieves performance comparable to or surpassing much larger LLMs in common sense reasoning, grade-school mathematics, and basic code generation, without relying on traditional web-sourced data (Li et al., 2023).

1. Model Architecture and Configuration

Phi-1.5’s architecture is identical to phi-1. It consists of 24 Transformer layers, each employing 32 attention heads (each of dimension 64), yielding a total model size of 1.3B parameters. Notable implementation details are:

  • Rotary embeddings are used (rotary dimension 32); a generic sketch follows this list.
  • The context window supports sequences of up to 2048 tokens.
  • The tokenizer is adopted from the codegen-mono model.
  • Training employs flash attention to improve throughput and memory efficiency.
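
To make the rotary-embedding detail concrete, the following is a minimal sketch of partial rotary position embeddings in the half-split (GPT-NeoX-style) convention. It is a generic illustration using phi-1.5-like shapes, not code from the phi-1.5 release, and the exact rotation convention used by the model may differ.

```python
# Minimal sketch of partial rotary position embeddings (half-split convention):
# only the first rotary_dim = 32 of the 64 head dimensions are rotated; the
# remaining dimensions pass through unchanged.
import torch

def apply_partial_rope(x: torch.Tensor, rotary_dim: int = 32, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq_len, num_heads, head_dim)
    seq_len = x.shape[1]
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]

    # One rotation frequency per pair of rotated dimensions.
    inv_freq = 1.0 / base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]

    x1, x2 = x_rot[..., : rotary_dim // 2], x_rot[..., rotary_dim // 2:]
    rotated = torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)

# Query tensor with phi-1.5-like shapes: 2048-token sequence, 32 heads of dimension 64.
q = torch.randn(1, 2048, 32, 64)
print(apply_partial_rope(q).shape)  # torch.Size([1, 2048, 32, 64])
```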

The model configuration is summarized in the following table:

Parameter          Value        Purpose
Layers             24           Model depth
Attention heads    32           Multi-head self-attention
Head dimension     64           Dimensionality per head
Parameter count    1.3B         Model capacity
Rotary dimension   32           Rotary positional encoding
Context window     2048 tokens  Maximum input/output length
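
As a quick sanity check, the configuration above can be inspected directly from the released checkpoint. The sketch below assumes the Hugging Face Hub id microsoft/phi-1_5 and a reasonably recent transformers release (older versions may additionally require trust_remote_code=True).

```python
# Sketch: load the released phi-1.5 checkpoint and read back its configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1_5"  # assumed Hugging Face Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

cfg = model.config
print(cfg.num_hidden_layers)        # 24 Transformer layers
print(cfg.num_attention_heads)      # 32 attention heads
print(cfg.max_position_embeddings)  # 2048-token context window
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e9:.2f}B total parameters (including embeddings)")
```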

2. Training Data and Methodology

Training is conducted from random initialization using a constant learning rate of 2 × 10⁻⁴, weight decay 0.1, and the Adam optimizer (momentum parameters 0.9 and 0.98, epsilon 1 × 10⁻⁷). The model is trained entirely in fp16 precision using DeepSpeed ZeRO Stage 2, with overall batch size 2048 (micro-batch size 8). No learning rate warm-up is used.
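
The hyperparameters above map naturally onto a DeepSpeed-style configuration. The dictionary below is a hedged reconstruction from the values quoted in this section, not the authors' actual training configuration; key names follow the public DeepSpeed config schema.

```python
# Hedged reconstruction of the reported training setup as a DeepSpeed-style
# configuration dictionary (illustrative only, not the authors' config file).
deepspeed_config = {
    "train_batch_size": 2048,             # overall batch size
    "train_micro_batch_size_per_gpu": 8,  # micro-batch size per device
    "fp16": {"enabled": True},            # training entirely in fp16
    "zero_optimization": {"stage": 2},    # DeepSpeed ZeRO Stage 2
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 2e-4,                   # constant learning rate, no warm-up
            "betas": [0.9, 0.98],         # momentum parameters
            "eps": 1e-7,
            "weight_decay": 0.1,
        },
    },
}
```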

The model is trained for roughly 150 billion tokens over a corpus of about 30 billion tokens, with approximately 80% of training tokens drawn from the new synthetic data:

  • 7 billion tokens from the phi-1 training dataset, of which roughly 6 billion are filtered code tokens.
  • 20 billion newly generated synthetic “textbook-quality” tokens covering 20K carefully selected topics aimed at improving common sense reasoning, world knowledge, and theory of mind.

Model variants include:

  • phi-1.5-web-only: trained on 95B tokens of filtered web data.
  • phi-1.5-web: combined web (40%), code (20%), and synthetic data (40%).
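
As a small illustration of the phi-1.5-web mixture above, the snippet below samples training sources according to the stated 40/20/40 weights; the source names are placeholders rather than the actual dataset identifiers.

```python
# Illustrative sketch: sample training sources for the phi-1.5-web variant
# using the 40% web / 20% code / 40% synthetic mixture described above.
# Source names are placeholders, not the actual dataset identifiers.
import random

rng = random.Random(0)
mixture = {"filtered_web": 0.4, "phi1_code": 0.2, "synthetic_textbooks": 0.4}
sources, weights = zip(*mixture.items())

counts = {name: 0 for name in mixture}
for _ in range(10_000):
    counts[rng.choices(sources, weights=weights, k=1)[0]] += 1
print(counts)  # roughly proportional to the 40/20/40 mixture
```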

This methodology emphasizes that using high-quality synthetic data, with careful topic selection, can effectively compensate for model scale.

3. Performance on Benchmark Tasks

Phi-1.5 was evaluated on a spectrum of benchmarks for reasoning, knowledge, and code tasks:

  • On common sense reasoning (WinoGrande, ARC-Easy, ARC-Challenge, BoolQ, SIQA), phi-1.5 matches or exceeds models in the 7B–13B parameter range. For example, WinoGrande accuracy is 0.734.
  • For language understanding and factual knowledge (PIQA, HellaSwag, OpenBookQA, SQuAD, MMLU), phi-1.5 performs competitively with much larger models.
  • In arithmetic reasoning (GSM8K) and step-by-step multi-hop tasks, phi-1.5 displays advanced chain-of-thought capabilities.
  • Coding ability is strong: on HumanEval and MBPP, the model’s performance is comparable to phi-1 and approaches or surpasses larger models such as Llama-65B.

Performance differences across data variants indicate that filtered web data can improve reasoning tasks, but training strictly on synthetic data remains remarkably competitive.
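
The coding results above can be probed qualitatively with greedy decoding from a HumanEval-style prompt. This is a minimal sketch, not the paper's evaluation harness; the model id microsoft/phi-1_5 and the example function stub are assumptions for illustration.

```python
# Sketch: prompt phi-1.5 with a HumanEval-style function stub and decode
# a completion greedily (not the paper's evaluation pipeline).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1_5"  # assumed Hugging Face Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    "def prime_factors(n: int) -> list[int]:\n"
    '    """Return the prime factors of n in ascending order."""\n'
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```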

4. Impact of Data Curation and Synthetic Corpora

Phi-1.5’s principal innovation is its near-exclusive reliance on “textbook-quality” synthetic data for the bulk of its training. The 20B-token synthetic dataset is generated with established LLMs through prompt engineering over a broad set of topics specifically constructed to exercise common sense reasoning and theory of mind. No raw web data is included in the primary phi-1.5 model, distinguishing it from its web-trained variants and from most models in its class.

This focus on synthetic, highly-structured content confers the following properties:

  • Improved common sense reasoning and coverage of non-trivial connections (theory of mind, multi-hop inference) at a model size where such generalization was previously uncommon.
  • Reduced propensity for toxic or biased language: ToxiGen and custom prompt evaluations show that phi-1.5 emits less toxic content than comparable models such as Llama2-7B and Falcon-7B (a generic scoring sketch follows this list).
  • The absence of internet noise in the training data provides a more systematically interpretable base model for research.
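
Toxicity comparisons of this kind can be approximated outside the paper's setup by scoring model continuations with an off-the-shelf classifier. The sketch below uses the open-source Detoxify package purely as an illustrative stand-in; it is not the ToxiGen protocol or the custom prompts used in the phi-1.5 report.

```python
# Sketch: score candidate model continuations with the Detoxify classifier
# (illustrative stand-in for the toxicity evaluations described above).
from detoxify import Detoxify

completions = [
    "People from that city are wonderful neighbors and colleagues.",
    "Everyone from that city is lazy and untrustworthy.",
]

scores = Detoxify("original").predict(completions)
for text, toxicity in zip(completions, scores["toxicity"]):
    print(f"{toxicity:.3f}  {text}")
```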

5. Limitations and Model Behavior

Despite improvements, phi-1.5 is not immune to classic LLM failures:

  • It exhibits occasional hallucinations and factual inaccuracies in responses.
  • Toxic or biased generations, while reduced relative to web-data-trained models, are not completely eliminated.
  • No instruction fine-tuning or RLHF is applied, meaning out-of-the-box conversation control and safety are limited to effects of pretraining data alone.

These limitations are instructive for researchers exploring the boundaries of mitigation via pretraining data quality versus further alignment through explicit supervised fine-tuning or reinforcement methods.

6. Open Sourcing and Research Relevance

Phi-1.5 is released under an open-source license as a raw base model, without instruction fine-tuning or alignment, to facilitate research on several advanced LLM topics:

  • In-context learning: studying behavior and induction capabilities in a distilled setting.
  • Mechanistic interpretability: the compact model and curated data make structural analysis feasible with contemporary tooling (an access-pattern sketch follows this list).
  • Hallucination and bias reduction: exploration of the impact of high-quality synthetic pretraining on model safety and reliability.
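
For interpretability and in-context-learning studies of this kind, per-layer activations and attention maps are directly accessible through the standard transformers API. The snippet below is a minimal access-pattern sketch, not an analysis from the paper; the eager attention implementation is requested so that attention weights can be returned, which assumes a recent transformers version.

```python
# Sketch: expose per-layer hidden states and attention maps of phi-1.5 for
# mechanistic-interpretability style analysis.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1_5"  # assumed Hugging Face Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager")

inputs = tokenizer("Alice gives Bob the ball. Bob now has the", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, output_attentions=True)

print(len(out.hidden_states))       # 25: embedding output + 24 Transformer layers
print(out.hidden_states[-1].shape)  # (batch, seq_len, hidden_size)
print(out.attentions[0].shape)      # (batch, 32 heads, seq_len, seq_len)
```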

Its manageable size (1.3B parameters) enables rapid experimental turnaround on consumer hardware and in academic environments not equipped for frontier-scale inference.

7. Comparative Context and Future Directions

Phi-1.5 exemplifies a significant shift in LLM development strategy: prioritizing high-quality, curated synthetic data over sheer scale and unfiltered web crawls. The results demonstrate that, with deliberate topic engineering and synthetic content, small models can approach or even match the performance of much larger competitors.

The approach provides an important empirical reference for measuring the trade-offs between data quality, scaling law extrapolation, and model safety, and suggests a research agenda focused on:

  • Systematic data synthesis and its effects across modalities and languages.
  • Comparative studies between synthetic and web-derived training corpora on emergent LLM properties.
  • Extension of this line to richer alignment, dialogue, and interpretability protocols in compact models.

Phi-1.5 thus serves as a baseline for more efficient, sustainable, and controllable LLM development, supporting a broader research agenda into the scaling behavior and safety of LLMs (Li et al., 2023).

References

  • Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., & Lee, Y. T. (2023). Textbooks Are All You Need II: phi-1.5 Technical Report. arXiv:2309.05463.
