System-1/System-2 Distillation

Updated 4 July 2025
  • System-1/System-2 Distillation is a methodology that converts deliberate, multi-step reasoning into rapid, one-step inference.
  • It leverages high-quality System 2 outputs through robust filtering and self-consistency checks to train efficient System 1 models.
  • The approach reduces inference cost while enhancing accuracy, supporting applications in language and vision tasks for adaptive AI deployment.

System-1/System-2 distillation refers to a set of methodologies, primarily developed for LLMs and vision-LLMs (VLMs), that enable the "compilation" of deliberative, explicit reasoning processes (System 2) into the fast, intuitive inference pathways characteristic of System 1. This approach seeks to leverage the higher quality of System 2 outputs for supervised fine-tuning, yielding models that operate at System 1 speed but exhibit some of the advanced reasoning capacity of System 2. The paradigm has foundational implications for the efficiency, adaptability, and continual learning capabilities of AI systems.

1. Formal Definitions and Conceptual Distinctions

System 1 and System 2, terms drawn from cognitive science, are formally operationalized in the context of LLMs as follows:

  • System 1: Direct, automatic response generation. The model maps input $x$ to output $y$ in a single inference pass, e.g. $S_1(x) = p_\theta(x) \to y$, with no explicit intermediate reasoning tokens.
  • System 2: Deliberative computation, where the model generates intermediate reasoning sequences $z$ (which may include branching, searching, and iterative refinement), emitting $y$ only after substantial multi-step processing: $S_2(x; p_\theta) \to z, y$.

System 2 methods such as Chain-of-Thought, Rephrase-and-Respond (RaR), System 2 Attention, and Branch-Solve-Merge generally involve increased inference cost (multiple forward passes, extra tokens) but can significantly improve response quality over vanilla System 1 operation (2407.06023).
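
The contrast can be made concrete with a minimal sketch. The function names and prompt templates below are illustrative assumptions rather than an interface from the cited papers; `generate` stands for any prompt-in, text-out call to the same underlying model $p_\theta$.

```python
from typing import Callable

def system1(generate: Callable[[str], str], x: str) -> str:
    """System 1: one direct pass maps the input x straight to an answer y."""
    return generate(f"Question: {x}\nAnswer:")

def system2_cot(generate: Callable[[str], str], x: str) -> str:
    """System 2 (Chain-of-Thought flavour): produce intermediate reasoning z first,
    then condition a second call on it to emit y, at extra token and call cost."""
    z = generate(f"Question: {x}\nLet's think step by step:")          # reasoning tokens z
    return generate(f"Question: {x}\nReasoning: {z}\nFinal answer:")   # answer y given z
```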

2. System-2-to-System-1 Distillation: Core Methodology

The core idea of distilling System 2 into System 1 is to harvest the stronger outputs of System 2 pipelines and use them as training targets to "compile" these capabilities into the model's direct inference pathway. The distillation process, grounded in self-supervised learning, consists of several steps:

  • Data Generation: For a large set of unlabeled prompts $\mathcal{X}$, each input $x^i$ is processed using a System 2 method to obtain a high-quality answer $y_{S_2}^i = S_2(x^i; p_\theta)$. Intermediate tokens $z$ are discarded.
  • Quality Filtering: As System 2 may still err, consistency checks curate targets:
    • Self-consistency voting: Accept example $(x^i, y^*)$ only if $y^*$ is supported by a majority across $N$ System 2 samples.
    • Input perturbation: Accept only if outputs are consistent across perturbed input forms.
    • Universal Self-Consistency (USC): The LLM itself selects the most self-consistent answer.
  • Distillation Training: The model is then fine-tuned on the curated set $(\mathcal{X}_{S_2}, \mathcal{Y}_{S_2})$, minimizing the negative log-likelihood:

$$\min_{\theta'} \; \mathbb{E}_{(x, y^*)} \left[ - \log p_{\theta'}(y^* \mid x) \right]$$

The resulting parameters $\hat{\theta}$ define a System 1 model expected to produce System 2-quality outputs in a single inference step (2407.06023).
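
As a rough sketch of the data-generation and filtering steps above (not the paper's reference implementation), the following assumes a `system2` callable that runs one stochastic System 2 pipeline and returns only its final answer; `n_samples` and the majority threshold are illustrative choices.

```python
from collections import Counter
from typing import Callable, List, Tuple

def distill_targets(
    prompts: List[str],
    system2: Callable[[str], str],   # one System 2 run; intermediate reasoning z is not returned
    n_samples: int = 8,              # N samples per prompt for self-consistency voting (assumed value)
    min_agreement: float = 0.5,      # majority threshold; stricter values trade coverage for precision
) -> List[Tuple[str, str]]:
    """Build (x, y*) supervised fine-tuning targets from self-consistent System 2 outputs."""
    dataset = []
    for x in prompts:
        answers = [system2(x) for _ in range(n_samples)]     # repeated stochastic System 2 calls
        y_star, votes = Counter(answers).most_common(1)[0]   # majority-voted candidate target
        if votes / n_samples > min_agreement:                # keep only self-consistent examples
            dataset.append((x, y_star))                      # reasoning tokens are discarded
    return dataset
```

The curated pairs are then used for standard negative log-likelihood fine-tuning of the same base model, yielding the distilled System 1 parameters $\hat{\theta}$.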

3. Extensions in Perceptual and Vision Tasks

Subsequent work has expanded System-2-to-System-1 distillation to perceptual tasks, where the division between rapid, intuitive perception (System 1) and explicit, stepwise visual reasoning (System 2) is pronounced. "LongPerceptualThoughts" introduces a structured synthesis of long, deliberative reasoning traces in vision tasks and demonstrates substantial downstream gains when fine-tuning VLMs (2504.15362). The process involves:

  • Stage 1: Generating verifiable multiple-choice questions from dense image captions using a strong LLM (e.g., GPT-4o-mini).
  • Stage 2: Extracting simple chain-of-thought explanations from the target VLM.
  • Stage 3: Expanding simple traces into elaborate, multi-step CoTs using a frontier reasoning model (e.g., DeepSeek R1-Distill-Qwen-32B), with explicit cues for reflection, verification, and error-correction.

Training datasets for both SFT and DPO are built from correct answers and preferred reasoning traces, which enables preference optimization over the reasoning process itself.
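
A hedged sketch of this three-stage synthesis is shown below; the callables stand in for the question generator, the target VLM, and the frontier reasoning model named above, and their signatures are assumptions rather than the interface used in (2504.15362).

```python
from typing import Callable, Dict, List

def synthesize_long_thoughts(
    captions: List[str],
    make_mcq: Callable[[str], Dict],         # Stage 1: dense caption -> verifiable multiple-choice question
    vlm_short_cot: Callable[[Dict], str],    # Stage 2: target VLM's simple chain-of-thought explanation
    expand_cot: Callable[[Dict, str], str],  # Stage 3: frontier reasoner expands it into a long trace
) -> List[Dict]:
    """Collect (question, short CoT, long CoT) triples for SFT/DPO data construction."""
    examples = []
    for caption in captions:
        question = make_mcq(caption)                 # question with a known correct option
        short_cot = vlm_short_cot(question)          # brief, System 1-style explanation
        long_cot = expand_cot(question, short_cot)   # adds reflection, verification, error-correction cues
        examples.append({"question": question, "short_cot": short_cot, "long_cot": long_cot})
    return examples
```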

4. Mathematical Framework

System 2 procedures are interpreted as operators $S_2(x; p_\theta) \to z, y$, where $z$ is the reasoning sequence. The objective is to train a System 1 mapping

$$S_1'(x) = \hat{p}_\theta(x) \approx S_2(x; p_\theta)$$

such that, for the distribution $\mathcal{X}$,

$$\min_{\theta'} \; \mathbb{E}_{x \sim \mathcal{X}} \left[ -\log p_{\theta'}(y_{S_2}(x) \mid x) \right]$$

This classic distillation approach is distinguished by the teacher and student sharing the same base architecture, with the "teaching" delivered via enhanced inference rather than model size.

Preference-based optimization, such as DPO, further enforces that not only the correct answer but also the correct reasoning process is prioritized: $(q, y_1, a_1^+) \succ (q, y_1, a_1^-)$ (2504.15362).
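
For concreteness, a minimal sketch of the standard DPO objective is given below; it restates the generic DPO loss rather than anything specific to (2504.15362), and the sequence-level log-probabilities are assumed to come from the policy being tuned and a frozen reference model.

```python
import math

def dpo_loss(
    logp_pos: float, logp_neg: float,          # policy log-probs of preferred / dispreferred responses
    ref_logp_pos: float, ref_logp_neg: float,  # frozen reference-model log-probs of the same responses
    beta: float = 0.1,                         # strength of the implicit KL regularization (assumed value)
) -> float:
    """Standard DPO loss: prefer the response with the larger policy/reference log-ratio margin."""
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid(beta * margin)
```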

5. Empirical Effects: Cost, Performance, and Generalization

Distillation of System 2 into System 1 leads to marked reductions in inference cost:

  • System 2 inference can require multiple model calls and long outputs, with up to 100× more tokens processed.
  • After distillation, a single model call suffices, at standard System 1 cost (e.g., RaR task: System 2 requires two calls and 41–112 output tokens; distilled System 1 needs one call and 25 tokens).

Performance improvements are documented across tasks:

| Model | Accuracy (Symbolic Task) | Tokens Generated |
|---------------------------|--------|------|
| System 1 (Vanilla)        | 30.0%  | 27.1 |
| System 2 (2-Step RaR)     | 44.5%  | 41.5 |
| Distilled System 2 (RaR)  | 98.0%  | 25.5 |

Notably, in vision tasks, LongPerceptualThoughts-tuned models achieve a +3.4 point average improvement over base on five benchmarks, including +11.8 on V^* Bench, with rare positive transfer to text-only reasoning (+2 on MMLU-Pro) (2504.15362).

Some tasks (notably, complex compositional math) remain resistant to System 1 distillation, indicating an upper bound on which System 2 abilities can be "compiled."

6. Applications and Continuous Adaptation

System-2-to-System-1 distillation underpins several practical uses:

  • Efficient deployment: Advanced reasoning becomes economical for latency-sensitive LLM applications.
  • Bias mitigation and automatic evaluation: Distilled models inherit robust self-correction and evaluation abilities from System 2 techniques.
  • Instruction following: Enhanced symbolic reasoning and interpretation via distilled protocols.

For continual learning, the paradigm supports periodic re-distillation as new reasoning challenges arise, mirroring the human process of proceduralizing practiced deliberations for automatic retrieval. As a result, AI systems can preserve computational resources for genuinely novel or complex reasoning, while updating their System 1 core as additional System 2 routines are mastered.

7. Limitations and Research Trajectories

Key limitations include:

  • Certain forms of reasoning (lengthy, highly compositional, or inherently symbolic chains) resist distillation from System 2 to System 1.
  • The success of distillation depends critically on the quality of teacher outputs and filtering (e.g., robust self-consistency or input perturbation methods).
  • Open research questions remain regarding generalization: the transferability of distilled reasoning skills to new domains or task types.

Prospective directions focus on adaptive thinking trace lengths, improved process verifiers for perception tasks, expansion to more diverse domains, the adoption of hierarchical reasoning traces, and the integration of online/reinforcement learning signals for continual adaptation (2504.15362).


System-1/System-2 distillation constitutes a significant methodological advance in compiling deliberative, resource-intensive reasoning strategies into the rapid, efficient inference engines of modern AI. It further provides a principled path toward continual, resource-adaptive learning, with broad implications for deploying competent, generalist AI systems in varied, cost-sensitive environments.
