Mixed System-1/System-2 Distillation

Updated 4 July 2025
  • Mixed System-1/System-2 distillation is a methodology that integrates deliberate, resource-intensive reasoning with rapid, automatic responses to enhance AI performance.
  • It employs techniques like self-supervised output distillation, closed-loop policy training, and synthetic data pipelines to streamline complex processes in NLP, control, and vision-language systems.
  • Empirical evaluations show significant gains in accuracy and efficiency, reducing inference costs while maintaining robust performance across diverse applications.

Mixed System-1/System-2 Distillation is a set of methodologies and cognitive-science-inspired principles that distill the deliberate, resource-intensive processes of “System-2” reasoning into the direct, fast-acting mechanisms of “System-1,” enabling AI systems to combine flexibility, interpretability, and efficiency. The concept formalizes the transfer and integration between slow, analytical procedures and fast, automatic responses, and applies across cognitive architectures, neural network control systems, LLMs, and vision-language agents.

1. Foundations and Theoretical Frameworks

The distinction between System-1 and System-2 originates in psychological theory. System-1 is characterized by fast, automatic, associative, emotion-laden, non-symbolic, and largely unconscious processes that do not require explicit working memory. System-2 encompasses slow, deliberative, analytic, symbolic, effortful, and conscious functions reliant on working memory. Traditional views treat these as dichotomous, but research within the Common Model of Cognition (CMC) demonstrates that real cognitive phenomena form a spectrum with interleaved, emergent properties.

Within the CMC, cognition is structured into five core modules: Perception, Action, Working Memory (WM), Declarative Memory (DM), and Procedural Memory (PM). The procedural system is instantiated as a production system:

P_i: \text{if } C_i(\mathbf{w}) \text{ then } A_i

where $P_i$ is a production rule conditioned on working memory content $\mathbf{w}$, and $A_i$ is the resulting action. Deliberative, resource-intensive sequences of productions and memory retrievals constitute System-2, while routine, rapid production firings without explicit retrievals are viewed as System-1. The CMC asserts that System-2 operations are built atop System-1 machinery and that transitions along the spectrum are achieved through mechanisms such as production compilation:

(P_1, \ldots, P_n) \implies P^*

where a sequence of deliberative productions is compiled into a single, automatic production through repeated experience (2305.09091).
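
Production compilation can be made concrete with a toy production system. The sketch below is illustrative only: the rule representation, the working-memory store, and the compilation step are simplified assumptions, not the CMC's actual machinery.

```python
# Toy production system illustrating production compilation: a sequence of
# deliberative productions (System-2) is collapsed into a single automatic
# production (System-1). Simplified assumption, not actual CMC machinery.

from dataclasses import dataclass
from typing import Callable, Dict

WM = Dict[str, object]  # working memory as a flat key-value store

@dataclass
class Production:
    name: str
    condition: Callable[[WM], bool]  # C_i(w)
    action: Callable[[WM], None]     # A_i, mutates working memory

def fire_sequence(rules, wm: WM) -> None:
    """Deliberative (System-2) pass: fire each matching rule in order."""
    for rule in rules:
        if rule.condition(wm):
            rule.action(wm)

def compile_productions(rules, name: str = "P_star") -> Production:
    """(P_1, ..., P_n) => P*: one production that replays the whole sequence."""
    def compiled_action(wm: WM) -> None:
        for rule in rules:
            rule.action(wm)          # intermediate matching/retrieval elided
    return Production(name, rules[0].condition, compiled_action)

# Two productions that together compute a + b via an intermediate WM entry.
p1 = Production("stage_a",
                lambda wm: "a" in wm and "b" in wm,
                lambda wm: wm.update(partial=wm["a"]))
p2 = Production("add_b",
                lambda wm: "partial" in wm,
                lambda wm: wm.update(result=wm["partial"] + wm["b"]))

wm = {"a": 2, "b": 3}
fire_sequence([p1, p2], wm)          # slow route: two separate firings
print(wm["result"])                  # 5

p_star = compile_productions([p1, p2])
wm2 = {"a": 2, "b": 3}
fire_sequence([p_star], wm2)         # fast route: a single firing
print(wm2["result"])                 # 5
```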

2. Methodologies for System-2-to-System-1 Distillation

Modern AI leverages several approaches to achieve mixed System-1/System-2 distillation, with the central objective of “compiling” high-quality outputs from System-2 processes into streamlined System-1 mechanisms:

  • Self-supervised Output Distillation: LLMs using System-2 techniques (e.g., Chain-of-Thought, Rephrase and Respond, System-2 Attention) are prompted on unlabeled data, and their high-quality final answers (but not their intermediate reasoning steps) are collected. These outputs are filtered via self-consistency and input-perturbation checks to ensure correctness and robustness. The model is then fine-tuned to produce these answers directly from the input, typically with a cross-entropy loss (2407.06023); the first sketch after this list illustrates the loop.
  • Closed-loop Policy Distillation in Control: In dynamic systems (e.g., industrial control), Model Predictive Control (MPC), the gold standard for multivariate, constrained regulation, requires online optimization and full state estimation. To distill System-2 expertise, a neural network is trained offline in the loop, learning to map noisy or partial measurements directly to control actions, without running the expensive optimizations in deployment. Measurement selection may be performed manually (using domain knowledge) or automatically (via elastic net regularization). The result is a controller that delivers near-optimal performance with System-1 speed at runtime (2402.19309); the second sketch after this list outlines the procedure.
  • Synthetic Data Frameworks for Vision-LLMs: For perceptual and visual reasoning tasks, a three-stage data synthesis pipeline builds mixed reasoning skills. Stage 1 generates verifiable multiple-choice questions from dense image descriptions. Stage 2 extracts simple, familiar chain-of-thought traces from a compact VLM. Stage 3 leverages a stronger, reasoning-focused teacher model to expand these into long, elaborate thoughts. Fine-tuning on these traces, combined with preference-based objectives (e.g., Direct Preference Optimization), yields models capable of fast, vision-centric reasoning that incorporates System-2-style verification and subgoaling (2504.15362); the third sketch after this list outlines the pipeline.
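
The first bullet above describes the distillation loop at a high level; the following is a minimal sketch of it under stated assumptions. The model interface (`generate`, `finetune`), the `perturb` function, and the majority-vote threshold are placeholders, not the exact filtering criteria of (2407.06023).

```python
# Sketch of self-supervised System-2 -> System-1 output distillation.
# `model.generate`, `model.finetune`, and `perturb` are assumed interfaces.

from collections import Counter

def system2_answer(model, x, seed=0):
    """Placeholder: run a System-2 method (e.g., CoT) and return only the
    final answer, discarding the intermediate reasoning tokens."""
    return model.generate(x, method="chain_of_thought", seed=seed)

def build_distillation_set(model, unlabeled_inputs, perturb, k=8):
    targets = []
    for x in unlabeled_inputs:
        # Self-consistency filter: sample k answers, require a majority.
        votes = Counter(system2_answer(model, x, seed=s) for s in range(k))
        answer, count = votes.most_common(1)[0]
        if count <= k // 2:
            continue                      # no stable majority: drop example
        # Input-perturbation filter: the answer must survive a paraphrase.
        if system2_answer(model, perturb(x)) != answer:
            continue
        targets.append((x, answer))       # (input, direct answer), no rationale
    return targets

# Fine-tune with standard cross-entropy so the model maps input -> answer
# in a single, direct (System-1) pass:
#   model.finetune(build_distillation_set(model, pool, perturb))
```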
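
For the control setting, a minimal sketch of offline policy distillation follows. The `plant` (simulator) and `mpc` (solver) objects are hypothetical interfaces, and the network size and training details are illustrative, not those of (2402.19309).

```python
# Sketch of offline closed-loop policy distillation from an MPC expert.
# `plant` and `mpc` are hypothetical simulator/solver interfaces.

import torch
import torch.nn as nn

def collect_rollouts(plant, mpc, episodes=50, horizon=200, noise_std=0.01):
    """Roll out the MPC expert in simulation, recording noisy measurements
    and the actions the expert chose."""
    X, U = [], []
    for _ in range(episodes):
        state = plant.reset()
        for _ in range(horizon):
            u = mpc.solve(state)                    # expensive System-2 step
            y = plant.measure(state)                # possibly partial sensors
            X.append(y + noise_std * torch.randn_like(y))
            U.append(u)
            state = plant.step(state, u)
    return torch.stack(X), torch.stack(U)

def distill_policy(X, U, hidden=64, epochs=500, lr=1e-3):
    """Regress measurements -> actions; deployment then needs only one
    forward pass instead of online optimization."""
    policy = nn.Sequential(nn.Linear(X.shape[1], hidden), nn.Tanh(),
                           nn.Linear(hidden, U.shape[1]))
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(policy(X), U)
        # Automatic measurement selection could add an elastic-net
        # (L1 + L2) penalty on the first-layer weights here.
        loss.backward()
        opt.step()
    return policy   # System-1 controller
```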
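
Finally, a sketch of the three-stage data synthesis pipeline. Every model object here (`captioner`, `small_vlm`, `teacher`) is a placeholder interface, and pairing long traces as "chosen" against short ones as "rejected" is one plausible preference setup rather than necessarily the exact construction of (2504.15362).

```python
# Sketch of the three-stage synthetic data pipeline. All model objects
# (captioner, small_vlm, teacher) are placeholder interfaces.

def synthesize_examples(images, captioner, small_vlm, teacher):
    examples = []
    for img in images:
        desc = captioner.describe(img)            # dense image description
        # Stage 1: verifiable multiple-choice questions from the description.
        for question in teacher.make_mcqs(desc):
            # Stage 2: short, familiar CoT trace from a compact VLM.
            short_cot = small_vlm.reason(img, question)
            # Stage 3: a stronger teacher expands it into a long thought
            # with verification and subgoaling steps.
            long_cot = teacher.expand(img, question, short_cot)
            examples.append({"image": img, "question": question,
                             "chosen": long_cot, "rejected": short_cot})
    return examples

# One plausible training recipe: SFT on the long traces, then preference
# optimization (e.g., DPO) over (chosen, rejected) pairs:
#   model.sft(examples); model.dpo(examples)
```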

3. Empirical Performance and Metrics

Evaluation of mixed System-1/System-2 distillation utilizes accuracy, inference cost, and robustness metrics:

  • Accuracy Improvements: Distilled models often achieve or surpass System-2 performance with System-1 efficiency in tasks like logic puzzles, preference judgments, and visual question answering. For instance, in last letter concatenation tasks, distilled System-1 achieved exact match (EM) scores of 98.0%, significantly outperforming baseline System-1 (30.0%) and System-2 (44.5%) (2407.06023). In vision-centric tasks, fine-tuning with synthetic distillation data improved performance by +3.4 points on average across five benchmarks, notably +11.8 points on V^* Bench (2504.15362).
  • Inference Cost Reduction: System-2 methods incur substantial computation via multi-stage generation—often hundreds or thousands of tokens per input—while distilled System-1 policies require only a single call and minimal tokens. For example, distilled BSM (Branch-Solve-Merge) as a judge used just 4 tokens per input compared to 2117.8 tokens for its System-2 counterpart (2407.06023).
  • Robustness to Noise and Mismatch: In control system applications, manually selected minimal measurement policies demonstrate greater robustness to noise/model mismatch than automatically regularized selections, as manual choices enforce feedback-dominated control (2402.19309).
  • Limitations and Negative Cases: Where System-2 requires serial symbolic manipulations (e.g., math reasoning with Chain-of-Thought), distilled System-1 models often fail to match System-2 performance (e.g., GSM8k: 7.1% System-1 vs. 59.4% System-2 for math) (2407.06023).
| Method / Setting | Task / Domain | Baseline S1 | S2 | Distilled (S2→S1) |
|---|---|---|---|---|
| Rephrase & Respond (RaR) | Last letter concatenation | 30.0% | 44.5% | 98.0% |
| Neural policy (NN) distillation | Control objective | – | 0.0076* | 0.0087–0.0096 |
| LongPerceptualThoughts SFT+DPO | Vision (V^* Bench) | – | – | +11.8 pts |

*Closed-loop cost of the MPC expert (System-2), shown for comparison.

4. Applications Across Domains

Mixed System-1/System-2 distillation methodologies have been implemented in:

  • Natural Language Processing: LLMs for factual retrieval, reading comprehension, logic puzzles, branch-judge evaluation, and bias reduction can be efficiently distilled, reducing response length and bias while improving correctness (2407.06023).
  • Scientific and Industrial Control: Large-scale neural network controllers for nonlinear process plants (e.g., distillation columns) achieve MPC-level performance with minimal computation and strong robustness if measurement selection is handled properly (2402.19309).
  • Vision-Language Understanding: Vision-LLMs fine-tuned with mixed-reasoning synthetic datasets generalize cognitive strategies across modalities and benchmarks, even improving text-only reasoning without explicit retraining on those domains (2504.15362).
  • Metacognition and Skill Acquisition: Theoretical mapping within the CMC connects deliberate (System-2) metacognitive skill development to automatic (System-1) regulation, with procedural compilation allowing both forms of regulation to coexist (2305.09091).

5. Challenges and Practical Considerations

Despite clear advantages, several challenges persist:

  • Generalization and Overfitting: Automated distillation may overfit to nominal conditions if noise and domain variability are insufficiently modeled. Manual intervention in measurement or input selection is sometimes required for robustness (2402.19309).
  • Learnability Gap: If the target (student) model’s architecture or reasoning distribution is not properly aligned with the source (teacher model or System-2 process), performance may degrade, as with direct transfer from frontier models without prior alignment (2504.15362).
  • Limits of Distillation: Certain domains (e.g., math and code) involving deep serial reasoning exhibit resistance to effective System-2-to-System-1 compilation under current methodologies (2407.06023). This suggests that specific forms of intermediate supervision, alternative distillation targets, or hierarchical approaches might be necessary.
  • Resource Demands: Offline training for large, high-dimensional systems or high-variance tasks can remain computationally expensive, even if online inference is efficient (2402.19309).

6. Implications and Future Directions

Mixed System-1/System-2 distillation marks a convergence of cognitive architecture and modern AI, providing a principled pathway for efficient, continual learning and adaptation. The paradigm enables AI systems to adopt System-2 behaviors first as deliberate, analytic procedures and then to consolidate them as fast System-1 intuitions (“automaticity”). This cycle lets intelligent agents reserve deliberate reasoning for truly novel or complex problems while deploying efficient policies for routine tasks. A plausible implication is broader generalization and lower deployment cost for advanced reasoning in practical systems.
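
One way such a cycle could be realized operationally is a simple confidence gate that escalates only hard inputs to deliberation; the interfaces in this sketch are hypothetical.

```python
# Hypothetical confidence-gated dispatcher: routine inputs take the fast
# distilled path; low-confidence inputs escalate to System-2 deliberation.

def answer(x, system1, system2, threshold=0.9):
    y, confidence = system1.predict_with_confidence(x)   # one forward pass
    if confidence >= threshold:
        return y                                         # routine case
    return system2.deliberate(x)                         # novel/complex case
```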

Areas for further research include refining distillation targets for serial reasoning domains, advancing measurement/input selection methodologies, incorporating affective and metacognitive signals into the distillation process, and exploring the limits of continual learning cycles that alternate between deliberative skill acquisition and automatic policy refinement.

7. Summary Table: System-1 and System-2 Properties in CMC

| Aspect | System-1 (CMC) | System-2 (CMC) |
|---|---|---|
| Memory | Fast productions (PM), can invoke DM | Productions + explicit DM retrieval, complex WM use |
| Speed | Fast, low-latency | Slow, multi-step, deliberative |
| Learning | Automatic, reinforcement, compilation | Deliberate practice, metacognition |
| Effort | Continuum, can be high or low | Continuum, usually higher for complex tasks |
| Emotion | Affective tagging in production/appraisal | Preference-driven, can override emotion |
| Metacognition | Implicit, automatic monitoring | Explicit, rule-based control |

Mixed System-1/System-2 distillation, therefore, operationalizes the transition and coexistence of these cognitive modes within both natural and artificial systems, supporting robust, adaptive, and scalable intelligent behavior.