Evolutionary System 2 Reasoning: An Empirical Proof (2512.05760v1)

Published 5 Dec 2025 in cs.AI

Abstract: Machine intelligence marks the ultimate dream of making machines' intelligence comparable to human beings. While recent progress in LLMs shows substantial specific skills across a wide array of downstream tasks, they more or less fall short in general intelligence. Following the correlation between intelligence and System 2 reasoning (slow thinking), in this paper we aim to answer a worthwhile research question: could machine intelligence such as LLMs be evolved to acquire reasoning ability (not a specific skill), just like human beings? To this end, we propose the evolutionary reasoning optimization (ERO) framework, which performs survival of the fittest over a population of LLMs to search for an individual with strong reasoning ability. Given a reasoning task, ERO first initializes multiple LLMs as a population, after which an evolutionary strategy evolves the population to maximize the quantified reasoning score of the best individual. Based on experiments on representative test suites, we claim two surprising empirical discoveries: i) the latest LLMs such as GPT-5 still show limited System 2 reasoning ability; ii) with the simple evolution loop of ERO, a relatively weak model (Qwen-7B) can be enhanced to exhibit powerful emergent reasoning ability. Our project can be accessed at https://github.com/MetaEvo/ERO for reproduction needs.

Summary

  • The paper presents the ERO framework that leverages evolutionary algorithms and layer-wise covariance to enhance System 2 reasoning in LLMs.
  • It demonstrates that an evolved Qwen-7B model outperforms larger and more heavily trained models, including GPT-5, in pass@1 on a majority of the sampled ARC tasks.
  • The study challenges traditional scaling laws by showing that targeted evolutionary adaptations, rather than increased model size, yield significant gains in reasoning performance.

Evolutionary System 2 Reasoning in LLMs: Analysis of the ERO Framework

Motivation and Problem Formulation

LLMs have demonstrated utility across a multitude of domains; however, current empirical evidence consistently reveals a pronounced gap in System 2 reasoning, i.e., deliberate, compositional, and logical cognitive processing. Leading models such as GPT-5 remain well below human performance on benchmarks like the Abstraction and Reasoning Corpus (ARC), saturating at roughly 50% versus 100% for humans. Central to this gap is the divergence between human intelligence, shaped by open-ended evolution under survival-of-the-fittest pressure, and LLMs, which are confined to fixed, task-specific pretraining regimens.

The paper advances the hypothesis that, by leveraging evolutionary algorithms reminiscent of neuroevolution, LLM parameters can be adaptively optimized under selective pressure to explicitly maximize reasoning performance. The principal research question is thus: can a competitive evolutionary strategy over LLM populations confer emergent System 2 reasoning, and under what operational constraints does this become tractable (Figure 1)?

Figure 1: A prototypical ARC reasoning task that targets compositional system-level inference, forming the basis of evaluation for System 2 capability.

The Evolutionary Reasoning Optimization (ERO) Framework

The Evolutionary Reasoning Optimization (ERO) framework adapts multipopulation evolutionary strategies for LLM reasoning enhancement. LLM parameters constitute the genotype; reasoning tasks correspond to the selective environment. The workflow is characterized by the following technical constructs:

  • Layer-Wise Covariance Initialization: Instead of naïve isotropic or fixed covariances, ERO computes layer-specific covariance using the initial parameter statistics, effectively controlling search granularity and variance on a per-layer basis. This addresses the heterogeneity in parameter scale across architectures of LLMs and avoids degenerate exploration.
  • Population and Island Model: Populations of LLMs are instantiated in parallel ("islands"), each independently sampling candidates from a Gaussian centered at the current elite mean with fixed covariance. Periodic aggregation of elites across islands promotes both intra- and inter-population diversity.
  • Objective and Scoring: For each reasoning task τ, performance is quantified using a generalized scoring function, exemplified on ARC by normalized Levenshtein string distances over structured output arrays. This permits gradient-free, black-box optimization agnostic to the internal LLM structure (a minimal sketch of the scoring and evolution loop follows this list).
  • Efficiency Mechanisms: Key bottlenecks (model loading, evaluation throughput) are mitigated via on-the-fly cache strategies and Ray-based scheduler parallelism, enabling practical execution on constrained multi-GPU setups (Figure 2).

    Figure 2: Illustration of standardized prompting used in evaluation to maintain baseline parity during ERO and ablation studies.
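
The following is a minimal, illustrative sketch of how such an evolution loop could be assembled; it is not a reproduction of the released ERO code. It assumes a model is represented as a dictionary of per-layer NumPy parameter arrays, derives per-layer step sizes from initial parameter statistics, samples island populations from Gaussians centered at each island's elite mean, and scores candidates with a normalized-Levenshtein fitness. The generate_answer callable is a hypothetical stand-in for loading the perturbed weights into the LLM and decoding an answer string.

```python
# Minimal sketch of an ERO-style evolution loop (illustrative, not the authors' code).
import numpy as np

def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer string's length (0 = identical)."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1] / max(len(a), len(b))

def reasoning_score(params, task, generate_answer) -> float:
    """Fitness: 1 minus the normalized Levenshtein distance to the target output."""
    prediction = generate_answer(params, task["input"])
    return 1.0 - normalized_levenshtein(prediction, task["target"])

def layerwise_sigma(init_params, scale=0.01):
    """Per-layer step sizes derived from initial parameter statistics
    (here simply a fraction of each layer's standard deviation)."""
    return {name: scale * (w.std() + 1e-8) for name, w in init_params.items()}

def evolve(init_params, task, generate_answer,
           n_islands=4, pop_size=8, generations=20, migrate_every=5, seed=0):
    rng = np.random.default_rng(seed)
    sigma = layerwise_sigma(init_params)
    # Each island keeps its own elite mean, initialized at the base model.
    islands = [{k: v.copy() for k, v in init_params.items()} for _ in range(n_islands)]
    best = init_params
    best_score = reasoning_score(init_params, task, generate_answer)

    for gen in range(generations):
        for isl in range(n_islands):
            mean = islands[isl]
            # Sample candidates from a Gaussian around the island's elite mean,
            # with layer-wise (diagonal) covariance fixed from initialization.
            candidates = []
            for _ in range(pop_size):
                cand = {k: v + sigma[k] * rng.standard_normal(v.shape)
                        for k, v in mean.items()}
                candidates.append((reasoning_score(cand, task, generate_answer), cand))
            score, elite = max(candidates, key=lambda t: t[0])
            if score > best_score:
                best, best_score = elite, score
            islands[isl] = elite  # survival of the fittest within the island
        if (gen + 1) % migrate_every == 0:
            # Simplified migration: broadcast the global best to all islands.
            islands = [{k: v.copy() for k, v in best.items()} for _ in range(n_islands)]
    return best, best_score
```

In practice the generate_answer step (loading perturbed weights and decoding) dominates runtime; that is precisely the bottleneck the paper's on-the-fly caching and Ray-based parallelism are designed to address.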

Empirical Results

Experiments focus primarily on 15 sampled ARC tasks, explicitly chosen to span innate core-knowledge domains (object cohesion, persistence, number, geometry, etc.) following a cognitive-science-informed categorization. The principal model under evolution is Qwen-7B; comparisons are made to both larger (Qwen-32B, GPT-4o) and state-of-the-art (GPT-5) LLMs.

  • Performance Trajectories: ERO systematically amplifies the reasoning performance of initially weak models. Across generations, mean pass@1 scores for the evolved Qwen-7B model surpass those of considerably larger and more heavily trained (pretrained and RLHF-refined) competitors (Figure 3; see the pass@1 sketch after this list).

    Figure 3: Evolution curves tracking per-generation average ARC pass@1, benchmarking ERO-evolved Qwen-7B against baseline SOTA models.

  • Task-Wise Comparison: In 8/15 tasks, the ERO-optimized Qwen-7B outperforms GPT-5. Gains appear on instances tapping varied cognitive priors, indicating generalization across logic, arithmetic, and visuospatial domains rather than overfitting to isolated benchmark idiosyncrasies (Figure 4).

    Figure 4: Qualitative showcases, highlighting before-and-after reasoning trajectories for select ARC instances where ERO confers emergent compositional reasoning.

  • Scaling Law Contradiction: Results strongly challenge the scaling hypothesis for reasoning ability. Larger models (Qwen-32B, GPT-5) do not universally surpass their smaller counterparts. Instead, post-hoc mutation and selection, rather than model-size expansion, yield the larger gains, indicating that System 2 reasoning is not linearly correlated with parameter count or corpus breadth.
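
As a small companion to the curves above, and assuming pass@1 here denotes the fraction of sampled tasks whose first decoded answer exactly matches the target output (the usual convention for ARC-style grids), the per-generation metric could be aggregated roughly as follows; generate_answer is the same hypothetical decoding stand-in used in the earlier sketch.

```python
# Hypothetical pass@1 aggregation over the sampled ARC tasks: a task counts as
# solved only if the first decoded answer exactly matches the target output.
def pass_at_1(tasks, params, generate_answer) -> float:
    solved = sum(generate_answer(params, t["input"]) == t["target"] for t in tasks)
    return solved / len(tasks)
```

Note the contrast with the soft Levenshtein-based fitness used during evolution: the fitness rewards partial matches and thus gives the search something to climb, whereas pass@1 only credits exact solutions.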

Implications and Theoretical Considerations

ERO’s core finding is the empirical decoupling of model scale and reasoning performance, instead showing that evolutionary selective pressures can unlock latent compositional abilities within existing model architectures. This reframes the debate on LLM improvement, extending optimization beyond pretraining and instruction tuning into domains of black-box, objective-driven parameter search.

Practically, this translates to feasible post-training enhancement pipelines that are compatible with limited hardware; the evolutionary loop, cache optimization, and parallelism afford tractable exploration of the massive LLM parameter space. Theoretically, ERO introduces a new axis for research on machine intelligence: meta-evolution over reasoning task distributions rather than per-task adaptation, gesturing toward continual, open-ended adaptation reminiscent of human evolution.

Future research directions include scaling ERO to meta-optimization, i.e., averaging reasoning scores over a distribution of diverse reasoning environments (sketched below), which would better capture the generalization sought in general intelligence studies and systematically address the compositionality bottleneck prevailing in current LLMs.
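
One way to make that distinction concrete, with θ denoting LLM parameters, τ a reasoning task, s(θ; τ) the quantified reasoning score, and D a distribution over reasoning environments (all notation assumed here rather than taken from the paper), is to contrast the current per-task objective with a hypothetical meta-evolution objective:

```latex
% Per-task ERO objective (current) vs. a hypothetical meta-evolution objective.
\theta^{\star}_{\tau} = \arg\max_{\theta} \, s(\theta; \tau)
\qquad \text{vs.} \qquad
\theta^{\star} = \arg\max_{\theta} \, \mathbb{E}_{\tau \sim \mathcal{D}} \left[ s(\theta; \tau) \right]
```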

Conclusion

This work operationalizes evolutionary principles for LLM reasoning enhancement, establishing that substantial gains in System 2 reasoning can be achieved by survival-driven optimization in the parameter space, rather than expanding corpus or model size. ERO offers a pragmatic and theoretically sound framework, with clear evidence that evolutionary computation bridges the gap between connectionist architectures and the adaptive, selective processes underpinning human intelligence. These insights motivate the integration of evolutionary dynamics into future LLM design for robust, compositional reasoning.
