Emergent Cognitive Capacity in LLMs
- Emergent cognitive capacity in LLMs is the spontaneous development of advanced reasoning, abstraction, and collaborative abilities without explicit programming.
- Empirical studies show that techniques like analogical reasoning and multi-persona collaboration enable LLMs, such as GPT-4, to approximate human cognitive functions under specific conditions.
- Evaluations based on scaling laws and cognitive benchmarks reveal discontinuous performance improvements, highlighting current limitations in structured planning and multi-step reasoning.
Emergent cognitive capacity in LLMs refers to the spontaneous development of human-comparable reasoning and abstraction abilities in generically trained neural sequence models, without explicit architectural or algorithmic encoding of those capacities. This phenomenon is increasingly documented across diverse task domains, ranging from analogical and deductive reasoning to collaborative problem-solving, expressive capability, and decision-making. Assessments reveal that, under specific conditions or at sufficient scale, LLMs develop cognitive signatures characteristic of human intelligence, motivating both reevaluation of the symbolic–statistical divide in artificial intelligence and the search for mechanistic explanations of these effects.
1. Emergence of Analogical and Abstract Reasoning
Emergent analogical reasoning in LLMs is a salient property reported in comparative studies with human participants. Tasks such as text-based matrix reasoning (Digit Matrices, modeled after Raven's Standard Progressive Matrices), letter string analogies, and four-term analogical problems demonstrate that models like GPT-3 (text-davinci-003) and GPT-4 can induce and apply abstract relational rules in a zero-shot setting, that is, without being directly trained on those problem formats. In Digit Matrices, for example, the model is shown an incomplete 3 × 3 grid of digit patterns and must infer rules such as constant, distribution, progression, or logical operators applied to rows or columns in order to complete the missing cell. Quantitative analyses indicate that in generative and multiple-choice forms, LLMs can match or exceed mean human accuracy; for instance, in multiple-choice Digit Matrices, GPT-3 achieved an odds ratio of 6.27 (p = 2.3 × 10⁻⁸) relative to human participants. Error patterns (e.g., drops on progression or multi-rule problems) correlate positively with human data, and preliminary testing of GPT-4 indicates further gains, especially for nuanced analogies requiring distant abstraction (Webb et al., 2022).
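To make the task format concrete, the following Python sketch builds one text-based matrix item with a simple progression rule and a zero-shot prompt. The item layout, prompt wording, and the commented-out model call are illustrative assumptions, not the exact materials of Webb et al.

```python
# Minimal sketch of a text-based Digit Matrices item (progression rule).
# The prompt wording and problem layout are illustrative assumptions.

def make_progression_item(start: int = 1, step: int = 1) -> tuple[str, int]:
    """Build a 3x3 digit matrix whose rows share a progression rule; the last cell is blank."""
    rows = [[start + (r + c) * step for c in range(3)] for r in range(3)]
    answer = rows[2][2]
    rows[2][2] = None  # the cell the model must complete
    body = "\n".join(
        "[ " + "  ".join("?" if x is None else str(x) for x in row) + " ]"
        for row in rows
    )
    prompt = (
        "Complete the pattern. Each row follows the same rule.\n"
        f"{body}\n"
        "Answer with the missing number only."
    )
    return prompt, answer

prompt, answer = make_progression_item(start=1, step=2)
print(prompt)            # zero-shot prompt that would be sent to the model
print("target:", answer)
# model_reply = call_llm(prompt)  # hypothetical LLM call, not part of the published setup
```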
These findings challenge the view that analogical reasoning requires explicit symbolic variable binding, suggesting that neural attention mechanisms can encode and manipulate relational structures. However, limitations are noted in tasks demanding deep causal mapping, where human strategic reasoning still surpasses current models.
2. Multi-Persona Collaboration and Cognitive Synergy
Recent research extends the emergent cognition hypothesis by demonstrating that LLMs can simulate cognitive synergy, an ability analogous to collaborative intelligence in humans, via internal self-collaboration protocols such as Solo Performance Prompting (SPP). In SPP, a single model generates and manages multiple internal personas, each contributing domain-specific expertise or perspectives on a given task (e.g., a Film Expert, a Math Expert, and an AI Assistant leader). Through iterative rounds of persona identification, brainstorming, and feedback, the model converges on a consensus solution.
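The following minimal sketch illustrates the single-prompt structure of such self-collaboration: one model is instructed to spawn personas, brainstorm, exchange feedback, and converge. The template wording and the `call_llm` placeholder are assumptions for illustration, not the published SPP protocol text.

```python
# Sketch of Solo Performance Prompting (SPP): a single model is asked to spawn
# personas, brainstorm, critique, and converge within one prompt.
# The template and `call_llm` helper are illustrative assumptions.

SPP_TEMPLATE = """You are an AI Assistant who can collaborate with multiple experts.
Task: {task}

Step 1 (Persona Identification): list the personas needed for this task.
Step 2 (Brainstorming): let each persona propose ideas or partial answers.
Step 3 (Feedback): let the personas critique and correct one another.
Step 4 (Consensus): as the AI Assistant, state the final answer.
"""

def solo_performance_prompting(task: str, call_llm) -> str:
    """Run SPP as one self-contained prompt; `call_llm` maps a prompt string to model text."""
    return call_llm(SPP_TEMPLATE.format(task=task))

# Usage with a stub model (replace with a real API call):
stub = lambda p: "(model output would appear here)"
print(solo_performance_prompting(
    "Write a trivia-rich short story about the 1969 Moon landing.", stub))
```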
Experimental evaluation shows that SPP yields significant performance improvements (e.g., +7–10% in Trivia Creative Writing; +18.5% in Logic Grid Puzzle) compared to conventional prompting or chain-of-thought. Notably, these emergent group-cognitive behaviors are only observed in the most advanced models, such as GPT-4, indicating a threshold effect in model scale or reasoning depth (Wang et al., 2023).
Beyond improved reasoning, SPP is shown to reduce factual hallucination by enabling disagreeing personas to self-correct errors during the reasoning process—a phenomenon without direct parallel in earlier, non-collaborative LLMs.
3. Functional and Representational Limits in Emergent Planning
Systematic evaluations reveal that while some cognitive abilities, such as pattern induction or analogical mapping, emerge robustly, others remain non-emergent or only partial. Under the CogEval protocol, which applies planning and cognitive-map tasks derived from cognitive science, even state-of-the-art LLMs (GPT-4, GPT-3.5, Bard, Claude, LLaMA-13B, etc.) show competence in simple planning scenarios but fail in multi-step or structurally demanding tasks. Key failure modes include hallucinated or invalid graph edges, sub-optimal path selection, and looping in dense graph environments. Statistical analyses indicate that structural and task-condition factors, rather than stochastic generation parameters, are the principal drivers of LLM performance (significance tests confirm that temperature has no reliable effect) (Momennejad et al., 2023).
These systematic failures suggest that the internal representations underpinning emergent reasoning are insufficiently aligned with graph-like cognitive maps necessary for flexible planning. Augmentation with explicit symbolic planning modules or memory systems is suggested as a future research direction.
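To make the reported failure modes concrete, the sketch below checks a model-proposed route through a small toy graph for hallucinated edges and sub-optimal paths. The graph, route format, and validation logic are illustrative assumptions, not the CogEval task materials.

```python
# Illustrative check in the spirit of CogEval (Momennejad et al., 2023):
# verify that a model's proposed route uses only real edges and is shortest.
# The toy graph and parsing convention here are assumptions for demonstration.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D")])

def validate_route(route: list[str]) -> dict:
    """Flag hallucinated edges and sub-optimal paths in a model's answer."""
    edges_ok = all(G.has_edge(u, v) for u, v in zip(route, route[1:]))
    optimal_len = nx.shortest_path_length(G, route[0], route[-1]) if edges_ok else None
    return {
        "valid_edges": edges_ok,                         # no hallucinated connections
        "optimal": edges_ok and len(route) - 1 == optimal_len,
    }

# A-B-C-D uses valid edges but is sub-optimal (A-E-D is shorter), mirroring
# the sub-optimal path selection described above.
print(validate_route(["A", "B", "C", "D"]))
print(validate_route(["A", "E", "D"]))
```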
4. Mechanistic and Theoretical Perspectives on Emergence
Emergence in LLMs is increasingly conceptualized through the lens of dynamical systems, scaling laws, and phase transitions. Empirical data and theoretical analysis show that performance on cognitive tasks follows power-law scaling with model size (N), data corpus size (D), and compute budget (C), of the general form L(X) ∝ X^(−α_X) for X ∈ {N, D, C}, where L denotes test loss and α_X the empirically fitted exponent for each resource.
However, certain cognitive capacities appear to arise discontinuously—analogous to phase transitions or "grokking" phenomena—rather than via smooth extrapolation. These sharp transitions, not predictable from micro-level neuron behavior, are attributed to the cooperative, nonlinear, and stochastic dynamics of deep neural networks. The paper (Havlík, 6 Aug 2025) positions these properties within the broader context of emergence in complex natural systems, contending that LLM cognitive capacity is an ontologically novel property of the network, irreducible to individual component interactions.
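The contrast between smooth scaling and abrupt emergence can be illustrated numerically. In the toy sketch below, loss follows a power law (constants taken from the commonly cited Kaplan et al. fit), while task accuracy is modeled as a sharp logistic jump; the accuracy curve and its threshold are illustrative inventions, not fitted to any published benchmark.

```python
# Toy contrast between smooth power-law loss scaling and a sharp, threshold-like
# jump in task accuracy. The accuracy curve is an illustrative invention.
import numpy as np

n_params = np.logspace(6, 11, 50)                 # model sizes from 1e6 to 1e11

loss = (8.8e13 / n_params) ** 0.076               # Kaplan-style power law L(N) = (Nc/N)^alpha
accuracy = 1.0 / (1.0 + np.exp(-4.0 * (np.log10(n_params) - 9.5)))  # abrupt "emergence"

for n, l, a in zip(n_params[::10], loss[::10], accuracy[::10]):
    print(f"N={n:.1e}  loss={l:.3f}  task_acc={a:.2f}")
```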
Frameworks relating episodic and semantic memory in LLMs to Tulving’s cognitive models (e.g., the synergistic ecphory model) instantiate emergence in terms of information-theoretic thresholds, where emergent abilities surface once the product of input context and stored knowledge crosses the requisite boundaries (Li et al., 4 Jan 2024).
5. Cognitive Evaluation, Benchmarking, and Human Comparisons
Quantitative frameworks such as CogLM (based on Piaget’s Theory of Cognitive Development), MultiCogEval (medical domain, inspired by Bloom’s Taxonomy), and CognitivEval (robust pipeline using prompt permutations and dual metrics) systematically map LLM cognitive growth to developmental analogues. CogLM, using a suite of 1,220 expert-crafted questions spanning ten cognitive abilities, finds that advanced models (GPT-4) reach cognitive maturity comparable to that of a 20-year-old human, with scores validated against human populations (Spearman 0.72, Pearson 0.74) (Wang et al., 17 Aug 2024).
Key patterns observed include: (i) a close mapping between model parameter scale and cognitive ability emergence, (ii) a gap between pattern induction and higher-level planning or scenario-based reasoning, (iii) strong correlation between cognitive benchmarks and downstream task performance, and (iv) marginal improvements from post-hoc reasoning-boosting techniques on intrinsic cognitive ability. Medical-domain analyses corroborate these patterns, indicating a precipitous performance drop at higher cognitive levels and highlighting the need for architectural innovations and dynamic inference strategies for scenario-based reasoning (Zhou et al., 10 Jun 2025).
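Benchmark validation of this kind typically reduces to correlating model-derived ability scores with human reference data. The sketch below shows the computation with invented numbers standing in for per-ability scores; it does not reproduce the reported CogLM correlations.

```python
# How validation of a cognitive benchmark against human data is typically
# quantified: rank and linear correlation between model scores and human
# reference scores. The arrays below are invented illustrative data.
from scipy.stats import pearsonr, spearmanr

human_scores = [0.42, 0.55, 0.61, 0.70, 0.78, 0.83, 0.90]   # per-ability human means
model_scores = [0.38, 0.60, 0.57, 0.74, 0.75, 0.88, 0.93]   # per-ability LLM means

rho, rho_p = spearmanr(human_scores, model_scores)
r, r_p = pearsonr(human_scores, model_scores)
print(f"Spearman rho={rho:.2f} (p={rho_p:.3f}), Pearson r={r:.2f} (p={r_p:.3f})")
```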
6. Hierarchical Structure and Decoupling of Knowledge and Reasoning
Emergent cognitive capacity in LLMs is shown to be hierarchically organized. Using analysis inspired by dual-system cognitive theory, cognition is decoupled into knowledge retrieval ("fast thinking") and reasoning adjustment ("slow thinking"). Empirical methods prompting LLMs under both modes expose the layer-wise split: knowledge retrieval is localized primarily in lower network layers, while reasoning adjustment and multi-step inferential abilities depend on higher layers (Yang et al., 24 Jul 2025).
Scaling up parameters bolsters both knowledge and reasoning, but the gains from scale are more pronounced for knowledge retrieval. With larger models, the overthinking (“noise from excessive reasoning”) characteristic of small LLMs is diminished, and reasoning adjustments become more prudent.
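One common way to probe such layer-wise localization empirically is a "logit lens"-style analysis: each layer's hidden state is decoded through the output head to see at what depth a factual completion becomes dominant. The sketch below applies this to GPT-2 purely for illustration; it is a stand-in for, not a reproduction of, the dual-mode prompting analysis of Yang et al.

```python
# Hedged "logit lens"-style probe: decode each layer's hidden state through the
# output head and observe at which depth the factual completion becomes top-ranked.
# Illustrative only; uses GPT-2 rather than the models analyzed in the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states: tuple of (n_layers + 1) tensors of shape [1, seq_len, d_model]
for layer, h in enumerate(out.hidden_states):
    h_last = model.transformer.ln_f(h[:, -1, :])   # final layer norm, last-token state
    logits = model.lm_head(h_last)                 # project to vocabulary
    top = tok.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d}: top prediction = {top!r}")
```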
7. Educational and Collaborative Applications, Limitations, and Future Directions
Applications in education, collaborative planning, and decision-making increasingly leverage (but are constrained by) emergent cognitive capacity. Benchmarks such as ZPD-SCA, which test alignment of reading material to student developmental stages, reveal that LLMs—while capable of nearly doubling assessment accuracy under in-context prompting—still demonstrate directional biases and genre-specific weaknesses. Fine-tuning with educational data mitigates (but does not fully resolve) these limitations (Dong et al., 20 Aug 2025).
In collaborative problem-solving, frameworks such as CoThinker, inspired by Cognitive Load Theory, distribute cognitive burden across specialized model agents, employing a transactive memory system and structured inter-agent communication to assemble collective cognition and overcome individual working-memory limits (Shang et al., 7 Jun 2025). Such architectures echo findings from multi-persona prompting (SPP) and further support the thesis that some emergent capacities are best operationalized through explicit system design at the group or meta-cognitive level.
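As a schematic illustration of the load-distribution idea, the sketch below routes slices of a task to role-specialized agents that write to a shared transactive-memory store before a final integration step. The roles, memory layout, and `call_llm` placeholder are assumptions for illustration, not the CoThinker implementation.

```python
# Schematic sketch of distributing cognitive load across specialized agents with
# a shared "transactive memory" store. Roles and memory layout are assumptions.
from dataclasses import dataclass, field

@dataclass
class TransactiveMemory:
    """Shared store mapping a topic to the responsible agent and its notes."""
    entries: dict = field(default_factory=dict)

    def write(self, topic: str, owner: str, note: str) -> None:
        self.entries[topic] = {"owner": owner, "note": note}

    def read_all(self) -> str:
        return "\n".join(f"[{t} / {e['owner']}] {e['note']}"
                         for t, e in self.entries.items())

def cothinker_round(task: str, roles: list[str], call_llm) -> str:
    memory = TransactiveMemory()
    for role in roles:                       # each agent handles one slice of the task
        note = call_llm(f"As the {role}, work on your part of: {task}")
        memory.write(topic=role, owner=role, note=note)
    # a final integrator agent assembles the shared memory into one answer
    return call_llm(f"Integrate these contributions into a single plan:\n{memory.read_all()}")

# Usage with a stub model (replace with a real API call):
stub = lambda p: f"(output for: {p[:40]}...)"
print(cothinker_round("Design a city evacuation plan.",
                      ["Logistics Expert", "Medic", "Mapper"], stub))
```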
Proposals for future research commonly emphasize: (i) investigating scaling thresholds and phase transitions; (ii) explicating and improving internal cognitive representations, especially for planning and memory; (iii) enhancing model architectures with grounded, multi-level symbolic scaffolds; and (iv) continuously benchmarking LLMs against human developmental and cognitive metrics.
In summary, emergent cognitive capacity in LLMs arises through the confluence of neural scaling, attention-driven representation learning, context-directed extrapolation, and (in advanced settings) internal collaborative protocols. While models surpass human benchmarks in certain domains—especially abstract and analogical reasoning—the boundary of human-comparable cognition remains demarcated by failures in planning, scenario-based reasoning, and context-sensitive decision-making. Theoretical, empirical, and architectural analyses converge on the necessity of integrating cognitive science paradigms—ranging from developmental psychology to memory theory and collective cognition principles—into future LLM development and evaluation.