Emergent Capabilities in Language Models

Updated 20 March 2026

Emergent capabilities are qualitatively new abilities that arise abruptly in large-scale language models when critical thresholds in size or loss are reached.
Empirical measurements based on loss thresholds and scaling laws reveal distinct performance jumps and phase transitions, offering concrete benchmarks for evaluation.
Underlying mechanisms involve sparse Bayesian inference and nonlinear dynamical systems, bridging classical symbolic approaches with subsymbolic neural architectures.

Emergent capabilities in large language models (LLMs) are qualitatively new abilities that appear abruptly at certain points during model scaling or training and that cannot be predicted by extrapolating from smaller models. These abilities span reasoning, abstraction, analogical inference, coding, autonomous tool use, prosody perception, and more. This entry synthesizes the technical definitions, underlying mechanisms, empirical evidence, measurement frameworks, controversies, and open questions regarding emergent capabilities, referencing foundational and recent literature across scaling laws, information theory, mechanistic interpretability, and complex systems.

1. Formal Definitions and Theoretical Perspectives

"Emergent ability" in LLMs is operationally defined as a capability that is absent in smaller models but present in larger ones, such that its performance curve cannot be predicted simply from low-scale results (Wei et al., 2022). Wei et al. (2022) formalize this as follows: an ability is emergent if, for model size $N$, performance on some task is near-random below a threshold $N^*$, then jumps to a qualitatively new regime at or above $N^*$, violating predictable scaling trends.

Recent theory advances this further with a loss-centric definition: an emergent ability is present if, for pre-training loss $L$ and threshold $L^*$, the task accuracy is at baseline for all $L > L^*$ and increases only when $L \leq L^*$ (Du et al., 2024). This loss-based definition unifies size- and compute-thresholded perspectives, since scaling laws make the loss an approximately predictable function of model size, data, and compute.
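The loss-based definition can be turned into a simple check over a series of checkpoints. The following is a minimal sketch, not the procedure of Du et al.: the function name, the tolerance `tol`, and the checkpoint values are all hypothetical and chosen only for illustration.

```python
def estimate_loss_threshold(checkpoints, baseline, tol=0.02):
    """Estimate L* from (pre-training loss, task accuracy) pairs.

    Scans checkpoints from highest to lowest loss and returns the loss of
    the first checkpoint whose accuracy exceeds baseline + tol. Returns
    None if accuracy never leaves the baseline band.
    """
    # Highest loss first, i.e. weakest checkpoint first.
    ordered = sorted(checkpoints, key=lambda pair: pair[0], reverse=True)
    for loss, accuracy in ordered:
        if accuracy > baseline + tol:
            return loss
    return None


# Hypothetical checkpoints for a four-way multiple-choice task
# (random-guess baseline of 0.25).
checkpoints = [(3.0, 0.25), (2.6, 0.26), (2.3, 0.25), (2.1, 0.31), (1.9, 0.45)]
threshold = estimate_loss_threshold(checkpoints, baseline=0.25)
```

With these made-up values the estimate is 2.1. Because checkpoints are discrete, the underlying threshold can only be bracketed: it lies somewhere between the returned loss and the loss of the preceding checkpoint (2.3 here).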

Complex-systems views characterize emergence as a regime in which a high-dimensional model admits new, lower-dimensional effective theories or representations, such that macroscopic behavior is parsimoniously predicted by these coarse variables rather than by the full microscopic detail (Krakauer et al., 10 Jun 2025). Havlík (2025) distinguishes "weak emergence", in which macro-level properties become accessible only through full brute-force simulation, not through analytic reduction from micro-dynamics (Havlík, 6 Aug 2025).

2. Underlying Mechanisms and Theoretical Accounts

Loss Thresholds and Scaling Laws

Empirical scaling laws (e.g., Kaplan et al.’s power laws) show that cross-entropy loss typically decreases smoothly with model size or compute. However, downstream task accuracy can jump discontinuously at certain critical loss thresholds $L^*$ (Du et al., 2024), which can be mapped via scaling relations to critical sizes $N^*$. This establishes that emergent abilities are not always a simple function of raw parameter count, but of whether the model’s loss landscape traverses specific basins aligned with the target capability.
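As a rough illustration of how a loss threshold maps to a critical size, assume a pure power law in model size, $L(N) = (N_c / N)^{\alpha}$, which inverts to $N^* = N_c \, (L^*)^{-1/\alpha}$. The sketch below uses this form with placeholder constants `N_C` and `ALPHA`. These are assumptions for illustration, not values fitted to any model family discussed here, and losses measured on different data or tokenizers are not directly comparable.

```python
# Illustrative placeholder constants for L(N) = (N_C / N) ** ALPHA.
N_C = 8.8e13   # assumed scale constant, in parameters
ALPHA = 0.076  # assumed power-law exponent


def loss_from_size(n_params, n_c=N_C, alpha=ALPHA):
    """Loss predicted by the assumed power law for a model of n_params."""
    return (n_c / n_params) ** alpha


def critical_size(loss_threshold, n_c=N_C, alpha=ALPHA):
    """Invert the power law: the model size at which loss reaches L*."""
    return n_c * loss_threshold ** (-1.0 / alpha)


n_star = critical_size(2.2)
```

Under this assumed form, a lower loss threshold always corresponds to a larger critical size, and small changes in $L^*$ translate into large changes in $N^*$ because the exponent $1/\alpha$ is large.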

Bayesian Inference and Latent Space Sparsity

Latent space theory frames emergent abilities as implicit Bayesian inference over sparse joint distributions of meanings and utterances (Jiang, 2023). When the model approximates the marginal density of text sequences, dense coverage of the sparse joint distribution enables posterior inference of latent intentions, yielding emergent chain-of-thought, in-context learning, and robust instruction following—all arising from density estimation, not handcrafted reasoning modules.
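The inference step can be written out in generic notation. The symbols here are assumptions of this sketch rather than Jiang's own: $\theta$ denotes a latent intention and $x_{1:n}$ the observed tokens. A model that matches the marginal distribution of text implicitly performs

$$p(x_{n+1} \mid x_{1:n}) = \sum_{\theta} p(x_{n+1} \mid x_{1:n}, \theta)\, p(\theta \mid x_{1:n}), \qquad p(\theta \mid x_{1:n}) = \frac{p(x_{1:n} \mid \theta)\, p(\theta)}{\sum_{\theta'} p(x_{1:n} \mid \theta')\, p(\theta')}.$$

When the joint distribution is sparse, the posterior $p(\theta \mid x_{1:n})$ concentrates on few intentions as context accumulates, so the next-token prediction behaves as if the intention had been inferred.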

Non-Ergodicity and Path Dependence

Marín introduces a non-ergodic framework, showing LLMs' next-token predictions are fundamentally path-dependent: their accessible state space grows adaptively in time, governed by architectural, training, and contextual constraints (Marín, 3 Jan 2025). Capability emergence is mathematically characterized as a phase transition in the combinatorial-semantic space, governed by a resource-constrained extension of Kauffman's theory of the adjacent possible.

Symbolic and Subsymbolic Integration

Mechanistic studies in large models, e.g., Llama3-70B, reveal that abstract reasoning is instantiated by emergent, attention-based pipelines implementing symbol abstraction, induction, and retrieval, effectively building symbolic reasoning circuits within a deep, subsymbolic architecture (Yang et al., 27 Feb 2025). This bridges classical symbolic AI and neural networks and bears on long-standing debates about whether neural networks can perform variable binding and abstraction.

3. Empirical Characterization, Measurement, and Taxonomies

Empirical detection of emergence hinges on quantifying sharp transitions in performance:

Downstream Metrics: Task accuracy, BLEU, Brier Score, Correct Choice Probability, etc. (Wei et al., 2022, Berti et al., 28 Feb 2025).
Loss Threshold: Emergence is signaled by a stepped increase in accuracy when pre-training loss dips below a numerical threshold (e.g., $L^*\approx2.2$ for MMLU-like tasks) (Du et al., 2024).
Behavioral Discontinuity: Quantitative deviation from linear or power-law scaling as measured by MAE or RMSD between observed and predicted performance (O'Brien et al., 2024).
Information Emergence (IE): Direct measurement in hidden states, e.g., mutual information between macro-level sequence representations and micro-level token states, revealing entropy reduction above what is explained by individual tokens (Chen et al., 2024).
Performance Distribution Modeling: "Random Scaling" shows that what appears as deterministic emergence may reflect an underlying continuous transition in the probability of success across random seeds, modeled as a mixture of successful and unsuccessful modes (Zhao et al., 24 Feb 2025).
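The behavioral-discontinuity measure above can be sketched as follows: fit a trend to the smaller models, extrapolate it to the larger ones, and report the MAE and RMSD of the residuals. This is a minimal illustration, not the exact procedure of O'Brien et al. The linear-in-log-size trend, the function names, and the accuracy values are all hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolation_gap(sizes, accuracies, n_fit):
    """Fit on the n_fit smallest models, then score the remaining ones.

    Returns (MAE, RMSD) between observed accuracy and the accuracy
    extrapolated from the small-model trend in log10(model size).
    """
    logs = [math.log10(n) for n in sizes]
    slope, intercept = fit_line(logs[:n_fit], accuracies[:n_fit])
    residuals = [
        acc - (slope * x + intercept)
        for x, acc in zip(logs[n_fit:], accuracies[n_fit:])
    ]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmsd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return mae, rmsd


# Hypothetical model sizes (parameters) and accuracies, sorted by size.
sizes = [1e8, 1e9, 1e10, 1e11, 1e12]
smooth = [0.10, 0.20, 0.30, 0.40, 0.50]   # follows the small-model trend
jumpy = [0.10, 0.20, 0.30, 0.70, 0.90]    # departs from it at large scale

smooth_gap = extrapolation_gap(sizes, smooth, n_fit=3)
jumpy_gap = extrapolation_gap(sizes, jumpy, n_fit=3)
```

A gap near zero means the large models sit on the extrapolated trend. A large gap flags a candidate discontinuity, although the result depends on the assumed trend family and on how many small models are used for the fit.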

Table: Categories and Representative Tasks of Emergent Abilities

Category	Typical Tasks/Phenomena	Literature
Few-shot In-context Learning (ICL)	Classification, QA from prompt	(Wei et al., 2022, Berti et al., 28 Feb 2025)
Chain-of-Thought Reasoning	Multi-step math, logic	(Wei et al., 2022, Berti et al., 28 Feb 2025)
Symbolic Abstraction	Variable binding, analogy induction	(Yang et al., 27 Feb 2025, Webb et al., 2022)
Advanced Reasoning	Analogical, causal, or exam solving	(Webb et al., 2022, Tang et al., 2024)
Multimodal Prosody	Prosody ↔ text ↔ prosody processing	(Qian et al., 27 Jul 2025)
Scientific Tool Use	Autonomous experiment planning	(Boiko et al., 2023)

4. Conditions, Task Sensitivity, and Limits of Emergence

Task and Data Dependence

Actual emergence thresholds vary across tasks; arithmetic and code tasks often require higher scales than linguistic tasks. Data complexity critically shapes the emergence locus: simplifying the language yields emergence at much smaller models (Muckatira et al., 2024). In language adaptation, emergence is governed by early training dynamics rather than final loss, and critical periods exist where prior generalization must be preserved to prevent irreversible loss of abilities (Elhady et al., 30 May 2025).

Prompting and Metric Effects

Prompting strategies (e.g., few-shot, chain-of-thought, instruction-tuning) can trigger or mask emergence by engaging latent capabilities that remain dormant under baseline prompting (Lu et al., 2023, Berti et al., 28 Feb 2025). The choice between continuous and discontinuous metrics affects the perceived "cliffness" of capability onset but does not abolish true distributional multimodality (Zhao et al., 24 Feb 2025).
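The metric effect can be illustrated with a small calculation. Assume, purely for this sketch, that each token of a multi-token answer is produced correctly with independent probability `p_token`. A continuous metric (per-token accuracy) then rises smoothly, while a discontinuous one (exact match over the whole answer) equals `p_token ** length` and stays near zero until per-token accuracy is high. The independence assumption and the numbers below are illustrative only.

```python
def exact_match(p_token, length):
    """Probability that all `length` tokens are correct, assuming
    independent per-token success probability p_token."""
    return p_token ** length


# Per-token accuracy improving steadily across hypothetical model scales.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

# Exact match on a 10-token answer for the same models.
sequence_level = [exact_match(p, 10) for p in per_token]
```

At a per-token accuracy of 0.5 the exact-match rate is below 0.001, and at 0.9 it is about 0.35, so the same steady improvement looks like a cliff under the stricter metric.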

Absence in Some Domains

In code-centric (software engineering) tasks, controlled studies demonstrate no clear emergence up to mid-scale models (≤16B), with improvements explained by smooth scaling rather than sharp phase transitions. This suggests that emergence is not universal across all domains or objective functions (O'Brien et al., 2024).

5. Complexity, Internal Organization, and Origin of Emergence

Complex Systems Framework

Emergence in LLMs is a manifestation of the complex-system property "more is different": higher-scale systems admit lower-dimensional effective theories (e.g., new reasoning manifolds, phase transitions in covariance spectra of weights) that screen off microscopic detail but robustly predict macroscopic behaviors (Krakauer et al., 10 Jun 2025). In such systems, only by identifying the new, reduced basis can one claim genuine emergence.

Dynamical Systems and Nonlinear Organization

DNNs, including LLMs, are high-dimensional, nonlinear dynamical systems exhibiting sensitive dependence on initial parameters, stochastic gradients, and cooperative neuron interactions. Emergent macro-level behaviors, such as zero-shot reasoning or planning, cannot be analytically reduced to simple neuron-level activity—they arise from self-organizing, phase-transition–like dynamics (Havlík, 6 Aug 2025).

Randomness and Distributional Effects

Apparent emergence can reflect stochastic variation across different training seeds, with certain parameter regions supporting both "successful" and "unsuccessful" outcomes. Mode-flips in mixing weights, rather than deterministic transitions, often account for observed empirical "cliffs" in performance (Zhao et al., 24 Feb 2025).
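A toy simulation makes the mixture picture concrete. It is a sketch under assumed parameters, not the model of Zhao et al.: each training seed lands in either a "successful" mode (accuracy `high`) or an "unsuccessful" mode (accuracy `low`), and the probability of success rises smoothly with scale along a logistic curve whose `midpoint` and `steepness` are hypothetical.

```python
import math
import random


def success_probability(log10_params, midpoint=9.5, steepness=2.0):
    """Assumed smooth (logistic) mixing weight of the successful mode."""
    return 1.0 / (1.0 + math.exp(-steepness * (log10_params - midpoint)))


def sample_seed_accuracies(log10_params, n_seeds, rng, low=0.1, high=0.9):
    """Draw one accuracy per seed from the two-mode mixture."""
    weight = success_probability(log10_params)
    return [high if rng.random() < weight else low for _ in range(n_seeds)]


def expected_accuracy(log10_params, low=0.1, high=0.9):
    """Mean accuracy over seeds implied by the mixture."""
    weight = success_probability(log10_params)
    return weight * high + (1.0 - weight) * low


rng = random.Random(0)  # fixed seed so the sketch is reproducible
runs = sample_seed_accuracies(9.5, n_seeds=2000, rng=rng)
mean_accuracy = sum(runs) / len(runs)
```

Every individual run reports either 0.1 or 0.9, so a single seed per scale can look like an abrupt jump, while the mean over seeds follows the smooth curve given by `expected_accuracy`.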

6. Humanlike Capabilities, Real-World Impact, and Safety

LLMs demonstrate emergent humanlike patterns (decision biases, dual-process reasoning, and creativity) once threshold scale and tuning are reached (Tang et al., 2024). Certain biases and reasoning patterns (framing, risk aversion, category induction) are absent in small models but appear in larger ones, with reasoning and creativity exhibiting steep S-curve thresholds between 10B and 100B parameters.

Real-world implications involve both positive (autonomous scientific research, assistive creativity, multimodal processing) and negative (deception, reward hacking, manipulation) emergent behaviors (Boiko et al., 2023, Berti et al., 28 Feb 2025). These call for continuous evaluation frameworks, mechanistic monitoring, and regulatory oversight to manage risks attendant to the unpredictable appearance of new capabilities.

7. Open Controversies and Future Directions

Key controversies include whether all reported "emergence" truly differs from enhanced in-context learning or instruction recall (Lu et al., 2023); the universality of phase-transition–like scaling across modalities and architectures; and how to mechanistically anticipate, steer, or bound emerging capabilities for safety. Open research frontiers include developing predictive theories for scaling exponents from first principles, mapping causal emergence sites in activation space, extending task- and architecture-agnostic metrics, and designing more granular, task-adaptive benchmarks (Berti et al., 28 Feb 2025, Chen et al., 2024).

In summary, emergent capabilities in LLMs represent a deeply nonlinear phenomenon, arising at critical thresholds in loss, size, data coverage, or architectural complexity. They are instantiated by a confluence of scaling laws, sparse latent structure, non-ergodic path dependence, and dynamical systems principles. Understanding, forecasting, and safely harnessing these phenomena remain fundamental theoretical and practical challenges in deep learning research.