
Emergent Abilities in LLMs

Updated 8 November 2025
  • Emergent abilities in LLMs are commonly defined as capabilities that appear abruptly in large models; the paper argues that these apparent jumps often stem from the step-function nature of nonlinear evaluation metrics.
  • Metric-induced emergence shows that discontinuous measures like exact match can create the illusion of sudden leaps, whereas continuous metrics reveal smooth, predictable improvements.
  • This paradigm shift calls for rigorous metric selection, transparent reporting, and statistically robust evaluation to accurately assess model scaling and behavior.

LLMs have catalyzed extensive scientific discussion due to the appearance of "emergent abilities": capabilities that seem to manifest abruptly as model scale increases, but are absent from smaller models. The interpretation, significance, and measurement of these phenomena have been hotly debated. In one influential critique, Schaeffer, Miranda, and Koyejo present a comprehensive challenge to the standard view of emergence in LLMs, proposing that many, if not all, claimed emergent abilities arise not from intrinsic properties of neural scaling, but from the choice of evaluation metric. Their work rigorously investigates the mathematical, experimental, and statistical underpinnings of emergence, offering a paradigm shift in how such abilities should be evaluated and interpreted in LLM research.

1. Historical and Conceptual Context

Emergent abilities in LLMs are commonly defined as abilities present in large models which are absent in smaller models, with the transition in capability being abrupt and unpredictable with respect to model size or compute. These abilities have historically been identified by plotting downstream task performance against model scale and observing sharp "phase transitions"—as seen with arithmetic, multi-step reasoning, and various benchmarks—using accuracy or other discrete metrics. The prevailing assumption in the literature was that such emergence indicated fundamental, qualitative changes in model behavior or learning dynamics, analogous to phase transitions in physical systems (Wei et al., 2022).

2. Metric-Induced Emergence: Core Arguments and Mathematical Framework

The principal argument advanced by Schaeffer et al. is that emergent abilities are in most instances artifacts of the evaluation metric's mathematical structure rather than evidence of a qualitative change in model cognition. Specifically, if one selects a nonlinear or discontinuous metric—for example, exact string match, multiple-choice grade, or all-or-nothing sequence accuracy—a smooth, continuous improvement in per-token prediction can appear as a sudden, threshold-crossing leap in the chosen metric.

The authors formally model this using established scaling laws for per-token cross-entropy loss:

$$\mathcal{L}_{CE}(N) = \left(\frac{N}{c}\right)^\alpha$$

where $N$ is the parameter count, $c$ a scaling constant, and $\alpha$ a negative exponent.

For a sequence of length $L$, if the probability of generating a correct token is $p_{\text{token}} = \exp(-\mathcal{L}_{CE}(N))$, then:

Nonlinear metric (exact match):

$$\text{Accuracy}(N) \approx p_{\text{token}}^L = \exp(-L \cdot \mathcal{L}_{CE}(N))$$

This function decays geometrically in $L$, producing the appearance of an abrupt transition (see Fig. 1b of the paper).

Linear metric (token edit distance):

$$\text{Token Edit Distance}(N) \approx L \cdot (1 - p_{\text{token}})$$

This changes smoothly as a function of $N$ (see Fig. 1c), and no sharp "emergence" is observed.

Implication: The discontinuity lies in the step-function nature of the metric, not in the underlying improvement of the model as it scales.
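The contrast between the two metrics can be sketched numerically. The following is an illustrative simulation, not the paper's code: the constants `c` and `alpha` are arbitrary placeholders chosen to make the effect visible at small scales.

```python
import math

def per_token_prob(n_params, c=1e6, alpha=-0.5):
    """Smooth per-token correctness probability implied by the assumed
    cross-entropy scaling law L_CE(N) = (N / c) ** alpha.
    c and alpha are illustrative, not fit to any real model."""
    loss = (n_params / c) ** alpha
    return math.exp(-loss)

def exact_match_accuracy(n_params, seq_len):
    """Nonlinear metric: all seq_len tokens must be correct."""
    return per_token_prob(n_params) ** seq_len

def token_edit_distance(n_params, seq_len):
    """Linear metric: expected number of wrong tokens."""
    return seq_len * (1.0 - per_token_prob(n_params))

# Sweep model sizes: exact match looks "emergent" (near zero, then a
# sudden climb), while edit distance shrinks smoothly the whole way.
for n in [1e6, 1e7, 1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"N={n:.0e}  EM={exact_match_accuracy(n, 30):.4f}  "
          f"TED={token_edit_distance(n, 30):.2f}")
```

Both metrics are driven by the same smooth `per_token_prob` curve; only the exponentiation by sequence length manufactures the apparent threshold.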

3. Empirical Evidence: NLP and Vision Model Analyses

A. InstructGPT/GPT-3 Arithmetic Tasks

  • Under exact match accuracy, models display a sharp "emergent" jump on tasks such as 2-integer 2-digit multiplication as size increases.
  • When evaluated with token edit distance, performance improves monotonically and smoothly with model size, showing no abrupt transition.
  • With higher-resolution statistics (larger test sets), the low accuracy of small models is shown to be above chance, and the perceived transition smooths further.
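Token edit distance can be computed with a standard Levenshtein dynamic program. A minimal sketch (not taken from the paper) shows how it awards partial credit on a near-miss arithmetic answer where exact match gives none:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings: minimum number of
    insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# A near-miss on a multiplication problem: exact match gives zero
# credit, while edit distance records that only one digit is wrong.
target, prediction = "2024", "2034"
print(levenshtein(prediction, target))  # 1
print(prediction == target)             # False
```

Averaged over a test set, this per-example partial credit is what turns the apparent jump into a smooth curve.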

B. BIG-Bench Meta-Analysis

  • Over 92% of claimed emergent abilities in BIG-Bench benchmarks occur only under nonlinear or discontinuous metrics (Multiple Choice Grade, Exact String Match).
  • Applying continuous metrics (such as Brier Score) to the same models on the same tasks causes apparent "emergence" to vanish, revealing smooth improvement.
  • Emergence scores (a derivative of scaling curve steepness in the literature) frequently reflect metric artifacts rather than genuine cognitive leaps.
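The difference between Multiple Choice Grade and Brier Score is visible on a single item. A minimal illustration with hypothetical probability vectors (not BIG-Bench data):

```python
def brier_score(probs, correct_idx):
    """Brier score over answer options: mean squared error between the
    model's probability vector and the one-hot correct answer.
    Lower is better, and it improves smoothly as probability mass
    shifts toward the right option."""
    return sum((p - (i == correct_idx)) ** 2
               for i, p in enumerate(probs)) / len(probs)

def multiple_choice_grade(probs, correct_idx):
    """Discontinuous metric: full credit only if the argmax is correct."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return 1.0 if top == correct_idx else 0.0

# Model A barely prefers a wrong answer; model B barely prefers the
# right one (correct answer is index 1). The underlying probabilities
# differ by 0.01, yet the discontinuous grade jumps from 0 to 1.
a = [0.26, 0.25, 0.25, 0.24]
b = [0.25, 0.26, 0.25, 0.24]
print(multiple_choice_grade(a, 1), multiple_choice_grade(b, 1))  # 0.0 1.0
print(brier_score(a, 1), brier_score(b, 1))
```

A tiny, smooth shift in the model's distribution moves the Brier score slightly but flips the multiple-choice grade entirely, which is the metric artifact the meta-analysis identifies.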

C. Artificial Emergence in Vision Models

By applying discontinuous metrics to computer vision models (e.g., autoencoders or classifiers), the authors are able to manufacture emergent-like transitions. For example:

  • In autoencoding, changing from MSE to a thresholded “reconstruction” score induces an emergent jump.
  • In sequence classification, requiring all items in a sequence to be correct (subset accuracy) creates artificial phase transitions.

This demonstrates the phenomenon is metric-driven and model-agnostic, not a property of LLMs alone.
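The thresholding trick is easy to reproduce in a few lines. This sketch assumes an illustrative power-law reconstruction-error curve (arbitrary constants, not a real autoencoder run):

```python
def mse(n_params, c=1e6, alpha=-0.5):
    """Assumed smooth reconstruction error that improves with scale.
    The power law and its constants are illustrative placeholders."""
    return (n_params / c) ** alpha

def thresholded_reconstruction(n_params, tol=0.05):
    """Discontinuous metric: a reconstruction counts as a success
    only when its error falls below a fixed tolerance."""
    return 1.0 if mse(n_params) < tol else 0.0

# MSE shrinks smoothly, but the thresholded score snaps from 0 to 1
# the moment the error curve crosses the tolerance.
for n in [1e6, 1e7, 1e8, 1e9, 1e10]:
    print(f"N={n:.0e}  MSE={mse(n):.4f}  "
          f"thresholded={thresholded_reconstruction(n)}")
```

Any monotonically improving continuous score can be binarized this way, which is why the authors can induce "emergence" in models that have nothing to do with language.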

| Evaluation Domain | Nonlinear Metric | Observed Emergence? | Continuous Metric | Observed Emergence? |
|---|---|---|---|---|
| Arithmetic | Exact Match | Yes | Token Edit Distance | No |
| BIG-Bench | MC Grade, Exact String Match | Yes | Brier Score | No |
| Vision | Subset Accuracy / Thresholding | Yes | MSE, Edit Distance | No |

4. Statistical and Experimental Considerations

The illusion of emergence is compounded by statistical artifacts:

  • Resolution limitation: With coarse test sets and step-function metrics, small improvements in small models are invisible, artificially heightening the contrast at the "emergent" threshold. Larger datasets and finer-grained measurement attenuate illusory phase transitions.
  • Multiple comparisons: The proliferation of tasks, metrics, and models, especially in large benchmark suites, elevates the probability of finding spurious sharp transitions by chance.
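The resolution effect can be quantified analytically: the probability that a finite test set reports exactly zero accuracy for a model that is weak but above chance.

```python
def prob_measured_zero(true_acc, n_items):
    """Probability that a test set of n_items independent questions
    shows exactly 0% accuracy when the model's per-question success
    probability is true_acc."""
    return (1 - true_acc) ** n_items

# A model with true accuracy 0.3% usually reads as exactly 0% on a
# 100-item test set, but essentially never on a 100,000-item one.
print(prob_measured_zero(0.003, 100))
print(prob_measured_zero(0.003, 100_000))
```

Small models' real (nonzero) competence is thus rounded down to zero by coarse test sets, exaggerating the contrast with larger models at the "emergent" threshold.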

5. Redefining Emergence: Implications for LLMs and Benchmarks

The conclusions articulated in the paper have broad implications:

  • Distinguish Task from Metric: The presence or absence of emergence is not an intrinsic property of the task or model, but a function of metric construction. Researchers must rigorously separate which metric is being reported from the underlying task.
  • Caution in Metric Selection: Nonlinear and discontinuous metrics can artificially manufacture the appearance of phase transitions; continuous metrics more faithfully reflect true scaling behavior.
  • Reporting Practices: Reproducibility, transparency about metric selection, and reporting multiple metric variants are crucial for robust scientific progress in LLM scaling.
  • Benchmark and Model Design: Future benchmarks should favor metrics capable of capturing incremental improvement rather than arbitrarily binarizing outcomes, and should control for statistical artifacts via dataset size and multiple comparison correction.

6. Conceptual Reframing and Future Trajectory

The net effect of these findings is a paradigm shift from interpreting emergent abilities as mysterious, qualitative leaps to viewing them as artifacts of chosen evaluation procedures. The notion of LLM emergence is thereby tied as much to measurement methodology as to model dynamics, and must be scrutinized in both the design and the interpretation of future model evaluations. Careful attention to metric construction undercuts many of the bold claims of unpredictability and discontinuity that have fueled speculative discourse about the cognitive or safety risks of next-generation LLMs.

Researchers are encouraged to adopt statistically robust, continuous, and transparent evaluation schemes as standard practice for all future work on emergent phenomena in LLMs. Emergent abilities should not, without such analysis, be treated as evidence of new forms of qualitative, unpredictable model cognition or “phase transitions” in behavior. Only with such methodological vigilance can the true properties of intelligent behavior in scaled neural systems be discerned and interpreted.
