Neural AoA: Measuring Word Acquisition in LMs
- Neural Age of Acquisition (nAoA) is a metric that quantifies when neural models acquire words by measuring the drop in surprisal via a sigmoidal curve fit.
- It involves tracking word surprisal across training checkpoints to determine the step where the model achieves a 50% reduction between chance and its empirical minimum.
- Analyses using nAoA reveal frequency biases and architectural differences, highlighting the limits of distributional learning compared to human word acquisition.
Neural Age of Acquisition (nAoA) quantifies the training dynamics of word learning in neural LMs by defining the point during training at which a model can be considered to have “acquired” a specific lexical item. Formally, for each target word $w$, nAoA is the base-10 logarithm of the training step at which the model’s surprisal for $w$ reaches a value halfway between chance and its empirical minimum. This measure is obtained by fitting a sigmoidal curve to the observed learning curve of word surprisal over training, providing a direct analog to behavioral age of acquisition (AoA) metrics in developmental psycholinguistics. The construct was introduced by Chang and Bergen (2021), anchored through large-scale modeling of word acquisition using MacArthur–Bates Communicative Development Inventory (CDI) vocabulary, with comparative analyses across both recurrent and transformer-based architectures (Chang & Bergen, 2021).
1. Formal Definition of Neural Age of Acquisition
The nAoA metric stems from tracking the model surprisal $S_t(w)$ for word $w$ at training step $t$:

$$S_t(w) = -\log_2 P_t(w \mid \text{context}),$$

where $P_t(w \mid \text{context})$ is the model-predicted probability of $w$, averaged over up to 512 held-out contexts. The sequence of surprisals, indexed by $x = \log_{10}(t)$, is fit with a sigmoidal (logistic) curve:

$$\hat{S}(x) = L + \frac{U - L}{1 + e^{k(x - x_0)}}$$

Parameters:
- $U$ (“upper asymptote”): surprisal under uniform-chance prediction for $w$
- $L$ (“lower asymptote”): minimum empirical surprisal reached for $w$
- $k$: slope parameter
- $x_0$: midpoint of the curve

The acquisition cutoff $(U + L)/2$ is analogous to the “50% acquisition” threshold from child studies. The nAoA for $w$ is then the point at which the fitted curve crosses this cutoff:

$$\mathrm{nAoA}(w) = x_0,$$

i.e., the base-10 logarithm of the training step at which half of the surprisal gap between chance and the empirical minimum has been closed.
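The fitting procedure can be sketched as follows. This is a minimal illustration on a synthetic learning curve: the sigmoid parameterization and parameter names are illustrative, not taken from the original implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_surprisal(x, U, L, k, x0):
    """Decreasing logistic curve: surprisal falls from ~U (chance) to ~L (minimum)."""
    return L + (U - L) / (1.0 + np.exp(k * (x - x0)))

# Synthetic learning curve over log10(training step), 100 .. 1,000,000 steps.
x = np.log10(np.logspace(2, 6, 60))
rng = np.random.default_rng(0)
observed = sigmoid_surprisal(x, U=15.0, L=4.0, k=3.0, x0=4.2) + rng.normal(0, 0.1, x.size)

# Least-squares fit of the four sigmoid parameters.
p0 = [observed.max(), observed.min(), 1.0, float(x.mean())]
(U, L, k, x0), _ = curve_fit(sigmoid_surprisal, x, observed, p0=p0, maxfev=10_000)

# nAoA: the log10 step at which the fitted curve sits halfway between the
# asymptotes; for this parameterization that is exactly the fitted midpoint x0.
nAoA = x0
```

For a real model, `observed` would be the mean surprisal of one target word over held-out contexts at each saved checkpoint, and the fit would be repeated per word.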
This approach provides a rigorous operationalization of lexical acquisition within neural architectures, suitable for comparative developmental and computational analyses.
2. Methodological Framework for nAoA Estimation
The computation of nAoA involves systematic tracking of word surprisal during LM training on large text corpora. Models assessed include unidirectional LSTM, bidirectional LSTM, GPT-2-style transformer, and BERT-style transformer architectures.
Key workflow components:
- Training set: 25.6 million sentence pairs (BookCorpus + WikiText-103), tokenized using a unigram SentencePiece model (vocab ≈32,000 subwords), with all inputs lowercased.
- Evaluation contexts: 5.8 million held-out sentence pairs.
- Target vocabulary: 651 MacArthur–Bates CDI items, with 611 present as single SentencePiece tokens.
- Training checkpoints: ~200 sampled steps, denser sampling early (100 – 1,000,000 steps).
- Masking protocol: target word is masked in held-out contexts; model’s mean surprisal is recorded at each checkpoint.
- Fit procedure: nonlinear least-squares fit of the four sigmoid parameters, minimizing mean squared error (MSE).
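The masking-and-averaging step can be sketched as follows; `prob_fn` is a hypothetical stand-in for an LM forward pass on a context with the target token masked.

```python
import math

def mean_surprisal(prob_fn, word, contexts, max_contexts=512):
    """Mean surprisal (bits) of `word` over up to `max_contexts` held-out contexts.

    prob_fn(word, context) -> model probability of `word` at the masked position.
    """
    probs = [prob_fn(word, c) for c in contexts[:max_contexts]]
    return sum(-math.log2(p) for p in probs) / len(probs)

# Sanity check: a uniform-chance model over a ~32,000-subword vocabulary
# yields the "upper asymptote" surprisal of log2(32000) ~ 14.97 bits.
uniform = lambda word, context: 1.0 / 32_000
chance = mean_surprisal(uniform, "dog", ["ctx"] * 1000)
```

Recording this mean surprisal at every checkpoint yields the per-word learning curve to which the sigmoid is fit.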
Representative model scale and performance:

| Model | Params (M) | Eval Perplexity (PPL) |
|-----------------|:----------:|:---------------------:|
| LSTM | 37 | 54.8 |
| GPT-2-style | 108 | 30.2 |
| BiLSTM | 51 | 9.0 |
| BERT base | 109 | 7.2 |

Note that the bidirectional models (BiLSTM, BERT) predict tokens with access to both left and right context, so their perplexities are not directly comparable to those of the unidirectional models.
The acquisition cutoff $(U + L)/2$, set at the midpoint between the upper asymptote $U$ and the lower asymptote $L$, operationalizes nAoA as the step where half the surprisal gap is closed, mirroring psycholinguistic benchmarks.
3. Key Predictors of nAoA and Comparative Findings
Linear regression analyses were performed to identify factors predicting nAoA(w), spanning log-frequency, mean utterance length (MLU), word length (n-chars), concreteness, and lexical class. The primary effects are summarized below.
| Predictor | Effect in LMs | Effect in Children |
|---|---|---|
| Log-frequency | Strong negative | Weak, non-significant |
| MLU | Positive (except unidir LSTM) | Positive (matches LM pattern) |
| Word length | Negative (longer earlier) | Positive (shorter earlier) |
| Concreteness | No effect | Negative (concrete earlier) |
| Lexical class | Nouns, function later than verbs/adj (unidir LSTM/GPT-2); no effect (biLSTM, BERT) | Nouns < verbs < function (opposite) |
Key empirical findings:
- Log-frequency: More frequent words are acquired much earlier by LMs (adjusted $R^2$ of 0.91–0.94), unlike children, for whom frequency is a weak, non-significant predictor.
- MLU: Words appearing in longer utterances are learned later, matching child patterns except for the unidirectional LSTM.
- Word length: Longer words are acquired earlier in LMs, contrary to child data, where shorter words are acquired earlier.
- Concreteness: No significant effect in LMs, whereas children acquire more concrete words earlier.
- Lexical class: Discrepant patterns between LM families and child data in acquisition order of nouns, verbs, and function words.
These results highlight the distributional biases of LMs and underscore the divergence from human acquisition mechanisms, especially regarding grounding and conceptual connectivity.
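The regression setup above can be sketched with ordinary least squares on standardized predictors. The data here are synthetic, constructed only to mirror the reported frequency-dominated pattern; predictor values and effect sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 611  # CDI words kept as single SentencePiece tokens
# Standardized predictors: log-frequency, MLU, word length, concreteness (synthetic).
X = rng.normal(size=(n, 4))
# Synthetic nAoA dominated by a strong negative log-frequency effect, as in LMs.
naoa = 5.0 - 1.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.3, size=n)

A = np.column_stack([np.ones(n), X])             # intercept + predictors
coef, *_ = np.linalg.lstsq(A, naoa, rcond=None)  # OLS coefficients
resid = naoa - A @ coef
r2 = 1.0 - resid.var() / naoa.var()
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - X.shape[1] - 1)
```

With nAoA dominated by log-frequency, the fitted frequency coefficient is strongly negative and the adjusted $R^2$ is high, echoing the pattern reported for LMs.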
4. Training Dynamics: Unigram, Bigram, and Contextual Phases
Analysis of training checkpoints, using average KL divergence of model predictions versus various baselines, reveals structured acquisition regimes across all architectures:
- Unigram phase (early): Model predictions collapse towards corpus unigram statistics (KL to unigram ↓).
- Bigram phase (intermediate): Predictions approximate bigram probabilities, with KL to bigram ↓ after the unigram phase.
- Full-context phase (late): Predictions diverge from first- and second-order statistics, emphasizing full contextual integration (loss KL ↓ to one-hot reference).
This sequence recapitulates key transitions in statistical learning within neural LMs and illuminates the role of corpus-derived n-gram statistics in driving nAoA profiles.
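The phase diagnostic can be illustrated with toy distributions: a checkpoint whose predictions mimic unigram statistics has low KL divergence to the unigram baseline, while a context-sensitive late checkpoint does not. The distributions below are invented for illustration.

```python
import numpy as np

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

unigram = np.array([0.4, 0.3, 0.2, 0.1])      # toy corpus unigram distribution
early   = np.array([0.38, 0.31, 0.21, 0.10])  # early checkpoint: near-unigram output
late    = np.array([0.05, 0.85, 0.05, 0.05])  # late checkpoint: context-driven output

kl_early = kl_bits(early, unigram)   # small: unigram phase
kl_late  = kl_bits(late, unigram)    # large: full-context phase
```

Tracking such KL divergences to unigram, bigram, and one-hot references across checkpoints is what separates the three acquisition regimes.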
5. Implications for Model Design and Human-Like Acquisition
Distributional learning alone is insufficient to reproduce key human acquisition patterns, notably the effects of word concreteness and lexical class. LM nAoAs are dominated by word frequency and context length, reflecting the constraints of text-only, ungrounded training data.
Significant implications:
- Distributional vs. grounded learning: Children leverage sensorimotor grounding, social cues, and conceptual structure, leading to more balanced and ecologically valid word acquisition sequences.
- Model advancement: Incorporating multimodal grounding (e.g., vision, action), interactive objectives, or curriculum learning may align model nAoA trajectories more closely with those observed in human learners.
- Metric utility: nAoA profiles can function as fine-grained diagnostics for pretraining regimes and architectural choices, beyond the scope of common perplexity measures.
- Research directions: Alternative cutoff strategies, non-sigmoidal curve fits, or aggregation over multiple contexts present avenues for refining nAoA estimation and interpretation.
A plausible implication is that multimodal or socially interactive learning designs could mitigate existing nAoA–AoA discrepancies.
6. Role of nAoA in Evaluating LLM Pretraining Regimes
nAoA offers discriminative power for assessing LM training strategies, architectures, and related pretraining regimes. Profiles of nAoA for diverse lexical items expose subtle variances in acquisition dynamics, impact of architectural choices, and training paradigms. This suggests its application as a diagnostic framework to optimize LM learning toward more human-like patterns, extending beyond standard computational metrics such as perplexity.
In summary, neural Age of Acquisition situates word learning trajectories within principled psycholinguistic and computational frameworks, demonstrating lawful but frequency-driven acquisition in LMs and motivating future research toward models exhibiting more human-aligned lexical development (Chang et al., 2021).