Neural AoA: Measuring Word Acquisition in LMs
- Neural Age of Acquisition (nAoA) is a metric that quantifies when neural models acquire words by measuring the drop in surprisal via a sigmoidal curve fit.
- It involves tracking word surprisal across training checkpoints to determine the step where the model achieves a 50% reduction between chance and its empirical minimum.
- Analyses using nAoA reveal frequency biases and architectural differences, highlighting the limits of distributional learning compared to human word acquisition.
Neural Age of Acquisition (nAoA) quantifies the training dynamics of word learning in neural LMs by defining the point during training at which a model can be considered to have “acquired” a specific lexical item. Formally, for each target word $w$, nAoA is the base-10 logarithm of the training step at which the model’s surprisal for $w$ reaches a value halfway between chance and its empirical minimum. This measure is obtained by fitting a sigmoidal curve to the observed learning curve of word surprisal over training, providing a direct analog to behavioral age of acquisition (AoA) metrics in developmental psycholinguistics. The construct was introduced by Chang and Bergen (2021), anchored through large-scale modeling of word acquisition using MacArthur–Bates Communicative Development Inventory (CDI) vocabulary, with comparative analyses across both recurrent and transformer-based architectures (Chang & Bergen, 2021).
1. Formal Definition of Neural Age of Acquisition
The nAoA metric stems from tracking the model surprisal $S_t(w)$ for word $w$ at training step $t$:

$$S_t(w) = -\log_2 P_t(w \mid \text{context}),$$

where $P_t(w \mid \text{context})$ is the model-predicted probability of $w$, averaged over up to 512 held-out contexts. The sequence of surprisals, indexed by $x = \log_{10}(t)$, is fit with a sigmoidal (logistic) curve:

$$\hat{S}(x) = L + \frac{U - L}{1 + e^{k(x - x_0)}}$$

Parameters:
- $U$ (“upper asymptote”): surprisal under uniform-chance prediction for $w$
- $L$ (“lower asymptote”): minimum empirical surprisal reached for $w$
- $k$: slope parameter
- $x_0$: midpoint of the curve

The acquisition cutoff $(U + L)/2$ is analogous to the “50% acquisition” threshold from child studies. The nAoA for $w$ is then the point at which the fitted curve crosses this cutoff:

$$\mathrm{nAoA}(w) = x_0,$$

i.e., the base-10 logarithm of the training step at which half of the surprisal gap between chance and the empirical minimum has been closed.
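The fitting procedure can be sketched as follows. This is a minimal illustration on a synthetic learning curve: the sigmoid parameterization and parameter names are illustrative, not taken from the original implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_surprisal(x, U, L, k, x0):
    """Decreasing logistic curve: surprisal falls from ~U (chance) to ~L (minimum)."""
    return L + (U - L) / (1.0 + np.exp(k * (x - x0)))

# Synthetic learning curve over log10(training step), 100 .. 1,000,000 steps.
x = np.log10(np.logspace(2, 6, 60))
rng = np.random.default_rng(0)
observed = sigmoid_surprisal(x, U=15.0, L=4.0, k=3.0, x0=4.2) + rng.normal(0, 0.1, x.size)

# Least-squares fit of the four sigmoid parameters.
p0 = [observed.max(), observed.min(), 1.0, float(x.mean())]
(U, L, k, x0), _ = curve_fit(sigmoid_surprisal, x, observed, p0=p0, maxfev=10_000)

# nAoA: the log10 step at which the fitted curve sits halfway between the
# asymptotes; for this parameterization that is exactly the fitted midpoint x0.
nAoA = x0
```

For a real model, `observed` would be the mean surprisal of one target word over held-out contexts at each saved checkpoint, and the fit would be repeated per word.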
This approach provides a rigorous operationalization of lexical acquisition within neural architectures, suitable for comparative developmental and computational analyses.
2. Methodological Framework for nAoA Estimation
The computation of nAoA involves systematic tracking of word surprisal during LM training on large text corpora. Models assessed include unidirectional LSTM, bidirectional LSTM, GPT-2-style transformer, and BERT-style transformer architectures.
Key workflow components:
- Training set: 25.6 million sentence pairs (BookCorpus + WikiText-103), tokenized using a unigram SentencePiece model (vocab ≈32,000 subwords), with all inputs lowercased.
- Evaluation contexts: 5.8 million held-out sentence pairs.
- Target vocabulary: 651 MacArthur–Bates CDI items, with 611 present as single SentencePiece tokens.
- Training checkpoints: ~200 sampled steps, denser sampling early (100 – 1,000,000 steps).
- Masking protocol: target word is masked in held-out contexts; model’s mean surprisal is recorded at each checkpoint.
- Fit procedure: nonlinear least-squares fit of the four sigmoid parameters, minimizing mean squared error (MSE).
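The masking-and-averaging step can be sketched as follows; `prob_fn` is a hypothetical stand-in for an LM forward pass on a context with the target token masked.

```python
import math

def mean_surprisal(prob_fn, word, contexts, max_contexts=512):
    """Mean surprisal (bits) of `word` over up to `max_contexts` held-out contexts.

    prob_fn(word, context) -> model probability of `word` at the masked position.
    """
    probs = [prob_fn(word, c) for c in contexts[:max_contexts]]
    return sum(-math.log2(p) for p in probs) / len(probs)

# Sanity check: a uniform-chance model over a ~32,000-subword vocabulary
# yields the "upper asymptote" surprisal of log2(32000) ~ 14.97 bits.
uniform = lambda word, context: 1.0 / 32_000
chance = mean_surprisal(uniform, "dog", ["ctx"] * 1000)
```

Recording this mean surprisal at every checkpoint yields the per-word learning curve to which the sigmoid is fit.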
Representative model scale and performance:

| Model | Params (M) | Eval Perplexity (PPL) |
|-----------------|:----------:|:---------------------:|
| LSTM | 37 | 54.8 |
| GPT-2-style | 108 | 30.2 |
| BiLSTM | 51 | 9.0 |
| BERT base | 109 | 7.2 |

Note that the bidirectional models (BiLSTM, BERT) predict tokens with access to both left and right context, so their perplexities are not directly comparable to those of the unidirectional models.
The acquisition cutoff $(U + L)/2$, set at the midpoint between the upper asymptote $U$ and the lower asymptote $L$, operationalizes nAoA as the step where half the surprisal gap is closed, mirroring psycholinguistic benchmarks.
3. Key Predictors of nAoA and Comparative Findings
Linear regression analyses were performed to identify factors predicting nAoA(w), spanning log-frequency, mean utterance length (MLU), word length (n-chars), concreteness, and lexical class. The primary effects are summarized below.
| Predictor | Effect in LMs | Effect in Children |
|---|---|---|
| Log-frequency | Strong negative | Weak, non-significant |
| MLU | Positive (except unidir LSTM) | Positive (matches LM pattern) |
| Word length | Negative (longer earlier) | Positive (shorter earlier) |
| Concreteness | No effect | Negative (concrete earlier) |
| Lexical class | Nouns, function later than verbs/adj (unidir LSTM/GPT-2); no effect (biLSTM, BERT) | Nouns < verbs < function (opposite) |
Key empirical findings:
- Log-frequency: More frequent words are acquired much earlier by LMs (adjusted $R^2$ of 0.91–0.94), unlike children, for whom frequency is a weak, non-significant predictor.
- MLU: Words appearing in longer utterances are learned later, matching child patterns except for the unidirectional LSTM.
- Word length: Longer words are acquired earlier in LMs, contrary to child data, where shorter words are acquired earlier.
- Concreteness: No significant effect in LMs, whereas children acquire more concrete words earlier.
- Lexical class: Discrepant patterns between LM families and child data in acquisition order of nouns, verbs, and function words.
These results highlight the distributional biases of LMs and underscore the divergence from human acquisition mechanisms, especially regarding grounding and conceptual connectivity.
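The regression setup above can be sketched with ordinary least squares on standardized predictors. The data here are synthetic, constructed only to mirror the reported frequency-dominated pattern; predictor values and effect sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 611  # CDI words kept as single SentencePiece tokens
# Standardized predictors: log-frequency, MLU, word length, concreteness (synthetic).
X = rng.normal(size=(n, 4))
# Synthetic nAoA dominated by a strong negative log-frequency effect, as in LMs.
naoa = 5.0 - 1.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.3, size=n)

A = np.column_stack([np.ones(n), X])             # intercept + predictors
coef, *_ = np.linalg.lstsq(A, naoa, rcond=None)  # OLS coefficients
resid = naoa - A @ coef
r2 = 1.0 - resid.var() / naoa.var()
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - X.shape[1] - 1)
```

With nAoA dominated by log-frequency, the fitted frequency coefficient is strongly negative and the adjusted $R^2$ is high, echoing the pattern reported for LMs.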
4. Training Dynamics: Unigram, Bigram, and Contextual Phases
Analysis of training checkpoints, using average KL divergence of model predictions versus various baselines, reveals structured acquisition regimes across all architectures:
- Unigram phase (early): Model predictions collapse towards corpus unigram statistics (KL to unigram ↓).
- Bigram phase (intermediate): Predictions approximate bigram probabilities, with KL to bigram ↓ after the unigram phase.
- Full-context phase (late): Predictions diverge from first- and second-order statistics, emphasizing full contextual integration (loss KL ↓ to one-hot reference).
This sequence recapitulates key transitions in statistical learning within neural LMs and illuminates the role of corpus-derived n-gram statistics in driving nAoA profiles.
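The phase diagnostic can be illustrated with toy distributions: a checkpoint whose predictions mimic unigram statistics has low KL divergence to the unigram baseline, while a context-sensitive late checkpoint does not. The distributions below are invented for illustration.

```python
import numpy as np

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

unigram = np.array([0.4, 0.3, 0.2, 0.1])      # toy corpus unigram distribution
early   = np.array([0.38, 0.31, 0.21, 0.10])  # early checkpoint: near-unigram output
late    = np.array([0.05, 0.85, 0.05, 0.05])  # late checkpoint: context-driven output

kl_early = kl_bits(early, unigram)   # small: unigram phase
kl_late  = kl_bits(late, unigram)    # large: full-context phase
```

Tracking such KL divergences to unigram, bigram, and one-hot references across checkpoints is what separates the three acquisition regimes.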
5. Implications for Model Design and Human-Like Acquisition
Distributional learning alone is insufficient to reproduce key human acquisition patterns, notably the effects of word concreteness and lexical class. LM nAoAs are dominated by word frequency and context length, reflecting the constraints of text-only, ungrounded training data.
Significant implications:
- Distributional vs. grounded learning: Children leverage sensorimotor grounding, social cues, and conceptual structure, leading to more balanced and ecologically valid word acquisition sequences.
- Model advancement: Incorporating multimodal grounding (e.g., vision, action), interactive objectives, or curriculum learning may align model nAoA trajectories more closely with those observed in human learners.
- Metric utility: nAoA profiles can function as fine-grained diagnostics for pretraining regimes and architectural choices, beyond the scope of common perplexity measures.
- Research directions: Alternative cutoff strategies, non-sigmoidal curve fits, or aggregation over multiple contexts present avenues for refining nAoA estimation and interpretation.
A plausible implication is that multimodal or socially interactive learning designs could mitigate existing nAoA–AoA discrepancies.
6. Role of nAoA in Evaluating LLM Pretraining Regimes
nAoA offers discriminative power for assessing LM training strategies, architectures, and related pretraining regimes. Profiles of nAoA for diverse lexical items expose subtle variances in acquisition dynamics, impact of architectural choices, and training paradigms. This suggests its application as a diagnostic framework to optimize LM learning toward more human-like patterns, extending beyond standard computational metrics such as perplexity.
In summary, neural Age of Acquisition situates word learning trajectories within principled psycholinguistic and computational frameworks, demonstrating lawful but frequency-driven acquisition in LMs and motivating future research toward models exhibiting more human-aligned lexical development (Chang et al., 2021).