
Masked Next-Token Prediction (MNTP)

Updated 19 November 2025
  • Neural age of acquisition (nAoA) is defined by fitting a sigmoidal curve to each word's surprisal trajectory during training, taking the 50% acquisition threshold as the key metric.
  • The methodology tracks word predictions across training steps, enabling robust comparisons between LM architectures using metrics such as perplexity and KL divergence.
  • Empirical results show that frequency and local distributional statistics drive acquisition in LMs, in contrast with human language learning, which relies more on conceptual grounding.

Neural Age of Acquisition (nAoA) is a quantitative metric designed to characterize when neural LMs acquire the ability to predict individual words during training. Introduced by Chang & Bergen (2021), nAoA operationalizes the training step, on a logarithmic scale, at which a model's uncertainty about a word decreases to a specific threshold, analogous to developmental age-of-acquisition metrics in child language research. By tracking the evolution of model predictions for hundreds of words during training, nAoA enables rigorous comparative analysis between different LM architectures and human word-learning dynamics, providing diagnostic insight into the mechanisms of lexical acquisition in artificial systems (Chang & Bergen, 2021).

1. Formal Definition of Neural Age of Acquisition

nAoA is defined through the learning curve of an LM's surprisal for a given word $w$ at training step $s$:

  • Surprisal is computed as $S(w,s) = -\log_2 P_s(w)$, where $P_s(w)$ is the model's average predicted probability of $w$, evaluated over up to 512 held-out contexts.
  • A sigmoidal curve, $\hat{S}(w,s) = L + \frac{U - L}{1 + \exp[-k(s - s_0)]}$, is fit to the sequence of surprisals across training:
    • $U$: upper asymptote (surprisal under a uniform-chance prediction),
    • $L$: lower asymptote (the minimal surprisal observed for $w$),
    • $k$: slope parameter,
    • $s_0$: midpoint of the sigmoid in training steps.
  • The acquisition cutoff surprisal is $S^* = (U + L)/2$, set analogously to the “50% acquisition” threshold in child studies.
  • The neural age of acquisition for word $w$ is $\mathrm{nAoA}(w) = \log_{10} s^*$, where $s^*$ is defined such that $\hat{S}(w, s^*) = S^*$.

This provides a standardized, model- and context-agnostic scale for comparing acquisition timing of lexical items.
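The fitting procedure can be sketched in Python. This is a minimal sketch using SciPy's `curve_fit`; fitting on a log10-step axis and the synthetic surprisal values below are assumptions for illustration, not the paper's data or code:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_surprisal(log_s, U, L, k, s0):
    # Sigmoidal surprisal curve over log10(training step), mirroring
    # S_hat = L + (U - L) / (1 + exp(-k * (s - s0))).
    return L + (U - L) / (1.0 + np.exp(-k * (log_s - s0)))

def neural_aoa(steps, surprisals):
    """Fit the sigmoid on a log10-step axis and return nAoA, the
    log10 step at which the fitted curve crosses S* = (U + L) / 2."""
    log_s = np.log10(steps)
    p0 = [surprisals.max(), surprisals.min(), -1.0, log_s.mean()]
    (U, L, k, s0), _ = curve_fit(sigmoid_surprisal, log_s, surprisals,
                                 p0=p0, maxfev=10000)
    # The sigmoid reaches its midpoint (U + L) / 2 exactly at s0.
    return s0

# Synthetic decreasing surprisal curve for illustration.
steps = np.logspace(2, 6, 50)  # checkpoints from 100 to 1,000,000 steps
surprisals = sigmoid_surprisal(np.log10(steps), 15.0, 2.0, -3.0, 4.0)
print(round(neural_aoa(steps, surprisals), 2))  # ≈ 4.0 (the true midpoint)
```

Because the fit is performed on a log10 axis here, the sigmoid midpoint $s_0$ already equals $\log_{10} s^*$, so no separate inversion step is needed.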

2. Computation and Modeling Workflow

To compute nAoA, Chang & Bergen (2021) utilized multiple LM architectures and a large, systematically chosen set of target words:

  • Models: Unidirectional LSTM (37M parameters), Bidirectional LSTM (51M), GPT-2-style Transformer (108M), and BERT base (109M), evaluated at perplexities 54.8, 9.0, 30.2, and 7.2, respectively.
  • Training Data: 25.6M sentence pairs (BookCorpus + WikiText-103), tokenized using a unigram-trained SentencePiece model (vocabulary ≈32,000 subword units, lowercased inputs).
  • Evaluation Regime: At ~200 checkpoints, sampled more densely early in training (100–1,000,000 steps), each target word $w$ was masked in held-out contexts and its average surprisal $S(w,s)$ was recorded.
  • Parameter Recovery: Sigmoid parameters $\{U, L, k, s_0\}$ were fit via least squares to each word’s learning curve.
  • Vocabulary Coverage: Targeted 651 English items from the MacArthur–Bates Communicative Development Inventory (CDI), with 611 mapped to single SentencePiece tokens.

This protocol allows direct mapping from raw training dynamics to interpretable acquisition profiles for substantial subsets of the lexicon.
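The per-checkpoint evaluation step can be sketched as follows. Here `predict_probs` is a hypothetical stand-in for a model forward pass that returns the distribution at the masked position; it is not an API from the paper:

```python
import math

def average_surprisal(word_id, contexts, predict_probs, max_contexts=512):
    """Average surprisal of `word_id` in bits: -log2 of the model's mean
    predicted probability over up to `max_contexts` held-out contexts
    in which the word is masked."""
    probs = [predict_probs(ctx).get(word_id, 1e-12)
             for ctx in contexts[:max_contexts]]
    mean_p = sum(probs) / len(probs)
    return -math.log2(mean_p)

# Stub model assigning the target token a fixed probability of 0.25.
stub = lambda ctx: {7: 0.25}
print(average_surprisal(7, ["ctx_a", "ctx_b"], stub))  # 2.0 bits
```

Note that, following the definition in Section 1, the probability is averaged across contexts before the log is taken, rather than averaging per-context surprisals.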

3. Predictors of nAoA in Neural LMs

Linear regressions were applied to predict nAoA for each word, using the following predictors: log-frequency, mean length of utterance (MLU), word length (number of characters), concreteness, and lexical class. Key findings include:

| Predictor | Effect in LMs | Effect in children | Detail |
|---|---|---|---|
| Log-frequency | Strong negative ($\beta_{\text{freq}} < 0$; $R^2 = 0.91$–$0.94$) | Weak ($R^2 = 0.01$) | More frequent words are learned earlier |
| MLU (context length) | Positive ($\beta_{\text{MLU}} > 0$ for GPT-2, BiLSTM, BERT; $p < .001$; n.s. for unidirectional LSTM) | Positive | Words in longer utterances are learned later |
| Word length | Negative ($\beta_{\text{chars}} < 0$, $p < .001$) | Positive ($\beta_{\text{chars}} > 0$, $p < .001$) | LMs learn longer words earlier; children learn shorter words earlier |
| Concreteness | No significant effect | Negative ($\beta_{\text{conc}} < 0$, $p < .001$) | Children acquire more concrete words earlier |
| Lexical class | Nouns and function words later (uniLSTM, GPT-2); no effect (biLMs) | Nouns < verbs < function words | $p < .01$; ordering reversed in children |

This analysis reveals pronounced reliance of LMs on frequency and local distributional statistics, diverging from human reliance on semantic or conceptual content.
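A regression of this kind can be mimicked with ordinary least squares. All features and coefficients below are synthetic, chosen only to reproduce the qualitative LM pattern (frequency dominant, word length negative, concreteness null); they are not the paper's estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 611  # number of CDI words mapped to single tokens

# Hypothetical standardized per-word predictors.
log_freq = rng.normal(size=n)
mlu = rng.normal(size=n)
n_chars = rng.normal(size=n)
concreteness = rng.normal(size=n)

# Simulated LM pattern: frequency dominates, word length is negative,
# concreteness contributes nothing.
naoa = (4.0 - 1.2 * log_freq + 0.15 * mlu - 0.2 * n_chars
        + rng.normal(scale=0.1, size=n))

X = np.column_stack([np.ones(n), log_freq, mlu, n_chars, concreteness])
beta, *_ = np.linalg.lstsq(X, naoa, rcond=None)
r2 = 1.0 - ((naoa - X @ beta) ** 2).sum() / ((naoa - naoa.mean()) ** 2).sum()
print(beta.round(2))  # intercept, log-freq, MLU, chars, concreteness
print(round(r2, 2))   # high R^2, as frequency carries most variance
```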

4. Characteristic Training Dynamics

Tracking the model's masked-token output distribution at each checkpoint, via its average KL divergence from various reference distributions, illuminates the acquisition process:

  1. Early Training: The model’s output distribution for masked tokens collapses towards the corpus unigram distribution; the average KL divergence to the unigram baseline diminishes.
  2. Intermediate Training: Model predictions transition to approximating bigram statistics; KL divergence to bigram statistics drops after the unigram phase.
  3. Late Training: Divergence from unigram and bigram statistics increases, with model predictions converging on the full contextual, highly specific empirical distributions; KL divergence with respect to the one-hot “true” distribution decreases.

This sequential transition holds across all tested architectures, including both unidirectional and bidirectional models (Chang & Bergen, 2021).
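The checkpoint-level comparison can be sketched with a toy KL computation. The distributions below are invented, and the direction KL(model ∥ reference) is one reasonable convention, not necessarily the paper's:

```python
import numpy as np

def kl_bits(p, q, eps=1e-12):
    """KL(p || q) in bits, with smoothing to avoid log(0)."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log2(p / q)))

# Toy 4-token vocabulary with made-up reference distributions.
unigram = [0.40, 0.30, 0.20, 0.10]
bigram = [0.10, 0.70, 0.10, 0.10]       # conditioned on the previous token
model_early = [0.38, 0.32, 0.20, 0.10]  # near-unigram predictions
model_mid = [0.12, 0.65, 0.12, 0.11]    # near-bigram predictions

# Early in training the model sits closest to the unigram baseline;
# later its predictions move toward the bigram statistics.
print(kl_bits(model_early, unigram) < kl_bits(model_early, bigram))  # True
print(kl_bits(model_mid, bigram) < kl_bits(model_mid, unigram))      # True
```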

5. Interpretive Context: LMs Versus Children

Comparisons with developmental child data reveal key contrasts:

  • LMs rely far more on frequency as a predictor of word acquisition (substantially higher R2R^2) than children.
  • The effect of mean utterance length is robust in both systems, with words in longer utterances being acquired later.
  • LMs show a reversed effect for word length and null effect for concreteness, contrary to child data—children acquire shorter and more concrete words earlier.
  • Lexical class effects in LMs diverge by architecture (strong noun delay in uniLSTM/GPT-2; null in biLMs), with opposite ordering to that observed in children.

These results underscore the importance of sensorimotor and social grounding in children, which is not captured by distributional learning from text alone.

6. Implications and Research Trajectories

The findings on nAoA carry broader implications for both the study and development of neural LMs:

  • Distributional Mechanisms: LMs manifest a staged acquisition regime shaped by frequency and n-gram statistics (unigram, then bigram, then contextual), in contrast to the multifaceted grounding of child language learning.
  • Model Evaluation: nAoA profiles constitute a fine-grained diagnostic for pretraining regimes and architecture selection, supplementing aggregate metrics such as perplexity.
  • Toward Human-like LM Acquisition: Future research may explore integration of multimodal grounding (e.g., vision, action), interactive or curriculum-based learning, and alternative nAoA definitions (e.g., based on other surprisal quantiles or non-sigmoid fits).
  • Limitations: Current nAoA methodology is restricted to text-only training and does not inherently account for the conceptual or perceptual dimensions of word meaning.

This suggests that achieving human-like vocabulary acquisition in LMs may require explicit inclusion of sensorimotor grounding and functionally relevant learning objectives beyond next-token prediction (Chang & Bergen, 2021).
