Establish pretraining frequency–omission association beyond numbers (e.g., non-standard dialects)

Establish whether the frequency–omission association observed for numerical tokens generalizes to other types of information loss. For numbers, higher omission rates in reconstructions from pretrained Mamba state space language models correlate with lower counts per unique token in the pretraining corpus. The open question is whether the same relationship holds for tokens from non-standard English dialects such as African American Vernacular English (AAVE), which would require applying appropriate dialect taggers to The Pile (or Pile Uncopyrighted).

Background

The paper analyzes selective memory in pretrained Mamba LLMs by training auto-encoders to reconstruct inputs from hidden states and measuring omissions. It finds that numerical tokens and certain named entities are disproportionately forgotten. A training corpus analysis on a Pile Uncopyrighted sample shows an association between omission rates and pretraining frequencies for numbers (NUM), suggesting rarity contributes to forgetting. However, due to the lack of dialect taggers, the authors were unable to test whether similar frequency–omission relationships hold for other information types, particularly non-standard English dialects such as AAVE. Establishing this connection would clarify whether the observed pattern for numbers extends to broader categories of information loss.

References

While we demonstrate a strong connection between the forgetting of numerical data and the occurrence of numbers in Mamba's training corpus, we are unable to establish a similar connection for all types of information loss, such as non-standard dialects. This is due to the lack of available taggers for dialects, in contrast to the well-established taggers for parts-of-speech.

Characterizing Mamba's Selective Memory using Auto-Encoders (2512.15653 - Hossain et al., 17 Dec 2025) in Limitations, Training Corpus Analysis