Establish pretraining frequency–omission association beyond numbers (e.g., non-standard dialects)
Establish whether the frequency–omission association observed for numerical tokens generalizes to other types of information loss, including tokens from non-standard English dialects such as African American Vernacular English. For numerical tokens, higher omission rates in reconstructions from pretrained Mamba state space language models correlate with lower counts per unique token in the pretraining corpus. Testing the generalization would require applying appropriate dialect taggers to The Pile (or Pile Uncopyrighted).
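One way to make the association concrete is as a rank correlation across token categories between two quantities: the count per unique token in the pretraining corpus and the omission rate in reconstructions. The following is a minimal sketch under stated assumptions, not the source's method. All function names are hypothetical. The category labels are assumed to come from some tagger (for dialects, no such tagger is given here), and the per-token `kept_flags` alignment between source and reconstruction is assumed to be computed elsewhere. Spearman correlation is an illustrative choice of statistic.

```python
from collections import Counter
from statistics import mean


def counts_per_unique_token(corpus_tokens, tags):
    """Per category: total corpus occurrences divided by distinct tokens."""
    totals = Counter()
    uniques = {}
    for tok, tag in zip(corpus_tokens, tags):
        totals[tag] += 1
        uniques.setdefault(tag, set()).add(tok)
    return {tag: totals[tag] / len(uniques[tag]) for tag in totals}


def omission_rates(source_tags, kept_flags):
    """Per category: fraction of source tokens missing from the reconstruction.

    kept_flags[i] is True if source token i survives in the reconstruction
    (assumed to be produced by a separate alignment step).
    """
    totals = Counter()
    omitted = Counter()
    for tag, kept in zip(source_tags, kept_flags):
        totals[tag] += 1
        if not kept:
            omitted[tag] += 1
    return {tag: omitted[tag] / totals[tag] for tag in totals}


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy data for illustration only; not taken from any corpus or experiment.
    corpus = ["1", "2", "3", "the", "the", "the", "a", "a"]
    corpus_tags = ["NUM", "NUM", "NUM", "DET", "DET", "DET", "DET", "DET"]
    freq = counts_per_unique_token(corpus, corpus_tags)
    omit = omission_rates(["NUM", "NUM", "DET", "DET"], [False, False, True, False])
    cats = sorted(freq.keys() & omit.keys())
    print(spearman([freq[c] for c in cats], [omit[c] for c in cats]))
```

A negative correlation over categories would be consistent with the pattern reported for numbers. Extending the check to dialect tokens only changes where the category labels come from, which is exactly the part that depends on a suitable tagger.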
References
While we demonstrate a strong connection between the forgetting of numerical data and the occurrence of numbers in Mamba's training corpus, we are unable to establish a similar connection for all types of information loss, such as non-standard dialects. This is due to the lack of available taggers for dialects, in contrast to the well-established taggers for parts-of-speech.