Verify whether Jamrozik et al. (2020) supplementary materials were included in the model’s pretraining corpus

Ascertain whether the supplementary materials accompanying Jamrozik et al. (2020), which contain the legal pre-emption example, were included in the pretraining corpus of the Gemini models evaluated in this study. The answer determines whether the observed translation reflects reconstruction via pattern matching or retrieval of memorized text.

Background

In Section 2, the authors present a Jabberwockified passage that the model translates into content closely resembling a legal pre-emption example found in the supplementary materials of Jamrozik et al. (2020). They note that this resemblance could be due to the model reconstructing the text via learned patterns rather than retrieving memorized passages.

Because the training data of proprietary LLMs such as Gemini are not fully disclosed, the authors explicitly state that it is uncertain whether those specific supplementary materials were part of the model’s pretraining corpus. This uncertainty affects whether the result can be interpreted as retrieval or as reconstruction.
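Without access to the training data, one indirect check is to measure how much of the model's translation matches the supplementary text word for word. The following is a minimal sketch of such a check, not a method used by the authors. The function names, the whitespace tokenization, and the default n-gram length are all assumptions made for illustration. A high score would be suggestive of memorized text, while a low score alongside preserved meaning would be more consistent with reconstruction. Neither outcome settles whether the materials were in the pretraining corpus.

```python
def ngrams(text, n):
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(candidate, reference, n=5):
    """Fraction of the candidate's n-grams that also occur in the reference.

    Hypothetical helper: n=5 is an arbitrary illustrative choice, and
    whitespace tokenization ignores punctuation handling.
    """
    cand = ngrams(candidate, n)
    if not cand:
        # Candidate is shorter than n tokens, so there is nothing to compare.
        return 0.0
    ref = ngrams(reference, n)
    return len(cand & ref) / len(cand)
```

In use, `candidate` would hold the model's translation of the Jabberwockified passage and `reference` the corresponding passage from the supplementary materials. Both would have to be supplied by the reader, since neither text is reproduced here.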

References

We cannot know for certain whether these materials were included in the model’s pretraining.

— The unreasonable effectiveness of pattern matching (2601.11432 - Lupyan et al., 16 Jan 2026) in Section 2: From Jabberwocky to The Gostak; paragraph adjacent to the legal pre-emption example table (after the translation comparison)
