Determine the causes of cross-model repetition in open-ended LLM outputs
Determine the specific causes of the high semantic similarity and verbatim overlap observed across different large language models when they respond to open-ended queries. In particular, establish whether shared pretraining data pipelines across regions or contamination from synthetic data is responsible; the exact causes remain unclear because training details are proprietary.
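To make the phenomenon concrete, the sketch below quantifies both kinds of overlap for a pair of model responses to the same prompt. This is an illustrative measurement setup, not the paper's own methodology: the embedding model ("all-MiniLM-L6-v2"), the 5-gram window, and the ngram_jaccard helper are assumptions chosen for the example.

```python
# Minimal sketch: quantifying cross-model homogeneity between two model
# responses to the same open-ended prompt. The embedding model and the
# n-gram size are illustrative assumptions, not the paper's metrics.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def ngram_jaccard(a: str, b: str, n: int = 5) -> float:
    """Verbatim overlap: Jaccard similarity over word n-grams."""
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Responses from two different models to the same open-ended query
# (placeholder strings; in practice, collect these via each model's API).
resp_a = "Happiness often comes from meaningful relationships and purpose."
resp_b = "Happiness often comes from meaningful relationships and a sense of purpose."

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
emb = embedder.encode([resp_a, resp_b])
semantic_sim = cosine_similarity(emb[:1], emb[1:])[0, 0]

print(f"semantic similarity:     {semantic_sim:.3f}")
print(f"verbatim 5-gram overlap: {ngram_jaccard(resp_a, resp_b):.3f}")
```

High values on both axes across many prompts and model pairs would indicate the kind of inter-model repetition the problem statement targets; attributing it to shared data pipelines versus synthetic-data contamination remains the open question.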
References
Although the exact causes remain unclear due to proprietary training details, possible explanations include shared data pipelines across regions or contamination from synthetic data. We highlight the need for future work to rigorously investigate the sources of such cross-model repetition.
— Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
(Jiang et al., arXiv:2510.22954, 27 Oct 2025), in "Inter-model homogeneity", Section "Artificial Hivemind: Intra- and Inter-Model Homogeneity in LMs"