Determine the causes of cross-model repetition in open-ended LLM outputs

Determine the specific causes of the high semantic similarity and verbatim overlap observed across different large language models when they respond to open-ended queries. In particular, establish whether shared pretraining data pipelines across regions or contamination from synthetic data is responsible; the exact causes are currently unclear because training and alignment details are proprietary.

Background

The paper analyzes inter-model homogeneity across 25 major models using the Infinity-Chat dataset and finds that different models often produce strikingly similar responses to fully open-ended prompts, including instances of identical phrasing and high average pairwise embedding similarity. This convergence occurs across families (e.g., OpenAI GPT, Qwen, DeepSeek, Mistral) and is robust across prompts and sampling settings.
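The homogeneity measurement described above rests on average pairwise embedding similarity between model responses. As a minimal sketch of that idea (not the paper's exact metric; the embedding model, similarity function, and aggregation here are assumptions), one can embed each model's response and average cosine similarity over all unordered pairs:

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all unordered pairs of rows.

    `embeddings` is an (n_models, dim) array, one embedding per
    model response to the same open-ended prompt.
    """
    # Normalize each embedding to unit length so dot products are cosines.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                # full cosine-similarity matrix
    iu = np.triu_indices(len(embeddings), k=1)  # upper triangle, no diagonal
    return float(sims[iu].mean())

# Toy example: three hypothetical "model responses" embedded in 3-D.
# Two identical embeddings mimic the verbatim-overlap case.
vecs = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])
score = mean_pairwise_cosine(vecs)  # averages the three pairwise cosines
```

A higher score indicates stronger inter-model convergence; in this toy case the pairs contribute 1.0, 0.0, and 0.0, so the mean is 1/3.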

Despite extensive empirical evidence of this "Artificial Hivemind" effect, the authors explicitly state that the causes of such cross-model repetition remain unclear, potentially due to proprietary training and alignment pipelines. They suggest possible explanations (shared data pipelines or synthetic data contamination) and call for rigorous investigation to identify the actual sources of the observed convergence.

References

Although the exact causes remain unclear due to proprietary training details, possible explanations include shared data pipelines across regions or contamination from synthetic data. We highlight the need for future work to rigorously investigate the sources of such cross-model repetition.

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) (2510.22954 - Jiang et al., 27 Oct 2025) in Inter-model homogeneity, Section "Artificial Hivemind: Intra- and Inter-Model Homogeneity in LMs"