Explain bimodal similarity in non-CC-licensed Grokipedia articles

Determine the underlying cause(s) of the bimodal distribution observed in per-article average chunk cosine similarity between non-Creative-Commons-licensed Grokipedia articles and their corresponding English Wikipedia articles. In particular, rigorously evaluate whether article length or chunk position explains the higher-similarity mode.

Background

In the similarity analysis (Section 3.2), Triedman et al. report that non-CC-licensed Grokipedia entries exhibit a bimodal distribution of per-article average chunk cosine similarity to their English Wikipedia equivalents (mean ≈ 0.77), while CC-licensed entries concentrate at higher similarity (mean ≈ 0.90).
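To make the quantity concrete, here is a minimal sketch of a per-article average chunk cosine similarity. It assumes each article has already been split into chunks and embedded, and that Grokipedia and Wikipedia chunks have been paired row by row. The pairing scheme, the embedding model, and the function names are illustrative assumptions, not the paper's documented pipeline.

```python
import numpy as np

def chunk_cosine_similarities(grok_chunks, wiki_chunks):
    """Row-wise cosine similarity between paired chunk embeddings.

    Assumption: row i of `grok_chunks` is already matched to row i of
    `wiki_chunks`; how chunks are actually paired is not specified here.
    """
    g = np.asarray(grok_chunks, dtype=float)
    w = np.asarray(wiki_chunks, dtype=float)
    dots = np.sum(g * w, axis=1)
    norms = np.linalg.norm(g, axis=1) * np.linalg.norm(w, axis=1)
    return dots / norms

def article_average_similarity(grok_chunks, wiki_chunks):
    """Mean of the paired chunk similarities for one article."""
    return float(np.mean(chunk_cosine_similarities(grok_chunks, wiki_chunks)))

# Toy example: one identical chunk pair and one orthogonal pair.
grok = [[1.0, 0.0], [0.0, 1.0]]
wiki = [[1.0, 0.0], [1.0, 0.0]]
avg = article_average_similarity(grok, wiki)
```

A histogram of this per-article value over all non-CC-licensed entries is the distribution whose bimodality is in question.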

They note that chunk similarity tends to be highest in introductions and to decline with chunk position, and they speculate that shorter non-CC-licensed articles might account for the higher-similarity mode. However, the underlying reasons for the bimodality remain unknown.
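One way to probe the length and position hypothesis is sketched below, under stated assumptions. Articles are split into a high and a low group at a hypothetical threshold `split` chosen between the two modes. The groups are then compared on article length (chunk count) and on the mean similarity of only the first `k` chunks, which holds position roughly fixed. The function name, `split`, and `k` are all illustrative and do not come from the paper.

```python
import numpy as np

def length_and_position_diagnostics(per_article_sims, split, k=3):
    """Compare high- and low-similarity articles on length and head similarity.

    per_article_sims: list of sequences, each holding one article's chunk
        similarities in chunk order (introduction first).
    split: hypothetical threshold separating the two modes.
    k: number of leading chunks used for the position-controlled mean.
    Assumes both groups are non-empty.
    """
    means = np.array([np.mean(s) for s in per_article_sims])
    lengths = np.array([len(s) for s in per_article_sims])
    head_means = np.array([np.mean(s[:k]) for s in per_article_sims])
    high = means >= split
    return {
        "median_len_high": float(np.median(lengths[high])),
        "median_len_low": float(np.median(lengths[~high])),
        "head_mean_high": float(np.mean(head_means[high])),
        "head_mean_low": float(np.mean(head_means[~high])),
    }

# Toy data: two short articles and one long article whose similarity decays.
toy = [[0.9, 0.9], [0.95], [0.9, 0.5, 0.4, 0.3, 0.2, 0.2]]
report = length_and_position_diagnostics(toy, split=0.8, k=1)
```

If length and position were the main explanation, the high group would be noticeably shorter and the gap between `head_mean_high` and `head_mean_low` would be much smaller than the gap between the full per-article means. A persistent gap in head similarity would point to some other cause.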

References

We do not know precisely why the non-CC-licensed entry distribution shows bimodality, but speculate that the higher peak corresponds to shorter non-CC-licensed articles.

What did Elon change? A comprehensive analysis of Grokipedia (2511.09685 - Triedman et al., 12 Nov 2025) in Section 3.2 Similarity