Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) (2510.22954v1)
Abstract: Large language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. We introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., brainstorm & ideation) that further break down into 17 subcategories. Using Infinity-Chat, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation of LMs, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and, even more so, (2) inter-model homogeneity, where different models produce strikingly similar outputs. Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, Infinity-Chat presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research for mitigating long-term AI safety risks posed by the Artificial Hivemind.
Explain it Like I'm 14
What is this paper about?
This paper looks at how today’s AI language models often give very similar answers to creative, open-ended questions. The authors call this problem the “Artificial Hivemind,” because many different models start to “think” in the same way, like a single hive mind. They build a new dataset called Infinity-Chat to study this problem and to help future researchers make AI that stays diverse and more human-like in its ideas.
What questions did the researchers ask?
They focused on four simple questions:
- What kinds of open-ended questions do people actually ask AI?
 - Do single AI models keep repeating themselves on these open-ended questions?
 - Do different AI models end up giving very similar answers, even when there’s no single “correct” answer?
 - How well do AI “judges” and reward models match human opinions when people disagree or when multiple answers are equally good?
 
How did they study it?
Building a real-world question collection (Infinity-Chat)
The team collected 26,070 real, open-ended questions from a large public set of chatbot logs. “Open-ended” means there can be many reasonable answers with no single best one, like “Write a metaphor about time” or “Give ideas for a story.” They also kept 8,817 closed-ended questions for comparison.
Categorizing questions (a taxonomy)
They created a system (a taxonomy) to organize these open-ended questions:
- 6 big categories and 17 smaller subcategories, including creative writing, brainstorming, skill-building, explanations, analysis, and “what-if” scenarios.
 - They even discovered 314 new, underexplored types, showing how varied real user requests can be.
 
Testing many AI models and measuring similarity
They picked 100 representative open-ended questions (Infinity-Chat100) and had 70+ AI models (25 reported in detail) each generate 50 answers per question. Then they measured how similar the answers were.
How they measured similarity (a short code sketch follows this list):
- They turned each sentence into a list of numbers (called a “sentence embedding”). Think of it like turning a song into a “fingerprint” so you can compare two songs without listening to them.
 - If two answers have high similarity, they’re basically saying the same thing, even if the words are slightly different.
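Here is a minimal Python sketch of this idea (not the authors’ exact code). It assumes access to OpenAI’s text-embedding-3-small model, which the paper reports using; the example responses are invented.

```python
import numpy as np
from openai import OpenAI  # assumes an OPENAI_API_KEY is configured in the environment

client = OpenAI()

def embed(texts):
    """Fetch sentence embeddings ("fingerprints") for a list of responses."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of responses."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    return float(sims[np.triu_indices(len(sims), k=1)].mean())  # skip self-pairs

# Invented responses to "Write a metaphor about time"
responses = [
    "Time is a river that carries us forward.",
    "Time is a weaver, threading moments into a tapestry.",
    "Time is a river, always flowing, never pausing.",
]
print(mean_pairwise_cosine(embed(responses)))  # values near 1.0 mean near-duplicate answers
```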
 
How they encouraged variety (a sampling sketch follows this list):
- They used “top-p” and “temperature” settings, which are like turning up the “randomness” dial so the model tries more creative options.
 - They also tried “min-p sampling,” a technique designed to push models toward more varied, creative outputs.
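The sketch below shows, in plain NumPy rather than the paper’s code, how these “randomness dials” act on a model’s next-token probabilities; the logits values are made up for illustration.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, min_p=0.0, rng=None):
    """Illustrative temperature + top-p (nucleus) + min-p sampling for one step."""
    if rng is None:
        rng = np.random.default_rng()
    # Temperature: T > 1 flattens the distribution (more random), T < 1 sharpens it.
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    keep = np.ones_like(probs, dtype=bool)
    # Top-p: keep the smallest set of tokens whose cumulative probability exceeds p.
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        keep_top_p = np.zeros_like(keep)
        keep_top_p[order[:cutoff]] = True
        keep &= keep_top_p
    # Min-p: drop tokens whose probability is below min_p times the top probability.
    if min_p > 0.0:
        keep &= probs >= min_p * probs.max()

    filtered = np.where(keep, probs, 0.0)
    return rng.choice(len(probs), p=filtered / filtered.sum())

# Toy 5-token vocabulary with one dominant token.
logits = np.array([4.0, 2.0, 1.5, 0.5, -1.0])
print(sample_next_token(logits, temperature=1.2, top_p=0.9, min_p=0.05))
```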
 
Gathering lots of human opinions
They didn’t just rely on machines. They collected 31,250 human ratings:
- Absolute ratings: How good is this answer on a 1–5 scale?
 - Pairwise preferences: Given two answers, which one do you prefer?
 - Each item got 25 independent human ratings, which is a lot. This helps capture real differences in taste and avoids assuming everyone agrees.
 
They used “Shannon entropy” to measure disagreement. In plain terms, if votes are split, disagreement is high; if almost everyone picks the same option, disagreement is low.
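As a small illustration (the vote counts are invented, not from the paper), here is how Shannon entropy over 25 annotator votes separates consensus from disagreement:

```python
import numpy as np

def shannon_entropy(counts):
    """Entropy (in bits) of a vote distribution; higher means more disagreement."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

# Hypothetical splits of 25 annotators across a 1-5 rating scale
consensus = [0, 1, 2, 20, 2]   # nearly everyone picks "4"
contested = [5, 5, 5, 5, 5]    # votes spread evenly across all five ratings

print(shannon_entropy(consensus))  # low entropy: little disagreement
print(shannon_entropy(contested))  # maximum entropy for 5 options (log2(5) ≈ 2.32)
```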
Comparing model judges to humans
They checked how well different scoring systems matched human opinions:
- LM scores: An LLM’s internal “confidence,” often measured by how surprising an answer is to the model (perplexity; lower means less surprising).
 - Reward models: Special AI systems that give a score to an answer and are used to train other models.
 - LM judges: Large models prompted to rate answers using rules like “Helpfulness, Harmlessness, Honesty (HHH)” or overall quality.
 
They looked at whether these machine ratings stayed in sync with human ratings, especially when humans saw multiple answers as equally good or when humans strongly disagreed.
What did they find and why it matters?
Here are the key results:
- Single models repeat themselves: Even with high randomness settings, one model kept producing very similar answers to open-ended questions. For most cases, average similarity between answers from the same model was very high (often above 0.8 on a 0–1 scale).
 - Different models converge on the same ideas: Many models from different families gave strikingly similar answers. For example, when asked to “Write a metaphor about time,” answers clustered into just two big ideas: “time is a river” and “time is a weaver.”
 - Verbatim overlaps happen: Models sometimes produced near-identical phrases or even identical slogans for open-ended tasks, showing deep sameness across brands and sizes.
 - Humans don’t always agree—and that’s normal: With open-ended questions, people often prefer different answers. The data showed lots of disagreement, which is expected when creativity and taste are involved.
 - AI judges and reward models are less reliable when humans disagree or when answers are similarly good: When humans saw two answers as equally strong, model scoring often failed to reflect that. Correlations between machine scores and human ratings dropped in these tricky cases.
 
Why this is important:
- If AI keeps giving similar ideas, creativity can get stuck. People might see the same suggestions over and over, which could slowly make our thinking more uniform and less original.
 - Training models using one-size-fits-all “quality” scores can miss the fact that many different answers are valid. That can accidentally punish diversity.
 
What does this mean for the future?
- We need training methods and data that encourage true variety, not just minor word changes. Techniques should go beyond decoding tricks and address diversity during training.
 - AI judges and reward models should be designed to recognize multiple equally good answers and handle disagreement better. Instead of chasing a single “best” response, they should understand and support pluralism—different people liking different styles.
 - Infinity-Chat provides a benchmark and a roadmap: realistic open-ended prompts, a clear categorization system, and dense human ratings. Future researchers can use it to build safer, more expressive AI that sparks creativity rather than flattening it.
 
In short, this paper shows that many AI models behave like an “Artificial Hivemind”—they tend to think alike. The new Infinity-Chat dataset and analysis point the way toward AI that supports diverse ideas, matches real human preferences, and helps people be more creative, not less.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of concrete gaps the paper leaves unresolved, which future work can address to strengthen the evidence, broaden applicability, and develop mitigations.
- Dataset scope is limited to single-turn, English, non-toxic GPT-4 chats; unclear generalization to multi-turn conversations, other languages, code or multimodal prompts, and higher-risk content domains.
 - Open-endedness labeling and taxonomy construction rely heavily on GPT-4o with limited human verification (only Infinity-Chat100); systematic human validation of the full 26K query taxonomy and category assignments is missing.
 - Ambiguous queries were revised for clarity by GPT-4o; the effects of these model-mediated revisions on prompt diversity and downstream homogeneity are not quantified.
 - The taxonomy’s robustness (inter-annotator agreement, stability across annotator cohorts, and cross-lingual transfer) is not reported; criteria for merging, splitting, and hierarchies of the 314 “novel” categories are not operationalized.
 - Infinity-Chat mining from WildChat may inherit platform-specific and user demographic biases; no audit or correction is provided to ensure representativeness across populations, regions, and usage contexts.
 - Inter- and intra-model homogeneity analyses are conducted on a 100-query subset; it is unclear whether the same conclusions hold across the full 26K queries and across all taxonomy subcategories.
 - Similarity is measured via a single embedding model (OpenAI text-embedding-3-small); dependence on embedding choice, dimensionality reduction method (PCA), and thresholding is not tested against alternative metrics (e.g., multiple embedding families, topic/idea-level measures, semantic frames, lexical distinctness, diversity indices).
 - Reported average similarities (e.g., >0.8) lack calibration against task-specific baselines; it remains uncertain whether high embedding similarity reflects true idea-level convergence versus artifacts of length, style, or embedding saturation.
 - The paper illustrates verbatim overlap with qualitative examples but does not quantify phrase-level duplication rates (e.g., shared n-grams, templates) systematically across models and categories.
 - Causes of inter-model homogeneity are hypothesized (shared pretraining corpora, alignment priors, synthetic data contamination) but not disentangled with controlled experiments (e.g., training-data lineage audits, ablations of alignment steps, synthetic vs. human data proportions).
 - The impact of system prompts, chat templates, instruction tuning style, and alignment reward schemas on homogeneity is not isolated; controlled tests varying these factors are missing.
 - Decoding exploration is narrow (top-p, temperature, min-p); broader comparisons (top-k, repetition penalties, diverse beam search, sampling mixtures, anti-template constraints, stochastic prompting strategies) and parameter sweeps are not performed.
 - No analysis of category-specific sensitivity: which prompt types (e.g., brainstorm vs. speculative vs. skill development) exhibit more/less homogeneity and why.
 - Paraphrase invariance is illustrated but not systematically quantified; a controlled paraphrase suite with semantic equivalence checks and style perturbations is needed to assess prompt robustness.
 - The unique-model-count metric for top-N similarity clusters lacks statistical benchmarking against null models and single-model baselines; significance testing and confidence intervals are absent.
 - Human annotation cohort details (demographics, expertise, culture/language, preference profiles) are not reported; without diversity characterization, findings about “idiosyncratic preferences” are hard to interpret or personalize.
 - Average human ratings are used as consensus proxies; modeling multimodal preference distributions (e.g., mixture models, per-annotator latent factors) is not attempted and could better capture pluralism.
 - Correlation analyses (Pearson/Spearman) assume linear/monotonic relationships; calibration, ranking consistency (Kendall’s tau), reliability (e.g., ECE), and uncertainty-aware evaluations are not explored.
 - LM “scores” via perplexity are under-specified (which LM, tokenization, normalization) and may be poor proxies for quality in open-ended generation; alternative, validity-tested scoring frameworks are needed.
 - Reward models and LM judges show misalignment in similar-quality and high-disagreement cases, but no concrete training interventions (e.g., distributional RL, preference regularization, multi-objective optimization) are implemented or evaluated.
 - No longitudinal or user-level studies quantify the hypothesized homogenization effects on human creativity and ideation over time; empirical evidence of downstream societal impact remains absent.
 - Ensembles are cautioned but not tested; designing and evaluating ensembles with deliberately non-overlapping priors (orthogonal pretraining corpora, distinct alignment pipelines, cultural/linguistic diversity) is left open.
 - Retrieval-augmented generation, tool-use, and external knowledge integration are not assessed for their potential to mitigate homogeneity.
 - The quality–diversity trade-off is asserted but not quantified; frameworks to jointly optimize and evaluate novelty, coherence, and utility are missing.
 - Replicability risks exist due to closed APIs, changing model versions, and unknown training data; versioning, seeds, and reproducibility protocols are not documented.
 - Potential memorization is not tested; near-duplicate detection against public corpora and synthetic datasets could clarify the role of training-data overlap in observed homogeneity.
 - Safety implications are discussed conceptually but not empirically; how diversity interventions interact with harmlessness/honesty constraints (HHH) needs measurement and governance guidance.
 - Cross-lingual, cross-cultural, and low-resource settings are not studied; assessing whether homogeneity worsens or improves across languages and cultures is an open question.
 - Multi-turn contexts, user state, and personalization are out of scope; how conversational history and per-user profiles affect diversity and preference alignment remains unexplored.
 
Practical Applications
Overview
The paper introduces Infinity-Chat, a large-scale dataset of 26K real-world, open-ended queries and a taxonomy of query types, plus dense human annotations (31,250 labels with 25 raters per item). Using this dataset, the authors identify a pronounced Artificial Hivemind effect—both intra-model repetition and inter-model homogeneity—even under high-stochasticity and diversity-oriented decoding. The paper also shows that LMs, reward models, and LM judges are miscalibrated on alternative responses that humans rate similarly or that elicit high annotator disagreement. These contributions enable practical applications in evaluation, product design, training, policy, and user workflows.
Below are actionable applications grouped by time horizon.
Immediate Applications
The following applications can be deployed now using existing tools, the Infinity-Chat dataset, and standard evaluation methods described in the paper.
- [Software, Platforms] Diversity Audit and Monitoring Toolkit for LLM Products
- Samples multiple outputs per open-ended prompt (e.g., 50 per prompt from Infinity-Chat100).
 - Computes pairwise sentence embedding similarities (as in the paper; e.g., using text-embedding-3-small).
 - Reports metrics such as Intra-Model Similarity, Inter-Model Similarity, Top-N Unique Model Counts, and cluster analysis (e.g., PCA + clustering); a metric sketch follows this list.
 - Integrates into CI to gate releases when diversity drops below thresholds.
 - Assumptions/Dependencies: Access to the dataset and embeddings API; moderate compute; agreement on similarity thresholds.
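A possible starting point for such a toolkit is sketched below, assuming response embeddings have already been computed (e.g., via the embeddings API); the 0.8 gate threshold is an illustrative assumption, not a value the paper prescribes.

```python
import numpy as np

def intra_model_similarity(embeddings):
    """Mean pairwise cosine similarity among one model's samples for one prompt."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    return float(sims[np.triu_indices(len(sims), k=1)].mean())

def inter_model_similarity(emb_a, emb_b):
    """Mean cosine similarity between samples drawn from two different models."""
    na = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    nb = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float((na @ nb.T).mean())

def ci_gate(per_prompt_scores, threshold=0.8):
    """Hypothetical release gate: fail when average similarity exceeds the threshold."""
    return float(np.mean(per_prompt_scores)) <= threshold

# Toy data: 50 samples per model with random vectors standing in for real embeddings
rng = np.random.default_rng(0)
model_a, model_b = rng.normal(size=(50, 1536)), rng.normal(size=(50, 1536))
print(intra_model_similarity(model_a), inter_model_similarity(model_a, model_b))
print("gate passed:", ci_gate([intra_model_similarity(model_a)]))
```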
 
 - [Marketing, Creative Industries, Education] “Diversity Slider” in Generation UIs: Add a user-facing control that toggles decoding regimes tuned for open-ended tasks (e.g., higher temperature, top-p, min-p), with warnings about coherence trade-offs, aligning with the paper’s findings on decoding. Assumptions/Dependencies: Product teams can switch decoding strategies; min-p implemented or proxied.
 - [Consumer Apps, Education, Publishing] Multi-Voice Response Mode
- Over-sampling candidate responses.
 - Clustering and de-duplicating high-similarity outputs (see the sketch after this list).
 - Labeling options using the taxonomy (e.g., “speculative scenario,” “alternative writing genre”).
 - Assumptions/Dependencies: Access to over-sampling, embeddings, clustering; UX to display alternatives clearly.
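One way to sketch the cluster-and-deduplicate step is shown below, using scikit-learn (recent versions, where AgglomerativeClustering takes a metric argument); the distance threshold and the random stand-in embeddings are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def pick_diverse_responses(embeddings, responses, distance_threshold=0.2):
    """Group near-duplicate candidates and keep one representative per cluster."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Cosine distance = 1 - cosine similarity; anything closer than the threshold is merged.
    labels = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=distance_threshold,
    ).fit(normed).labels_
    representatives = {}
    for idx, label in enumerate(labels):
        representatives.setdefault(label, idx)  # first member stands in for its cluster
    return [responses[i] for i in representatives.values()]

# Toy example: random vectors standing in for real response embeddings
rng = np.random.default_rng(1)
candidates = [f"candidate {i}" for i in range(20)]
print(pick_diverse_responses(rng.normal(size=(20, 1536)), candidates))
```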
 
 - [Enterprise, Procurement, Policy] Vendor Selection and RFP Criteria Using Diversity Metrics
- A “Response Similarity Index” across models.
 - Instance-level overlap reports (verbatim/semantic).
 - Preference entropy coverage (how often models match human variability).
 - Assumptions/Dependencies: Vendor API access; sampling budgets; internal audit standards.
 
 - [ML Research, Platform Quality] Reward Model and LM Judge Calibration Benchmarking
- Compute correlations (Spearman/Pearson) between model scores and average human ratings, especially on similar-quality and high-disagreement subsets (a correlation sketch follows this list).
 - Calibrate judges (e.g., scale smoothing on contested items; score aggregation) to better reflect human variability.
 - Assumptions/Dependencies: Availability of reward models/judges and their scoring APIs; internal benchmarking infrastructure.
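A minimal sketch of the correlation check using SciPy follows; the human and model scores are placeholders, not data from the paper.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder per-response scores: average of 25 human ratings vs. a reward model's score
human_avg = np.array([4.2, 3.8, 4.6, 2.9, 4.1, 3.3, 4.5, 3.9])
reward_model = np.array([0.71, 0.69, 0.74, 0.40, 0.70, 0.55, 0.73, 0.70])

r_pearson, _ = pearsonr(human_avg, reward_model)
r_spearman, _ = spearmanr(human_avg, reward_model)
print("Pearson:", r_pearson, "Spearman:", r_spearman)

# In practice, repeat the computation on the similar-quality and high-disagreement
# subsets to see whether the correlation drops there, as the paper reports.
```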
 
 - [Data Engineering, Model Training] Synthetic Data Hygiene Checks
- Detect over-represented phrases and common tropes (e.g., “time is a river”).
 - Filter or down-weight repetitive samples before inclusion in training or RLHF data.
 - Assumptions/Dependencies: Access to data pipelines; willingness to enforce deduplication and diversity thresholds.
 
 - [Product Analytics] A/B Testing Protocols for Creative Features
- Use Tukey’s fences (as in the paper; sketched after this list) to focus on similar-quality cases.
 - Track shifts in intra/inter-model similarity and user satisfaction.
 - Assumptions/Dependencies: Logging, analytics, prompt management; alignment with product KPIs.
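A small illustration of the Tukey’s fences filter is given below; the ratings are invented and k = 1.5 is the conventional fence multiplier.

```python
import numpy as np

def tukey_inliers_mask(scores, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]; values outside are outliers."""
    scores = np.asarray(scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    return (scores >= q1 - k * iqr) & (scores <= q3 + k * iqr)

# Hypothetical human quality ratings for a batch of candidate responses
ratings = np.array([4.1, 4.0, 3.9, 4.2, 1.2, 4.0, 4.8, 4.1])
print(ratings[tukey_inliers_mask(ratings)])  # "similar-quality" cases kept for the A/B comparison
```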
 
 - [Education, Daily Life] Anti-Hivemind Prompting Guidance
- Example: “Provide five novel metaphors for time that explicitly avoid ‘river’ and ‘weaver’ motifs.”
 - Assumptions/Dependencies: None beyond basic prompt literacy; portable across tools.
 
 - [Governance, Platform Policy] Diversity Impact Statements (Pilot)
- Prompts used, diversity scores, observed inter-model overlap.
 - Steps taken to mitigate homogenization (e.g., multi-voice mode).
 - Assumptions/Dependencies: Voluntary compliance; internal audit capability; no formal standard yet.
 
 
Long-Term Applications
These applications require further research, scaling, or development (e.g., new training regimes, standards, or governance).
- [ML Training] Pluralistic Alignment with Distributional Labels
- Disagreement-aware loss functions.
 - Diversity regularization in RLHF/DPO.
 - Better handling of equally high-quality alternatives.
 - Assumptions/Dependencies: Significant compute; method development; careful evaluation to avoid penalizing quality.
 
 - [Software, Infrastructure] Diversity-Optimized Decoding and Routing
- Dynamically select models/prompts to maximize semantic variety for open-ended tasks.
 - Monitor real-time similarity to avoid convergence.
 - Assumptions/Dependencies: New algorithms; multi-model access; production-safe coherence controls.
 
 - [AI Platforms] True-Diversity Ensembles
- Calibrate ensemble voting/aggregation to prioritize distinct ideas, not just paraphrases.
 - Assumptions/Dependencies: Access to diverse models; coordination with model providers; ensemble reliability and cost.
 
 - [Data Governance] Training Corpus Reforms and Provenance Standards
- Mandate diversity audits for large-scale data mixtures.
 - Track contamination from synthetic outputs.
 - Assumptions/Dependencies: Industry-wide tooling; shared metadata standards; potential policy support.
 
 - [Healthcare, Finance, Law, Policy] Diversity-of-Opinions Decision Support Modules
- Require “preference entropy thresholds” to ensure meaningful variety in clinical or risk analysis narratives.
 - Assumptions/Dependencies: Domain oversight; safety testing; guardrails to avoid confusion or misinformation.
 
 - [Standards, Certification] Generative Diversity Index and Auditing Frameworks
- Define metrics (Response Similarity Index, Preference Entropy Alignment).
 - Certify models/apps against diversity benchmarks derived from Infinity-Chat.
 - Assumptions/Dependencies: Multi-stakeholder consensus; third-party auditors; interoperability.
 
 - [Personalization, Consumer Apps] Individual-Level Pluralism
- Offer personalized responses while still presenting diverse alternatives.
 - Respect privacy and prevent stereotype-based shortcuts.
 - Assumptions/Dependencies: Consent, memory, privacy safeguards; careful evaluation of bias and overfitting.
 
 - [Multilingual, Multimodal AI] Cross-Lingual and Cross-Modal Expansion: Extend Infinity-Chat’s taxonomy and annotation approach to other languages and modalities (images, audio, video) to measure hivemind effects beyond English text. Assumptions/Dependencies: New data collection; culturally aware categories; multilingual annotators.
 - [AI Research] Causal Analysis of Inter-Model Homogeneity: Investigate sources of overlap (shared pretraining data, alignment processes, synthetic contamination) and design interventions that demonstrably reduce convergence without sacrificing quality. Assumptions/Dependencies: Access to training details (often proprietary); collaboration with model providers; reproducible methodologies.
 
Glossary
- Absolute ratings: Numeric assessments of the quality of a single response to a query. "Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example."
 - Artificial Hivemind: A phenomenon where LLMs converge to similar outputs across and within models, reducing diversity. "We uncover a pronounced Artificial Hivemind effect: (1) intra-model repetition, where a single model repeatedly generates similar outputs, and, more critically, (2) inter-model homogeneity, where different models independently converge on similar ideas with minor variations in phrasing."
 - Ensemble methods: Techniques that combine multiple models to improve performance or diversity of outputs. "While ensemble methods or model “swarms” have been proposed to enhance diversity"
 - HHH rubric: A three-part evaluation rubric—Helpfulness, Harmlessness, Honesty—used by LM judges to rate responses. "LM judge ratings follow standard prompting protocols using two rubrics: an overall quality score and the HHH rubric (Helpfulness, Harmlessness, Honesty)"
 - Inter-model homogeneity: Similarity of outputs produced by different models for the same task or prompt. "inter-model homogeneity, where different models produce strikingly similar outputs"
 - Intra-model repetition: The tendency of a single model to generate highly similar responses when sampled multiple times. "intra-model repetition, where a single model consistently generates similar responses"
 - IQR (Interquartile Range): The range between the 25th and 75th percentiles, used to assess spread and detect outliers. "IQR = Q3 - Q1"
 - LM judge: An LLM used to evaluate and score other models’ outputs according to specified rubrics. "our findings show that state-of-the-art LMs, reward models, and LM judges are less well calibrated to human ratings"
 - Min-p decoding: A dynamic sampling strategy that adjusts the minimum probability threshold to enhance diversity in generation. "Recent work introduces min-p decoding, a dynamic strategy for enhancing generation diversity that adjusts the sampling threshold based on model confidence."
 - Mode collapse: Failure of a generative model to produce diverse outputs, instead concentrating on a few modes. "we present a large-scale study of mode collapse in LMs"
 - Pairwise preferences: Annotations where evaluators indicate which of two responses is preferred (including ties or strength of preference). "Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences"
 - Pearson correlations: A measure of linear association between two variables’ scores. "We then compute Pearson correlations between model and human absolute ratings on the full set"
 - Perplexity: A measure of a language model’s uncertainty; lower perplexity indicates better predictive performance. "LM scores are derived from response perplexity given the query."
 - Pluralistic alignment: Aligning models to accommodate diverse, potentially conflicting human preferences and values. "advancing pluralistic alignment of LMs"
 - Principal Component Analysis (PCA): A dimensionality reduction technique that projects data onto principal axes capturing maximal variance. "clustered by applying PCA to reduce sentence embeddings to two dimensions."
 - Reward model: A model that produces a scalar score indicating the quality or desirability of a response. "state-of-the-art LMs, reward models, and LM judges"
 - Sentence embeddings: Vector representations of sentences that capture semantic content for similarity and clustering. "Sentence embeddings from OpenAI’s text-embedding-3-small API are used."
 - Shannon entropy: An information-theoretic measure of uncertainty or disagreement in a distribution of labels or ratings. "The left histogram shows the distribution of Shannon entropy across the 25 human annotations"
 - Spearman's correlation coefficient: A rank-based correlation measure assessing monotonic relationships between variables. "We compute Spearman's correlation coefficients between human-annotated and model-generated absolute rating scores"
 - Temperature: A decoding parameter that controls randomness in sampling; higher values increase diversity at the risk of incoherence.
 - Top-p sampling: Nucleus sampling that selects from the smallest set of tokens whose cumulative probability exceeds p.
 - Tukey’s fences: An outlier detection method using quartiles and IQR to filter extreme values, defining outliers as points beyond Q1 - 1.5 × IQR or Q3 + 1.5 × IQR.
 