
Correlated Errors in Large Language Models (2506.07962v1)

Published 9 Jun 2025 in cs.CL, cs.AI, cs.CY, and stat.ML

Abstract: Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors -- on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring -- the latter reflecting theoretical predictions regarding algorithmic monoculture.

Summary

  • The paper reveals that LLMs exhibit highly correlated errors, with model pairs giving the same wrong answer 60% of the time on the HELM leaderboard.
  • It employs a large-scale empirical evaluation of over 350 models drawn from two popular leaderboards and a real-world resume-screening task to measure error agreement.
  • The study highlights risks in high-stakes applications such as hiring and LLM-as-judge evaluation, stressing the need to mitigate systemic biases.

Correlated Errors in LLMs: An Examination

The paper "Correlated Errors in LLMs" addresses a significant gap in our empirical understanding of the diversity among LLMs. Despite the common assumption that diversity in training data, architecture, and providers might result in a heterogenous ecosystem of LLMs, this investigation reveals a propensity for these models to exhibit highly correlated errors.

Empirical Evaluation of LLMs

The authors conduct a comprehensive empirical study drawing on a wide range of LLMs: responses from 349 models on the HuggingFace leaderboard, 71 models on the HELM leaderboard (Liang et al., 2023), and 20 models evaluated on 1,800 resume-job description pairs. The study measures the rate at which two models agree when both give incorrect responses, and finds substantial correlation across all datasets. On the HELM leaderboard, for instance, model pairs give the same wrong answer 60% of the time when both err, far exceeding what would be expected by chance.
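To make the agreement statistic concrete, the sketch below shows one way to compute the conditional error-agreement rate between model pairs. It is illustrative rather than the authors' code: the data layout (per-item predictions plus gold labels) and the function names are assumptions.

```python
from itertools import combinations

def error_agreement(answers_a, answers_b, gold):
    """Fraction of items on which two models give the same (wrong) answer,
    among the items where both models are wrong."""
    both_wrong = [(a, b) for a, b, g in zip(answers_a, answers_b, gold)
                  if a != g and b != g]
    if not both_wrong:
        return float("nan")
    return sum(a == b for a, b in both_wrong) / len(both_wrong)

def pairwise_agreements(responses, gold):
    """responses: dict mapping model name -> list of predictions aligned with gold."""
    return {(m1, m2): error_agreement(responses[m1], responses[m2], gold)
            for m1, m2 in combinations(responses, 2)}
```

Values near the chance level (which depends on the number of answer options per item) would indicate independent mistakes; the 60% figure on HELM sits far above that.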

Sources and Implications of Model Correlation

The research identifies several factors contributing to this correlation: shared providers, shared architectures, and similar model sizes all tend to yield higher error correlation. Crucially, the paper finds that even larger, more accurate models exhibit such correlations, indicating convergence in error patterns despite distinct architectures and providers. This has tangible implications in practical contexts such as LLM-as-judge systems and hiring pipelines. In the latter case, high correlation among models can lead to systemic exclusion of certain candidates, since firms may uniformly rely on models that share the same biases and errors.
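As a rough illustration of how such factors can be examined, the sketch below reuses the pairwise agreement dictionary from the earlier snippet and compares average error agreement for same-provider versus cross-provider model pairs. The `provider` mapping is an assumed input, and the same pattern applies to architecture or size buckets.

```python
from statistics import mean

def provider_effect(agreements, provider):
    """agreements: dict (model_a, model_b) -> error-agreement rate.
       provider:   dict model -> provider name.
       Returns mean agreement for same-provider and cross-provider pairs."""
    same = [r for (a, b), r in agreements.items() if provider[a] == provider[b]]
    cross = [r for (a, b), r in agreements.items() if provider[a] != provider[b]]
    return mean(same), mean(cross)
```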

Downstream Effects in High-Stakes Applications

The downstream effects of these error correlations are analyzed in two scenarios: LLM-as-judge evaluation and hiring. In LLM-as-judge setups, where one model evaluates the outputs of others, judges tend to overestimate the accuracy of models less accurate than themselves, with the largest inflation going to models from the same provider or with the same architecture. This bias undermines the reliability of LLM-based accuracy assessments and highlights the risk of self-preferencing when related models are used for evaluation.
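A simple way to quantify this effect is sketched below under assumed inputs (a judge's per-model accuracy scores and ground-truth accuracies); none of these names come from the paper.

```python
from statistics import mean

def judge_inflation(judge, judge_scores, true_accuracy, provider):
    """judge_scores:  dict evaluated model -> accuracy as scored by `judge`.
       true_accuracy: dict evaluated model -> ground-truth accuracy.
       Returns mean (judged - true) gap for same-provider vs. other models."""
    same, other = [], []
    for model, scored in judge_scores.items():
        gap = scored - true_accuracy[model]  # positive gap = the judge inflates
        (same if provider[model] == provider[judge] else other).append(gap)
    return mean(same), mean(other)
```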

In the context of hiring, the investigation models labor markets in which firms screen candidates using the same LLM, different LLMs from the same provider, or models chosen at random from the available pool. The findings indicate that relying on models with correlated errors, such as those from the same provider, reinforces systemic exclusion, whereas more diverse model usage mitigates this risk, though inaccuracies remain.
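The toy simulation below captures the spirit of this comparison under strong simplifying assumptions (models from the same provider err on exactly the same applicants, models from different providers err independently); it is not the authors' simulation, and all names and parameters are illustrative.

```python
import random

def exclusion_rate(providers, error_rate, n_firms, n_applicants, monoculture, seed=0):
    """Share of qualified applicants rejected by every firm in the market."""
    rng = random.Random(seed)
    unique_providers = set(providers)
    excluded = 0
    for _ in range(n_applicants):
        # One latent draw per provider: models sharing a provider share errors.
        provider_errs = {p: rng.random() < error_rate for p in unique_providers}
        if monoculture:
            firm_providers = [providers[0]] * n_firms
        else:
            firm_providers = [rng.choice(providers) for _ in range(n_firms)]
        if all(provider_errs[p] for p in firm_providers):
            excluded += 1
    return excluded / n_applicants

providers = ["A", "B", "C"]
print(exclusion_rate(providers, 0.2, n_firms=5, n_applicants=10_000, monoculture=True))   # about 0.20
print(exclusion_rate(providers, 0.2, n_firms=5, n_applicants=10_000, monoculture=False))  # much lower
```

Even with identical individual error rates, concentrating the market on one provider concentrates the errors on the same applicants.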

Theoretical and Future Directions

The implications of this paper extend beyond immediate practical applications, prompting theoretical considerations about algorithmic monoculture in decision-making systems. While algorithmic diversity could theoretically enhance robustness by leveraging "wisdom of crowds" effects, the findings suggest such diversity is insufficient if model errors remain correlated.

Future research may focus on strategies for mitigating these correlations, such as developing metrics for selecting non-correlated model subsets or engineering more diverse training paradigms. These efforts could enhance the multi-agent ecosystems in which LLMs operate, ensuring greater independence and reliability of model outputs.
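One way such a selection metric could look in practice is sketched below: a greedy routine that builds a subset of k models with low pairwise error agreement, reusing the agreement dictionary from the first snippet. This is an illustrative heuristic, not a procedure proposed in the paper.

```python
def pick_decorrelated_subset(models, agreements, k):
    """Greedily pick k models whose pairwise error agreement is low."""
    def rate(a, b):
        return agreements.get((a, b), agreements.get((b, a), 0.0))

    chosen = [models[0]]  # seed with an arbitrary model
    while len(chosen) < k:
        remaining = [m for m in models if m not in chosen]
        # Add the model with the lowest total agreement with the current subset.
        chosen.append(min(remaining, key=lambda m: sum(rate(m, c) for c in chosen)))
    return chosen
```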

In conclusion, this paper provides critical insights into the correlations present in LLM outputs, demonstrating that despite the increasing variety in model architecture and provenance, the errors among these models remain substantially aligned. Addressing this issue is essential to fully realize the potential of diverse multi-model ecosystems.
