- The paper reveals that LLMs exhibit highly correlated errors: on the HELM leaderboard, model pairs give the same wrong answer about 60% of the time when both err.
- It draws on a comprehensive empirical evaluation of more than 440 models from multiple leaderboards, plus a real-world resume-screening task, to measure error agreement.
- The study highlights risks in high-stakes applications, such as hiring screens and LLM-as-judge evaluation, stressing the need to mitigate systemic biases.
Correlated Errors in LLMs: An Examination
The paper "Correlated Errors in LLMs" addresses a significant gap in our empirical understanding of the diversity among LLMs. Despite the common assumption that diversity in training data, architecture, and providers might result in a heterogenous ecosystem of LLMs, this investigation reveals a propensity for these models to exhibit highly correlated errors.
Empirical Evaluation of LLMs
The authors conduct a comprehensive empirical study using responses from a wide range of LLMs: 349 models from the HuggingFace leaderboard, 71 models from the HELM leaderboard \cite{liang2023holistic}, and 20 models evaluated on 1,800 resume-job description pairs. The analysis centers on how often two models give the same answer on instances where both are incorrect, and it finds substantial correlation across all datasets. On the HELM leaderboard, for instance, model pairs agree on the same wrong answer 60% of the time, far more often than independent errors would predict.
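To make the metric concrete, here is a minimal sketch of the conditional error-agreement rate: among items where two models are both wrong, how often do they produce the same wrong answer? The data layout (per-item model answers plus gold labels) is assumed for illustration and is not taken from the paper's released code.

```python
from itertools import combinations

def error_agreement(answers_a, answers_b, gold):
    """Among items where both models are wrong, the fraction on which
    they give the same wrong answer."""
    both_wrong = [(a, b) for a, b, g in zip(answers_a, answers_b, gold)
                  if a != g and b != g]
    if not both_wrong:
        return float("nan")
    return sum(a == b for a, b in both_wrong) / len(both_wrong)

def pairwise_error_agreement(model_answers, gold):
    """Compute the metric for every pair of models.

    `model_answers` maps a model name to its list of per-item answers.
    """
    return {
        (m1, m2): error_agreement(model_answers[m1], model_answers[m2], gold)
        for m1, m2 in combinations(model_answers, 2)
    }
```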
Sources and Implications of Model Correlation
The research identifies several factors that contribute to this correlation: shared providers, shared architectures, and similar model sizes all tend to yield higher error correlation. Crucially, the paper finds that even newer, more accurate models exhibit these correlations, indicating convergence in error patterns despite divergent architectures and providers. This has tangible implications in practical contexts such as LLM-as-judge systems and hiring pipelines. In the latter case, high correlation among models could lead to systemic exclusion of certain candidates, since firms relying on different models may still share the same biases and errors.
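As a rough illustration of the provider analysis, one can take the pairwise scores from the sketch above and compare same-provider pairs against cross-provider pairs; the `provider_of` metadata mapping is an assumption introduced here, not part of the paper's code.

```python
from statistics import mean

def agreement_by_provider(pair_agreement, provider_of):
    """Split pairwise error-agreement scores by whether the two models
    share a provider, and return the mean of each group."""
    same, diff = [], []
    for (m1, m2), rate in pair_agreement.items():
        if rate != rate:          # skip NaN entries (pairs never wrong together)
            continue
        (same if provider_of[m1] == provider_of[m2] else diff).append(rate)
    return mean(same), mean(diff)
```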
Downstream Effects in High-Stakes Applications
The downstream effects of these error correlations are analyzed in two scenarios: LLM-as-judge and hiring tasks. In LLM-as-judge setups, where one model evaluates the outputs of others, judges tend to inflate the accuracy of models less accurate than themselves, with the largest boosts going to models from the same provider or with the same architecture. This bias undermines the reliability of LLM-based accuracy assessments and highlights the risk of self-preferencing when related models are used for evaluation.
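The mechanism can be seen in a toy simulation (entirely illustrative; the accuracies and shared-error probability below are invented parameters, not the paper's setup): a judge that effectively checks answers against its own gives credit for any mistake it happens to share with the evaluated model, so correlation alone raises the judged score even though the model's true accuracy is unchanged.

```python
import random

def judged_accuracy(n=100_000, judge_acc=0.8, model_acc=0.6, shared_err=0.0):
    """Score the model as 'correct' whenever its answer matches the judge's."""
    score = 0
    for _ in range(n):
        judge_right = random.random() < judge_acc
        model_right = random.random() < model_acc
        if judge_right and model_right:
            score += 1                                 # both give the right answer
        elif not judge_right and not model_right:
            score += random.random() < shared_err      # they share the same mistake
    return score / n

print("independent errors:", judged_accuracy(shared_err=0.0))   # ~0.48
print("correlated errors: ", judged_accuracy(shared_err=0.6))   # ~0.53
```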
In the hiring context, the investigation models labor markets in which firms screen candidates using either the same LLM, different LLMs from the same provider, or models chosen at random from those available. The findings indicate that relying on models with correlated errors, such as models from the same provider, reinforces systemic exclusion of particular candidates, whereas more diverse model usage mitigates, though does not eliminate, this risk.
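A hedged sketch of that mechanism follows; the qualification rate and screening error rate are invented for illustration, not drawn from the paper's simulation. When every firm screens with the same model, one shared mistake locks a qualified candidate out of the entire market, whereas independent errors rarely align across all firms.

```python
import random

def rejected_everywhere(n_candidates=10_000, n_firms=5,
                        error_rate=0.2, shared_model=True):
    """Fraction of candidates who are qualified yet rejected by every firm."""
    shut_out = 0
    for _ in range(n_candidates):
        qualified = random.random() < 0.5
        if shared_model:
            err = random.random() < error_rate           # one error, applied by all firms
            decisions = [qualified != err] * n_firms
        else:
            decisions = [qualified != (random.random() < error_rate)
                         for _ in range(n_firms)]        # each firm errs independently
        if qualified and not any(decisions):
            shut_out += 1
    return shut_out / n_candidates

print("same model for all firms:", rejected_everywhere(shared_model=True))    # ~0.10
print("independent models:      ", rejected_everywhere(shared_model=False))   # ~0.0002
```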
Theoretical and Future Directions
The implications of this paper extend beyond immediate practical applications, prompting theoretical considerations about algorithmic monoculture in decision-making systems. While algorithmic diversity could theoretically enhance robustness by leveraging "wisdom of crowds" effects, the findings suggest such diversity is insufficient if model errors remain correlated.
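A small, assumption-laden simulation makes the point: majority voting over several models lifts accuracy substantially when errors are independent, but the gain largely disappears when models frequently copy a shared answer. The copying probability `rho` is an invented stand-in for error correlation, not a quantity from the paper.

```python
import random

def ensemble_accuracy(n_items=20_000, n_models=5, model_acc=0.7, rho=0.0):
    """Accuracy of a majority vote; with probability rho a model copies a
    single shared draw instead of answering independently."""
    correct = 0
    for _ in range(n_items):
        shared_right = random.random() < model_acc
        votes = sum(shared_right if random.random() < rho
                    else random.random() < model_acc
                    for _ in range(n_models))
        correct += votes > n_models / 2
    return correct / n_items

print("independent errors:", ensemble_accuracy(rho=0.0))   # ~0.84, crowd wisdom helps
print("correlated errors: ", ensemble_accuracy(rho=0.9))   # ~0.70, gains mostly vanish
```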
Future research may focus on strategies for mitigating these correlations, such as developing metrics for selecting subsets of models with uncorrelated errors or engineering more diverse training paradigms. Such efforts could strengthen the multi-agent ecosystems in which LLMs operate by improving the independence and reliability of model outputs.
In conclusion, this paper provides critical insights into the correlations present in LLM outputs, demonstrating that despite the increasing variety in model architecture and provenance, the errors among these models remain substantially aligned. Addressing this issue is essential to fully realize the potential of diverse multi-model ecosystems.