- The paper demonstrates that lower-performing code models significantly overestimate their accuracy, especially in low-resource programming languages.
- It employs both absolute and relative confidence metrics to rigorously quantify miscalibration, observing correlations as strong as 0.797 between overconfidence and domain rarity.
- The study emphasizes the need for enhanced calibration in LLMs to improve trust and reliability in human-AI collaborative coding tasks.
Evidence and Analysis of the Dunning-Kruger Effect in Code LLMs
Introduction
The paper "Do Code Models Suffer from the Dunning-Kruger Effect?" (2510.05457) presents a formal empirical investigation into whether LLMs exhibit the Dunning-Kruger Effect (DKE) during programming tasks. DKE, characterized by inflated self-assessment in domains where competence is low, has broad implications for calibration, interpretability, and trust in human-AI collaboration. The study situates this inquiry within the technical domain of code generation and question answering, employing both absolute and relative confidence metrics to dissect model calibration across a spectrum of programming languages, including rare low-resource domains.
Methods
The authors operationalize competence as accuracy on multiple-choice question answering (MCQA) tasks and perceived performance as confidence, measured both as absolute scores and as relative rankings via Elo and TrueSkill. Absolute confidence is elicited directly from the model per task; relative confidence is assessed via pairwise preference prompts and then aggregated into ratings (a sketch of this aggregation follows the list below). The analysis spans two axes:
- Inter-model (Across Models): Comparison of overconfidence between models of varying base performance.
- Intra-model (Across Domains): Examination of calibration per programming language domain for each model.
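The paper does not publish its aggregation code, but the mechanics of turning pairwise preference prompts into a relative-confidence ranking can be sketched with the standard Elo update (TrueSkill works analogously, with Gaussian skill estimates). This is a minimal sketch, not the authors' implementation; all identifiers and the sample preferences are illustrative:

```python
from collections import defaultdict

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update. score_a is 1.0 if item A was preferred,
    0.0 if item B was, 0.5 for a tie. Returns updated ratings."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

def aggregate_relative_confidence(preferences, initial=1000.0):
    """preferences: iterable of (item_a, item_b, score_a) tuples collected
    from pairwise preference prompts. Returns an Elo rating per item,
    interpretable as the model's relative-confidence ranking."""
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in preferences:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
    return dict(ratings)

# Hypothetical judgments: the model preferred its Python answers over
# its COBOL answers, and tied Python against Java once.
prefs = [("python_q1", "cobol_q1", 1.0),
         ("python_q2", "cobol_q2", 1.0),
         ("python_q3", "java_q1", 0.5)]
print(aggregate_relative_confidence(prefs))
```

The appeal of rating-based aggregation is that it converts noisy, local pairwise judgments into a single comparable scale per task or domain, which can then be set against measured accuracy.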
Statistical correlations (Spearman, Pearson, Kendall) quantify the DKE, defined here as the magnitude by which confidence exceeds accuracy in low-competence regimes. Domain rarity is incorporated via established language-popularity rankings (GitHub, IEEE, TIOBE).
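A minimal sketch of this correlation analysis in Python, using scipy. The per-domain scores below are placeholders, not the paper's data; only the structure of the computation is intended to match the described method:

```python
import numpy as np
from scipy import stats

def dke_correlations(accuracy, confidence):
    """Correlate per-domain overconfidence (confidence - accuracy) with
    actual competence. A strong negative correlation is the DKE
    signature: the lower the accuracy, the larger the overestimation."""
    accuracy = np.asarray(accuracy, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    gap = confidence - accuracy
    return {
        "spearman": stats.spearmanr(accuracy, gap),
        "pearson": stats.pearsonr(accuracy, gap),
        "kendall": stats.kendalltau(accuracy, gap),
    }

# Placeholder per-domain values (fraction correct / mean stated
# confidence); real values would come from the MCQA benchmark runs.
acc  = [0.85, 0.70, 0.55, 0.40, 0.25]   # e.g. Python ... COBOL
conf = [0.88, 0.80, 0.75, 0.72, 0.70]
for name, (r, p) in dke_correlations(acc, conf).items():
    print(f"{name}: r={r:.3f}, p={p:.3g}")
```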
Results
DKE Manifestation Across Models and Domains
The findings demonstrate that LLMs consistently exhibit substantial DKE-like miscalibration, particularly models with lower base performance and in less familiar domains. The effect holds across both absolute and relative confidence measures, with pronounced gaps in rare programming languages such as COBOL, Prolog, and Ceylon. Higher-performing models (e.g., GPT-4o) display better calibration overall but still grow overconfident in domains where their actual performance drops.
Quantitatively:
- Inter-model DKE: Lower-competence models overestimate to a greater degree, with statistically significant correlations (ρ > 0.64, p < 2×10⁻⁵).
- Intra-model DKE: Within a given model, overestimation is strongly negatively correlated with actual per-domain performance. Models are reliably overconfident in rare languages; the correlation with domain rarity reaches ρ = 0.797 (GitHub ranking).
Specialization and Calibration
Specialized models trained on narrow domains display even stronger DKE (ρ = 0.921) than their generalist counterparts, reinforcing the hypothesis that the breadth of the training distribution affects self-assessment reliability.
Task Structure and Confidence Estimation
While MCQA tasks allow precise confidence elicitation, code generation tasks yield weaker DKE signals, suggesting that confidence estimation is more challenging for open-ended or partially correct outputs.
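One concrete reason MCQA permits sharper elicitation is that confidence can be read off a closed answer set, for example by renormalizing the model's log-probabilities over the options. This is a common technique, not necessarily the paper's exact procedure, and the logprob values below are invented:

```python
import math

def mcqa_confidence(option_logprobs: dict) -> dict:
    """Renormalize log-probabilities over a closed set of answer options
    (e.g. A-D). The maximum of the returned distribution is a precise
    per-question confidence. No such closed set exists for open-ended
    code generation, which is one reason its DKE signal is noisier."""
    m = max(option_logprobs.values())                    # for stability
    exp = {k: math.exp(v - m) for k, v in option_logprobs.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}

# Hypothetical logprobs for the option tokens on one question.
probs = mcqa_confidence({"A": -0.4, "B": -1.9, "C": -2.3, "D": -3.0})
print(probs, "confidence:", max(probs.values()))
```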
Implications
Trust and Interpretability
These results signal caution in relying on LLMs’ self-assessment, especially in rare or under-represented domains. DKE-like bias may mislead end-users—particularly in collaborative or semi-autonomous workflows involving code, where the model's confident output is interpreted as authoritative. Miscalibration risks propagate to downstream tasks including automated evaluation, self-improvement, and multi-agent composition (e.g., reviewer agents).
Cognitive vs Statistical Mechanisms
While the paper reveals behavioral alignment with the human DKE, it remains agnostic about whether the underlying mechanisms are cognitive (e.g., deficient meta-cognition) or statistical (e.g., regression to the mean). The authors highlight the need for future research that disentangles these explanations and explores calibration interventions.
Model Development and Evaluation
The findings underscore the importance of rigorous calibration monitoring: metric selection (absolute vs. relative confidence), specialized vs. broad training, and domain coverage must all be accounted for when designing, deploying, and evaluating code LLMs. There is a clear research incentive to develop architectures and training paradigms that mitigate DKE-like behavior and improve cross-domain calibration.
Authorship and Human-AI Collaboration
As AI systems co-author creative and technical artifacts alongside humans, awareness of cognitive bias patterns (such as DKE) may inform new transparency, trust, and auditing frameworks that ensure robust joint stewardship.
Conclusion
The paper establishes clear, statistically supported evidence of DKE-like bias in code LLMs, whose strength grows as base model competence decreases and as programming-language rarity increases. Calibration reliability varies markedly across models, domains, and task types. These insights motivate further interdisciplinary inquiry into the origins of AI bias, challenge assumptions of LLM reliability in unfamiliar domains, and underscore the importance of improved confidence modeling for trustworthy human-AI collaboration.