Cause of near-maximal alignment between ChatGPT and departmental REF correlations

Determine why, in some cases, the Spearman correlation between ChatGPT 4o-mini’s article-level quality scores (derived from titles and abstracts) and departmental average REF2021 scores is implausibly close to the estimated Spearman correlation between actual individual article REF scores and departmental averages. Assess whether this closeness is driven by content-based evaluation of abstracts, department-linked metadata, or other field-specific factors.

Background

The paper correlates ChatGPT 4o-mini’s quality scores for REF2021 journal article abstracts with departmental average REF scores, and separately bootstraps the theoretical maximum correlation between individual article REF scores and departmental averages. In several Units of Assessment (UoAs), the observed ChatGPT–department correlation was strikingly close to this estimated maximum.
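To make the idea of an estimated maximum concrete, the following is a minimal sketch of one way such a ceiling could be simulated. It is not the authors' procedure. It assumes that only each department's score profile (the share of outputs at each of the four REF quality levels) is known, draws synthetic article scores from those profiles, and averages the Spearman correlation between the drawn scores and the departmental means over repeated rounds. The names `simulated_ceiling`, `profiles`, and `n_articles`, and the three example departments, are hypothetical.

```python
import random
from statistics import mean

LEVELS = (1, 2, 3, 4)  # REF quality levels 1* to 4*


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def simulated_ceiling(profiles, n_articles, n_rounds=200, seed=0):
    """Average article-vs-department-mean Spearman correlation over rounds.

    profiles:   hypothetical dict, department -> weights for levels 1..4
    n_articles: hypothetical dict, department -> number of articles
    """
    rng = random.Random(seed)
    # Departmental average implied by each score profile.
    dept_avg = {
        d: sum(s * w for s, w in zip(LEVELS, p)) / sum(p)
        for d, p in profiles.items()
    }
    rhos = []
    for _ in range(n_rounds):
        article_scores, dept_means = [], []
        for dept, weights in profiles.items():
            draws = rng.choices(LEVELS, weights=weights, k=n_articles[dept])
            article_scores.extend(draws)
            dept_means.extend([dept_avg[dept]] * n_articles[dept])
        rhos.append(spearman(article_scores, dept_means))
    return mean(rhos)


# Made-up profiles for three hypothetical departments (not REF2021 data).
profiles = {"A": [0, 10, 40, 50], "B": [5, 35, 45, 15], "C": [10, 50, 35, 5]}
n_articles = {"A": 100, "B": 100, "C": 100}
ceiling = simulated_ceiling(profiles, n_articles)
print(f"simulated ceiling: {ceiling:.3f}")
```

Because articles within a department vary in quality, the simulated value stays below 1 even when every article is scored perfectly. That is why a ChatGPT–department correlation approaching this value is surprising: it would suggest the model's scores track departmental averages about as well as the true article scores do.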

The authors suggest potential explanations, including the possibility that higher-quality departments present more ambitious or consistent claims in abstracts or that ChatGPT may implicitly link articles to departmental identities and public REF scores. Clarifying the mechanism is essential for interpreting the validity and independence of ChatGPT-based assessments.

References

It is not clear why the ChatGPT correlations with departmental averages were sometimes implausibly close to the estimated correlations between article scores and departmental averages (Figure 1).

In which fields can ChatGPT detect journal article quality? An evaluation of REF2021 results (2409.16695 - Thelwall et al., 25 Sep 2024) in Section: High ChatGPT correlation with departmental average scores compared to estimated correlation between article scores and departmental averages (Figure 5 discussion)