Explaining why ChatGPT achieves partial accuracy in REF scoring

Ascertain the mechanisms by which ChatGPT-4 attains some degree of accuracy when estimating Research Excellence Framework (REF) 2021 quality scores; specifically, determine whether its performance primarily arises from inferring quality from author-stated claims within the article text rather than applying external knowledge.

Background

The paper shows that ChatGPT-4 frequently produces plausible assessments aligned with REF criteria, but that its scores agree only weakly to moderately with the author’s own quality scores, suggesting limited evaluative capability. Despite this, a statistically significant positive correlation emerges when scores are averaged over multiple runs.
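The effect of averaging over multiple runs can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the paper's method or results: the function names, the linear signal-plus-noise model, and every parameter value (number of articles, number of runs, signal weight, noise level) are assumptions chosen only for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_articles=200, n_runs=15, signal=0.3, noise_sd=1.5, seed=0):
    """Compare single-run and run-averaged correlation on synthetic scores.

    All values are hypothetical; nothing here is taken from the paper.
    """
    rng = random.Random(seed)
    # Synthetic reference scores on a 1-4 scale (REF uses 1* to 4*).
    reference = [rng.randint(1, 4) for _ in range(n_articles)]
    # Assumed model: each run sees a weak signal plus independent noise.
    runs = [
        [signal * r + rng.gauss(0, noise_sd) for r in reference]
        for _ in range(n_runs)
    ]
    single_run_corr = pearson(reference, runs[0])
    # Average each article's score across runs, then correlate.
    averaged = [statistics.fmean(article_scores) for article_scores in zip(*runs)]
    averaged_corr = pearson(reference, averaged)
    return single_run_corr, averaged_corr


if __name__ == "__main__":
    single, averaged = simulate()
    print(f"single run: {single:.2f}, averaged over runs: {averaged:.2f}")
```

Under this assumed model, averaging cancels much of the independent per-run noise while the shared signal remains, so the averaged scores track the reference scores more closely than any single run does. The sketch says nothing about where the signal comes from, which is the open question.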

The author explicitly states that the reason for this partial accuracy is unknown and hypothesizes that ChatGPT may rely mainly on claims made within the article text instead of integrating external information. Clarifying the source of this partial accuracy is an open question.

References

It is not clear why it can score articles with some degree of accuracy, but it might typically deduce them from author claims inside an article rather than by primarily applying external information.

— Can ChatGPT evaluate research quality? (2402.05519 - Thelwall, 8 Feb 2024) in Section 6 (Conclusion)
