
Can ChatGPT evaluate research quality? (2402.05519v1)

Published 8 Feb 2024 in cs.DL and cs.AI

Abstract: Purpose: Assess whether ChatGPT 4.0 is accurate enough to perform research evaluations on journal articles to automate this time-consuming task. Design/methodology/approach: Test the extent to which ChatGPT-4 can assess the quality of journal articles using a case study of the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 to create a research evaluation ChatGPT. This was applied to 51 of my own articles and compared against my own quality judgements. Findings: ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores of the same documents (averaging r=0.281 over 15 iterations, with 8 being statistically significantly different from 0). In contrast, the average scores from the 15 iterations produced a statistically significant positive correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than individual scores. The positive correlation may be due to ChatGPT being able to extract the author's significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, then the correlation with average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations. Research limitations: The data is self-evaluations of a convenience sample of articles from one academic in one field. Practical implications: Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks. Research evaluators, including journal editors, should therefore take steps to control its use. Originality/value: This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.


Summary

  • The paper shows that ChatGPT-4 produces plausible REF-style evaluations, but its individual iteration scores correlate only weakly with the author's self-assessments (mean Pearson r = 0.281 over 15 iterations), improving to r = 0.509 when the scores are averaged.
  • It applies the UK REF 2021 framework to a convenience sample of 51 articles, highlighting a bias towards higher ratings concentrated around the 3* score.
  • The study cautions against unsupervised use of LLMs for research evaluations and calls for further integration with traditional methods to address precision limitations.

Evaluation of ChatGPT 4.0's Capability for Research Quality Assessment

The paper by Mike Thelwall addresses the potential of ChatGPT 4.0, an LLM, for automating research quality evaluation tasks. Specifically, it investigates whether the model can reliably assess the quality of academic journal articles, with the aim of alleviating the burden of traditional peer review and post-publication evaluation.

The research applies the UK Research Excellence Framework (REF) 2021 quality criteria to a convenience sample of 51 articles authored by Thelwall himself. The REF criteria assess research outputs on originality, significance, and rigor. ChatGPT-4's evaluations were compared against the author's own quality judgements. The paper represents the first published accuracy test of ChatGPT for post-publication expert review.
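
The scoring setup lends itself to a programmatic illustration. Below is a minimal sketch of how a comparable REF-style evaluator could be driven through the OpenAI chat API. The model name, prompt wording, and `score_article` helper are assumptions introduced for this example; the paper itself used a custom GPT configured in the ChatGPT interface with the published REF scoring guidelines, not this code.

```python
# Minimal sketch of a REF-style scoring call (illustrative assumptions only; the
# paper configured a custom GPT in the ChatGPT interface rather than using code).
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is available in the environment

REF_SYSTEM_PROMPT = (
    "You are a research quality assessor following the UK REF 2021 criteria. "
    "Judge the article's originality, significance, and rigour, then award a "
    "single star rating from 1* to 4* with a brief rationale."
)

def score_article(article_text: str, model: str = "gpt-4") -> str:
    """Request one REF-style rating and rationale for a single article."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REF_SYSTEM_PROMPT},
            {"role": "user", "content": article_text},
        ],
    )
    return response.choices[0].message.content
```

In the paper's design, each article is scored repeatedly (15 iterations) and the resulting ratings are compared against the author's own judgements, which is what the correlation analysis below examines.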

Key Findings

The primary findings indicate that while ChatGPT-4 can produce plausible summaries and rationales aligned with the REF criteria, its individual evaluation scores correlate only weakly with Thelwall's own scores, averaging a Pearson correlation coefficient of 0.281 over 15 iterations. Notably, averaging the scores from the 15 iterations yields a stronger, statistically significant correlation of 0.509 with the author's scores. The author suggests that ChatGPT-4 may have relied on extracting the significance, rigor, and originality claims presented within the articles themselves.
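
The benefit of averaging can be illustrated with a short computation: correlate each iteration's scores with the self-evaluation scores, then correlate the per-article mean across iterations. The sketch below uses synthetic placeholder data, not the paper's actual scores, to show why averaging noisy ratings tends to raise the correlation.

```python
# Illustration of per-iteration vs averaged-score correlations
# (synthetic data only, not the paper's actual scores).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_articles, n_iterations = 51, 15

self_scores = rng.uniform(1, 4, n_articles)  # stand-in for the author's self-evaluations
noise = rng.normal(0, 1.0, (n_iterations, n_articles))
# Simulated ChatGPT scores: clustered near 3 with a weak signal plus noise.
gpt_scores = np.clip(3 + 0.3 * (self_scores - self_scores.mean()) + noise, 1, 4)

# Each single iteration correlates only weakly with the self-scores...
per_iteration_r = [pearsonr(gpt_scores[i], self_scores)[0] for i in range(n_iterations)]
print("mean per-iteration r:", np.mean(per_iteration_r))

# ...but averaging over the 15 iterations reduces the noise and raises the correlation.
averaged_r = pearsonr(gpt_scores.mean(axis=0), self_scores)[0]
print("averaged-score r:", averaged_r)
```

With noisy per-iteration ratings, the averaged scores track the underlying signal more closely, which qualitatively mirrors the paper's jump from r = 0.281 for individual iterations to r = 0.509 for averaged scores.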

Furthermore, the paper notes that ChatGPT's scores were predominantly concentrated around the 3* rating, showing a bias towards higher scores compared to the author's more variable self-evaluations. When the weakest articles are removed from the sample, the correlation with the averaged scores falls to r=0.200 and loses statistical significance, indicating that ChatGPT struggles to differentiate between varying levels of research quality. Its failure to use the full range of scores highlights a limitation in precision for fine-grained assessments, particularly among high-quality articles.

Limitations and Implications

The analysis acknowledges substantial limitations: the article sample is homogeneous and the benchmark consists of the author's self-evaluations, which introduces subjective bias. Conclusions about ChatGPT's performance might therefore differ in other fields, with other samples, or with independent REF assessors rather than the author. Further research is needed on the performance of current and future LLMs under different configurations and task specifications.

Practically, LLMs like ChatGPT are not yet recommended for formal or informal research quality evaluation. Misuse or overreliance could lead to flawed assessments or breach copyright conditions, especially for unpublished or non-open-access documents. The paper therefore recommends that research evaluators, including journal editors, take steps to control the use of LLMs in editorial and review processes, given the risk of superficially plausible yet inaccurate assessments.

Theoretical Reflections and Future Directions

Theoretically, the paper provides an empirical foundation for understanding ChatGPT's partial success in research evaluation. Its reliance on document-internal claims points to a limited ability to contextualize exaggerated author assertions or satire: in one experiment, ChatGPT erroneously scored a satirical article highly, showing that it does not yet exhibit advanced critical reasoning or fact-checking capabilities.

Future work could explore integrating LLMs with traditional algorithmic approaches, for example using LLMs to flag confidence gaps in existing automated systems or as an ancillary tool in lower-stakes evaluation scenarios. However, extensive work remains before such techniques can be applied with confidence to high-stakes evaluative workflows. As LLMs evolve, their applications in scholarly domains may expand significantly, provided their ability to discern the context and intricacies of academic discourse continues to improve. Until then, human oversight remains indispensable in research quality assessment.
