- The paper demonstrates ChatGPT's capability to estimate article quality by comparing generated scores with REF2021 departmental averages.
- It identifies strong correlations (up to 0.78 in Psychology, Psychiatry, and Neuroscience) and highlights discipline-specific variation, with Clinical Medicine showing a negative correlation.
- The study implies that while LLMs can support automated research evaluations, their accuracy depends on abstract clarity and departmental practices, necessitating further refinement.
An Evaluation of ChatGPT's Ability to Detect Journal Article Quality Across Disciplines
The paper explores the potential of ChatGPT 4o-mini to estimate the quality of journal articles across academic fields, using data from the UK’s Research Excellence Framework (REF) 2021. The authors investigated whether LLMs could serve as reliable automated tools for research evaluation, potentially reducing the time and effort academics spend on this labor-intensive process.
Methodological Approach
Using Spearman correlations, the study compared ChatGPT-generated scores, based only on article titles and abstracts, with departmental average scores from REF2021. The sample consisted of 200 articles drawn from the highest- and lowest-quartile departments across 34 Units of Assessment (UoAs), spanning a broad range of academic disciplines; a sketch of how such a pipeline could be assembled follows.
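As an illustration only, and not the paper's actual protocol, the Python sketch below shows how title-and-abstract scoring with gpt-4o-mini could be correlated with departmental averages. The file name, column names, prompt wording, and score-parsing step are all assumptions introduced here.

```python
# Sketch: score title+abstract pairs with an LLM and correlate the scores,
# per Unit of Assessment, with REF2021 departmental averages.
# Assumes a hypothetical articles.csv with columns: uoa, title, abstract, dept_avg_score.
import re

import pandas as pd
from openai import OpenAI          # pip install openai
from scipy.stats import spearmanr  # pip install scipy

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are an experienced research assessor. Rate the likely quality of the "
    "article below on the REF 1*-4* scale and reply with a single number.\n\n"
    "Title: {title}\n\nAbstract: {abstract}"
)

def score_article(title: str, abstract: str) -> float:
    """Ask the model for a 1-4 quality score; the prompt is illustrative, not the paper's."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(title=title, abstract=abstract)}],
    )
    match = re.search(r"\d(?:\.\d+)?", resp.choices[0].message.content)
    return float(match.group()) if match else float("nan")

articles = pd.read_csv("articles.csv")
articles["gpt_score"] = [
    score_article(t, a) for t, a in zip(articles["title"], articles["abstract"])
]

# Spearman correlation between model scores and departmental averages, per UoA.
for uoa, group in articles.groupby("uoa"):
    rho, p_value = spearmanr(group["gpt_score"], group["dept_avg_score"])
    print(f"{uoa}: rho = {rho:.2f} (n = {len(group)})")
```

Spearman rather than Pearson correlation fits this setting because both the model's scores and the REF quality scale are ordinal.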
Key Findings
- The Spearman correlations between ChatGPT scores and departmental averages were positive for nearly all UoAs, with noteworthy variability. The strongest correlation appeared in Psychology, Psychiatry, and Neuroscience (0.78), suggesting that LLMs may provide reliable estimations in certain fields.
- In contrast, Clinical Medicine was an outlier with a negative correlation (-0.12), raising questions about the model's efficacy in fields where clinical applicability is significant.
- The paper found that in several disciplines the ChatGPT correlations were close to the theoretical maximum, i.e. the highest correlation attainable when article-level score estimates are compared against department-level averages (see the sketch after this list), indicating close alignment in those fields.
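To make that ceiling concrete: because every article in a department is benchmarked against the same departmental average, even a perfect article-level predictor could not reach a correlation of 1. The simulation below, with all quantities invented purely for illustration, shows the ceiling under one assumed level of within-department variation.

```python
# Sketch: why departmental averages impose a ceiling on achievable correlation.
# All numbers below are invented purely for illustration.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

n_depts, per_dept = 40, 30
dept_quality = rng.uniform(1.0, 4.0, size=n_depts)   # assumed department-level quality
true_scores = np.clip(                                # article scores vary within a department
    rng.normal(loc=np.repeat(dept_quality, per_dept), scale=0.8), 1.0, 4.0
)
dept_avg = np.repeat(
    true_scores.reshape(n_depts, per_dept).mean(axis=1), per_dept
)

# A "perfect" article-level predictor returns the true article score, yet its
# correlation with departmental averages still falls short of 1 because all
# articles in a department share one benchmark value.
ceiling, _ = spearmanr(true_scores, dept_avg)
print(f"correlation ceiling under these assumptions: {ceiling:.2f}")
```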
Critical Observations
A critical insight is that ChatGPT's performance varied substantially between disciplines. The strong results in areas such as psychology may reflect that abstracts in these fields encapsulate the core contribution succinctly, giving the model enough signal to judge quality. Conversely, the negative result in Clinical Medicine may stem from the highly structured format of its articles, which ChatGPT appeared to have difficulty processing effectively without their section headers.
The paper also highlighted that higher-scoring departments may produce consistently well-written abstracts, so ChatGPT's outputs could partly reflect departmental writing practices rather than the merit of each paper alone. This complicates interpretation: departmental conventions feed into the evaluation signal beyond any individual paper's inherent quality.
Implications and Speculation
The implications of employing ChatGPT in research evaluation are substantial. If implemented judiciously, LLMs could support peer-review processes where citation data are insufficient or unavailable. However, the field-specific biases observed call for a cautious approach, particularly in interdisciplinary assessments, where differential weightings across fields could skew the results.
The paper implicitly challenges the current reliance on citation-based indicators by presenting an alternative that works independently of citation data. This opens the possibility of faster, and potentially more insightful, evaluations in research domains where citations accumulate only slowly after publication.
Future Prospects
Further development of AI-based assessment could focus on improving accuracy in the fields that showed weaker correlations. Incorporating full-text analysis or improved abstract parsing might reduce some of the observed disparities. Taking account of factors such as the complexity of abstracts or the nature of the claims made could also give a more nuanced picture of an article's quality.
Conclusion
This investigation marks a significant step towards understanding how far AI, represented here by ChatGPT, can be integrated into academic quality assessment. While promising results emerged for specific fields, the challenges highlighted call for continued research to refine these automated methods and to establish their reliability across a broader academic spectrum. The results also prompt a wider discussion of how AI intersects with traditional peer review, pressing the academic community to navigate the adoption of these tools while safeguarding the integrity of research evaluation.