- The paper demonstrates that using only the title and summary of case studies yields the highest correlation with expert impact evaluations.
- It employs ChatGPT 4o-mini to analyze 6,220 REF2021 Impact Case Studies, revealing discipline-specific performance differences with correlations from 0.18 to 0.71.
- The study highlights potential biases in automated evaluations, suggesting the need for tailored prompting strategies and norm-referencing across disciplines.
Evaluating ChatGPT's Capacity for Assessing Societal Impact Claims
This paper investigates the potential of employing ChatGPT as a tool in the evaluation of societal impact claims presented in the UK's Research Excellence Framework (REF) Impact Case Studies (ICS). Because the REF uses ICS to gauge the non-academic impacts of research, the question is whether a large language model (LLM) like ChatGPT can help experts scrutinize these narratives efficiently.
Methodology and Key Findings
The researchers used ChatGPT 4o-mini, a smaller and cheaper variant of OpenAI's GPT-4o, to process 6,220 ICS from REF2021. By feeding different sections of these case studies into the model, they determined which combination of text sections yielded scores closest to departmental average ICS scores. Providing ChatGPT with just the title and summary of an ICS produced the highest correlations with expert scores. This suggests that complex, detailed narratives may not enhance machine evaluation of impact claims as one might expect; a more condensed input appears to help the model align with human assessments.
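As a rough illustration of this setup, the sketch below scores a single ICS from its title and summary via the OpenAI API. The prompt wording, the 1-4 scale, and the function name are illustrative assumptions, not the paper's actual prompt.

```python
# Minimal sketch: scoring one ICS from its title and summary alone.
# The prompt wording and the 1-4 scale are illustrative; the paper's
# actual prompts and score mapping may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_ics(title: str, summary: str) -> str:
    """Ask the model for a single impact score for one case study."""
    prompt = (
        "You are an expert REF impact assessor. Rate the societal impact "
        "described below on a scale of 1 (limited) to 4 (outstanding), "
        "considering both reach and significance. Reply with the number only.\n\n"
        f"Title: {title}\n\nSummary: {summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring for reproducibility
    )
    return response.choices[0].message.content.strip()
```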
In the analysis, the reported correlations range from 0.18 for Economics and Econometrics to 0.71 for Sport and Exercise Sciences, Leisure, and Tourism. This variation underscores the model's uneven performance across disciplines, with notable differences in how ChatGPT scores the significance and reach of impact.
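The per-discipline comparison can be reproduced in outline as follows: group model scores by Unit of Assessment and correlate them with departmental averages. The column names and the choice of Spearman's rank correlation are assumptions made for illustration.

```python
# Illustrative check of per-discipline agreement: one correlation per
# Unit of Assessment between model scores and departmental averages.
# The file and column names below are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("ics_scores.csv")  # columns: uoa, model_score, dept_avg_score

for uoa, group in df.groupby("uoa"):
    rho, p_value = spearmanr(group["model_score"], group["dept_avg_score"])
    print(f"{uoa}: rho={rho:.2f} (p={p_value:.3f}, n={len(group)})")
```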
Implications and Challenges
The implications of employing ChatGPT for evaluating ICS are multifaceted. In practical terms, such a model could streamline the labor-intensive process of societal impact assessment, offering preliminary insights that can be confirmed by human experts. Yet, the paper also highlights significant disciplinary biases inherent to the model's evaluations, which suggest that outputs would require norm-referencing across different Units of Assessment (UoAs) to be meaningfully compared.
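One common way to norm-reference, sketched below, is to convert raw scores to z-scores within each UoA, so that a score is read relative to the model's own behaviour in that discipline. The paper calls for norm-referencing in general; this particular z-score formulation is an assumption.

```python
# Simple norm-referencing sketch: within-UoA z-scores, so that each
# score is interpreted relative to the model's typical output for
# that discipline. Column names are hypothetical.
import pandas as pd

def norm_reference(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 'normed_score' column: z-score of model_score within each UoA."""
    grouped = df.groupby("uoa")["model_score"]
    df["normed_score"] = (
        df["model_score"] - grouped.transform("mean")
    ) / grouped.transform("std")
    return df
```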
Another challenge lies in the model's consistent tendency to overestimate impact quality, typically assigning high scores regardless of the narrative's actual content. This inflation could undermine fair evaluation of impact, necessitating tailored prompting strategies and perhaps further model training to refine accuracy.
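A tailored prompt might, for instance, impose a sceptical, norm-referenced frame on the model. The template below is purely hypothetical; whether it actually reduces score inflation would need empirical testing.

```python
# Hypothetical prompt tweak to counter score inflation: anchor the model
# to a norm-referenced expectation and require justification before the
# score. This is a speculative design, not the paper's method.
STRICT_RUBRIC_PROMPT = (
    "You are a sceptical REF assessor. Most case studies you see merit "
    "a 2 or 3; reserve 4 for genuinely exceptional reach AND significance, "
    "and do not hesitate to award 1. First name the strongest piece of "
    "evidence and the weakest claim, then give your score.\n\n"
    "Title: {title}\n\nSummary: {summary}"
)
```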
Theoretical Considerations
From a theoretical standpoint, the paper reveals a tension between automated assessment and the nuanced understanding required to evaluate societal impact. That the model aligns best with experts when given only a restricted input set underscores a potential limitation in its capacity to fully grasp the depth and breadth of the societal impacts claimed within an ICS. This raises questions about the interpretability and context-awareness of LLMs when tasked with complex evaluative roles.
Future Directions
Given the rapid development of AI technologies, future research could examine whether newer or more capable models produce higher correlations and more discriminating evaluations. Exploring the integration of auxiliary data, such as citation indices or altmetrics, alongside narrative inputs might also improve score predictions, potentially aligning automated evaluations more closely with expert judgment.
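Such an augmented predictor could be prototyped as a simple regression over the LLM score plus bibliometric features, as in the speculative sketch below. The feature names and the linear model are assumptions, not something the paper implements.

```python
# Speculative sketch of the suggested augmentation: predict the expert
# score from the LLM's narrative score plus bibliometric signals.
# File, column names, and the linear model are all assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("ics_features.csv")  # columns: llm_score, citations, altmetric, expert_score
X = df[["llm_score", "citations", "altmetric"]]
y = df["expert_score"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)))  # relative weight of each signal
```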
Ultimately, while this paper highlights the promising potential of LLMs like ChatGPT to support societal impact evaluations, it also identifies the challenges that must be addressed before such a system could be reliably embedded in formal assessment frameworks. Further work is essential to mitigate the identified biases and improve disciplinary equity in AI-assisted evaluations.