- The paper demonstrates that using only the title and summary of case studies yields the highest correlation with expert impact evaluations.
- It employs ChatGPT 4o-mini to analyze 6,220 REF2021 Impact Case Studies, revealing discipline-specific performance differences with correlations from 0.18 to 0.71.
- The study highlights potential biases in automated evaluations, suggesting the need for tailored prompting strategies and norm-referencing across disciplines.
Evaluating ChatGPT's Capacity for Assessing Societal Impact Claims
This paper investigates the potential of employing ChatGPT as a tool in the evaluation of societal impact claims presented in the UK's Research Excellence Framework (REF) Impact Case Studies (ICS). Because the REF uses ICS to gauge the non-academic impacts of research, the question is whether a large language model (LLM) like ChatGPT can help experts scrutinize these narratives efficiently.
Methodology and Key Findings
The researchers used ChatGPT 4o-mini, a smaller and cheaper variant of OpenAI's GPT-4o, to process 6,220 ICS from REF2021. By feeding different sections of these case studies into the model, they determined which combination of text sections yielded scores closest to departmental average ICS scores. Providing ChatGPT with just the title and summary of an ICS produced the highest correlations with expert scores. This suggests that complex, detailed narratives may not enhance machine evaluation of impact claims as one might expect; a more condensed input appears to help the model align with human assessments.
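As a rough illustration of this setup, the sketch below scores a single ICS from its title and summary via the OpenAI API. The prompt wording, the 1-4 scale, and the function name are illustrative assumptions, not the paper's actual prompt.

```python
# Minimal sketch: scoring one ICS from its title and summary alone.
# The prompt wording and the 1-4 scale are illustrative; the paper's
# actual prompts and score mapping may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_ics(title: str, summary: str) -> str:
    """Ask the model for a single impact score for one case study."""
    prompt = (
        "You are an expert REF impact assessor. Rate the societal impact "
        "described below on a scale of 1 (limited) to 4 (outstanding), "
        "considering both reach and significance. Reply with the number only.\n\n"
        f"Title: {title}\n\nSummary: {summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring for reproducibility
    )
    return response.choices[0].message.content.strip()
```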
In the analysis, the reported correlations range from 0.18 for Economics and Econometrics to 0.71 for Sport and Exercise Sciences, Leisure, and Tourism. This variation underscores the model's uneven performance across disciplines, with notable differences in how ChatGPT scores the significance and reach of impact.
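The per-discipline comparison can be reproduced in outline as follows: group model scores by Unit of Assessment and correlate them with departmental averages. The column names and the choice of Spearman's rank correlation are assumptions made for illustration.

```python
# Illustrative check of per-discipline agreement: one correlation per
# Unit of Assessment between model scores and departmental averages.
# The file and column names below are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("ics_scores.csv")  # columns: uoa, model_score, dept_avg_score

for uoa, group in df.groupby("uoa"):
    rho, p_value = spearmanr(group["model_score"], group["dept_avg_score"])
    print(f"{uoa}: rho={rho:.2f} (p={p_value:.3f}, n={len(group)})")
```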
Implications and Challenges
The implications of employing ChatGPT for evaluating ICS are multifaceted. In practical terms, such a model could streamline the labor-intensive process of societal impact assessment, offering preliminary insights that can be confirmed by human experts. Yet, the paper also highlights significant disciplinary biases inherent to the model's evaluations, which suggest that outputs would require norm-referencing across different Units of Assessment (UoAs) to be meaningfully compared.
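One common way to norm-reference, sketched below, is to convert raw scores to z-scores within each UoA, so that a score is read relative to the model's own behaviour in that discipline. The paper calls for norm-referencing in general; this particular z-score formulation is an assumption.

```python
# Simple norm-referencing sketch: within-UoA z-scores, so that each
# score is interpreted relative to the model's typical output for
# that discipline. Column names are hypothetical.
import pandas as pd

def norm_reference(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 'normed_score' column: z-score of model_score within each UoA."""
    grouped = df.groupby("uoa")["model_score"]
    df["normed_score"] = (
        df["model_score"] - grouped.transform("mean")
    ) / grouped.transform("std")
    return df
```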
Another challenge lies in the model's consistent tendency to overestimate impact quality, typically assigning high scores regardless of the narrative's actual content. This inflation could undermine fair evaluation of impact, necessitating tailored prompting strategies and perhaps further model training to refine accuracy.
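A tailored prompt might, for instance, impose a sceptical, norm-referenced frame on the model. The template below is purely hypothetical; whether it actually reduces score inflation would need empirical testing.

```python
# Hypothetical prompt tweak to counter score inflation: anchor the model
# to a norm-referenced expectation and require justification before the
# score. This is a speculative design, not the paper's method.
STRICT_RUBRIC_PROMPT = (
    "You are a sceptical REF assessor. Most case studies you see merit "
    "a 2 or 3; reserve 4 for genuinely exceptional reach AND significance, "
    "and do not hesitate to award 1. First name the strongest piece of "
    "evidence and the weakest claim, then give your score.\n\n"
    "Title: {title}\n\nSummary: {summary}"
)
```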
Theoretical Considerations
From a theoretical standpoint, the paper reveals a tension between automated assessment and the nuanced understanding required to evaluate societal impact. That the model aligns best with experts when given only a restricted input set underscores a potential limitation in its capacity to fully grasp the depth and breadth of the societal impacts claimed within an ICS. This raises questions about the interpretability and context-awareness of LLMs when tasked with complex evaluative roles.
Future Directions
Given the rapid development of AI technologies, future research could examine whether newer or more capable models produce higher correlations and more discriminating evaluations. Exploring the integration of auxiliary data, such as citation indices or altmetrics, alongside narrative inputs might also improve score predictions, potentially aligning automated evaluations more closely with expert judgment.
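Such an augmented predictor could be prototyped as a simple regression over the LLM score plus bibliometric features, as in the speculative sketch below. The feature names and the linear model are assumptions, not something the paper implements.

```python
# Speculative sketch of the suggested augmentation: predict the expert
# score from the LLM's narrative score plus bibliometric signals.
# File, column names, and the linear model are all assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("ics_features.csv")  # columns: llm_score, citations, altmetric, expert_score
X = df[["llm_score", "citations", "altmetric"]]
y = df["expert_score"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)))  # relative weight of each signal
```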
Ultimately, while this paper highlights the promising potential of LLMs like ChatGPT to support societal impact evaluations, it also identifies the challenges that must be addressed before such a system could be reliably embedded in formal assessment frameworks. Further work is essential to mitigate the identified biases and improve disciplinary equity in AI-assisted evaluations.