Are LLMs accurate and unbiased enough for research evaluation roles?
Establish, using explicit criteria and empirical testing, whether large language models can be made accurate and impartial enough to play a reliable role in research evaluation workflows, and, if those standards can be met, delineate the acceptable use cases.
References
It is not clear yet whether LLMs like ChatGPT can be made accurate and unbiased enough to have a role in research evaluation.
— Quantitative Methods in Research Evaluation Citation Indicators, Altmetrics, and Artificial Intelligence
(2407.00135 - Thelwall, 28 Jun 2024) in Section 14.10