Are LLMs accurate and unbiased enough for research evaluation roles?

Establish whether large language models can achieve sufficient accuracy and impartiality, as judged by explicit criteria and empirical testing, to play a reliable role in research evaluation workflows, and delineate acceptable use cases if those standards can be met.

Background

LLM outputs can read as plausible whether or not they are correct, and this, coupled with the risk of hidden biases, raises concerns about their deployment in evaluative contexts. Determining their readiness requires systematic benchmarking against confidential expert judgements and stringent bias assessments.

Clear standards for accuracy and fairness are needed to decide whether, and how, LLMs can responsibly support or augment evaluation tasks.
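
As an illustration only, the kind of benchmarking described above could be sketched as follows. The sketch assumes each research output has one LLM score and one expert score on the same numeric scale, plus a group label (for example field or institution type) for the bias check. The function names, the use of Spearman rank correlation as the accuracy measure, and the thresholds `min_rho` and `max_gap` are all hypothetical choices, not criteria taken from the source.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def group_score_gaps(llm, expert, groups):
    """Mean (LLM score - expert score) for each group label."""
    diffs = {}
    for l, e, g in zip(llm, expert, groups):
        diffs.setdefault(g, []).append(l - e)
    return {g: mean(d) for g, d in diffs.items()}


def meets_standard(llm, expert, groups, min_rho, max_gap):
    """Hypothetical pass/fail rule: agreement is high enough and no group
    is over- or under-scored by much more than another."""
    rho = spearman(llm, expert)
    gaps = group_score_gaps(llm, expert, groups)
    spread = max(gaps.values()) - min(gaps.values())
    return rho >= min_rho and spread <= max_gap


if __name__ == "__main__":
    # Made-up scores purely to exercise the functions.
    expert = [1, 2, 3, 4, 2, 3]
    llm = [2, 2, 3, 4, 1, 3]
    groups = ["A", "A", "A", "B", "B", "B"]
    print(spearman(llm, expert))
    print(group_score_gaps(llm, expert, groups))
    print(meets_standard(llm, expert, groups, min_rho=0.8, max_gap=0.5))
```

A real assessment would need far larger samples, uncertainty estimates, and agreed thresholds; the point of the sketch is only that both the accuracy criterion and the bias criterion have to be stated explicitly before an LLM can be said to meet them.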

References

It is not clear yet whether LLMs like ChatGPT can be made accurate and unbiased enough to have a role in research evaluation.

Source: Quantitative Methods in Research Evaluation: Citation Indicators, Altmetrics, and Artificial Intelligence (arXiv:2407.00135, Thelwall, 28 Jun 2024), Section 14.10
