Disentangle ChatGPT bias from genuine international quality differences in country-associated score gaps

Determine whether the observed differences in ChatGPT 4o-mini REF-style research quality scores across first-author countries are caused by bias within ChatGPT, by underlying international differences in research quality, or by both, and quantify the relative contribution of each factor within each Scopus broad field so that international comparisons can be made fairly.

Background

The study’s regressions revealed systematic differences in ChatGPT-derived REF-style quality scores by first-author country: most top-publishing nations tended to receive higher scores, with Canada showing the most consistent positive association. The authors caution that such differences could reflect either model bias or genuine cross-national quality differences, and that both explanations are plausible.

Because these scores are intended for research evaluation and cross-national comparison, distinguishing model bias from authentic quality variation is essential. The paper explicitly calls for further research to identify which factors drive the observed country effects and to measure their relative importance within each field.
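One plausible starting point, sketched below, is a per-field attenuation analysis: regress ChatGPT scores on first-author country with and without an independent quality proxy such as a field-normalised citation percentile. Country gaps that shrink once the proxy is controlled for are consistent with genuine quality differences; gaps that persist are candidate ChatGPT bias. This is a minimal sketch under stated assumptions, not the paper's method: the DataFrame columns (score, country, field, cit_pct) are hypothetical, and citation-based proxies are themselves imperfect measures of quality.

# Hedged sketch: per-field attenuation analysis of country score gaps.
# Assumes a hypothetical per-article DataFrame `df` with columns:
#   score   - ChatGPT 4o-mini REF-style quality score (1-4)
#   country - first-author country
#   field   - Scopus broad field
#   cit_pct - field-normalised citation percentile (imperfect quality proxy)
import pandas as pd
import statsmodels.formula.api as smf

def country_effects(data: pd.DataFrame, formula: str) -> pd.Series:
    """Fit OLS and return the first-author-country coefficients
    (each relative to the reference country chosen by the formula engine)."""
    fit = smf.ols(formula, data=data).fit()
    return fit.params[fit.params.index.str.startswith("C(country)")]

def decompose_by_field(df: pd.DataFrame) -> pd.DataFrame:
    """Compare raw country gaps with the gaps that remain after
    controlling for the quality proxy, separately within each field."""
    rows = []
    for field, sub in df.groupby("field"):
        raw = country_effects(sub, "score ~ C(country)")
        adjusted = country_effects(sub, "score ~ C(country) + cit_pct")
        rows.append(pd.DataFrame({
            "field": field,
            "raw_gap": raw,
            "adjusted_gap": adjusted,
            # Crude attenuation share; interpret with caution when raw gaps are small.
            "share_explained": 1 - adjusted / raw,
        }))
    return pd.concat(rows)

Even under this design, residual gaps should be interpreted cautiously, since citation rates themselves vary internationally for reasons unrelated to quality.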

References

The first author country differences found could indicate ChatGPT bias and/or underlying international differences in the quality of research, with the latter being widely believed to occur by policy makers. Further research is needed to identify whether both are contributors and, if so, the relative balance between them within each field.

Research evaluation with ChatGPT: Is it age, country, length, or field biased? (arXiv:2411.09768, Thelwall et al., 2024), Conclusions.