
Can AI Solve the Peer Review Crisis? A Large Scale Cross Model Experiment of LLMs' Performance and Biases in Evaluating over 1000 Economics Papers (2502.00070v2)

Published 31 Jan 2025 in cs.CY, cs.AI, econ.GN, and q-fin.EC

Abstract: This study examines the potential of LLMs to augment the academic peer review process by reliably evaluating the quality of economics research without introducing systematic bias. We conduct one of the first large-scale experimental assessments of four LLMs (GPT-4o, Claude 3.5, Gemma 3, and LLaMA 3.3) across two complementary experiments. In the first, we use nonparametric binscatter and linear regression techniques to analyze over 29,000 evaluations of 1,220 anonymized papers drawn from 110 economics journals excluded from the training data of current LLMs, along with a set of AI-generated submissions. The results show that LLMs consistently distinguish between higher- and lower-quality research based solely on textual content, producing quality gradients that closely align with established journal prestige measures. Claude and Gemma perform exceptionally well in capturing these gradients, while GPT excels in detecting AI-generated content. The second experiment comprises 8,910 evaluations designed to assess whether LLMs replicate human-like biases in single-blind reviews. By systematically varying author gender, institutional affiliation, and academic prominence across 330 papers, we find that GPT, Gemma, and LLaMA assign significantly higher ratings to submissions from top male authors and elite institutions relative to the same papers presented anonymously. These results emphasize the importance of excluding author-identifying information when deploying LLMs in editorial screening. Overall, our findings provide compelling evidence and practical guidance for integrating LLMs into peer review to enhance efficiency, improve accuracy, and promote equity in the publication process of economics research.

Summary

  • The paper demonstrates LLMs' ability to differentiate paper quality, analyzing over 29,000 evaluations of 1,220 anonymized papers.
  • It reveals a 2-3% bias favoring authors from prestigious institutions and male authors, highlighting significant fairness concerns.
  • The paper advocates integrating AI with human oversight to combine efficiency and nuanced judgment in the peer review process.

Evaluating the Potential of AI in Economic Peer Review

The paper "Can AI Solve the Peer Review Crisis? A Large Scale Cross Model Experiment of LLMs' Performance and Biases in Evaluating over 1000 Economics Papers" investigates whether LLMs can address the peer review challenges in economics. By analyzing over 29,000 evaluations of 1,220 anonymized papers in a quality experiment, plus 8,910 evaluations in a companion bias experiment, the paper delves deeply into the capabilities and limitations of LLMs, offering a nuanced perspective on integrating AI into the peer review process as it currently operates.

The researchers set out with several objectives: to evaluate how effectively LLMs distinguish between high-, medium-, and low-quality economics papers; to assess the presence and extent of biases linked to author characteristics such as affiliation, reputation, and gender; and to propose solutions for the equitable deployment of AI in peer review.

The findings of this comprehensive analysis suggest that LLMs are proficient in differentiating paper quality across established journal hierarchies. This proficiency implies a potential reduction in editorial workloads, particularly during the initial screening phases, thereby alleviating some of the bottlenecks within the traditional peer review process. However, the technology's susceptibility to bias remains a significant concern. The LLMs favor submissions attributed to authors at prestigious institutions and to prominent male authors, assigning them a 2-3% rating premium over the same papers presented anonymously. Furthermore, most of the models struggle to distinguish genuinely high-quality papers from sophisticated AI-generated submissions, with GPT a notable exception in detecting AI-generated content.

The experimental methodology employed in this paper is noteworthy. By systematically varying author characteristics and simulating the peer review process with actual submissions from different journal tiers, the authors established a robust testing environment. Using papers published in 2024-2025, drawn from journals excluded from current LLMs' training data, along with AI-generated submissions designed to match high-quality standards, reduced the risk that evaluations merely reflected memorized content. By employing ordinary least squares (OLS) regression, ordered logit models, and nonparametric binscatter techniques, the researchers quantified the biases and assessment capabilities of the LLMs across multiple dimensions of academic success, such as citations, funding competitiveness, and conference acceptances.
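The core of the bias experiment is a regression of LLM ratings on author-identity treatments while controlling for paper quality. A minimal sketch of that estimation strategy, using synthetic stand-in data (all coefficients, sample sizes, and noise levels below are illustrative assumptions, not numbers from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: ratings of papers shown under different
# author-identity treatments (values are assumptions for illustration).
n = 2000
elite_inst = rng.integers(0, 2, n).astype(float)  # 1 = elite affiliation shown
male_prom = rng.integers(0, 2, n).astype(float)   # 1 = prominent male author shown
quality = rng.normal(0.0, 1.0, n)                 # latent paper-quality control
# Assumed data-generating process with small identity premiums:
rating = (5.0 + 0.12 * elite_inst + 0.10 * male_prom
          + 0.8 * quality + rng.normal(0.0, 0.3, n))

# OLS: rating ~ const + elite + male + quality, via least squares.
X = np.column_stack([np.ones(n), elite_inst, male_prom, quality])
beta, *_ = np.linalg.lstsq(X, rating, rcond=None)
const, b_elite, b_male, b_quality = beta
print(f"elite-institution premium: {b_elite:.3f}")
print(f"male-prominence premium:  {b_male:.3f}")
```

On this synthetic sample, the recovered coefficients sit close to the assumed premiums; the paper's actual specifications (including the ordered logit variants) would additionally account for the discrete rating scale.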

Despite the efficiency gains in assessing paper quality, the presence of biases and the limitations in discerning AI-generated papers underscore the need for a cautious approach. The authors advocate integrating LLMs within a hybrid peer review framework that combines AI and human assessments to mitigate these biases. This hybrid model could capitalize on the efficiency advantages of LLMs while retaining human judgment for nuanced decisions and the critical evaluation of data integrity.
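One way such a hybrid framework could route submissions is a simple triage rule: the model's score only gates the extremes, and everything ambiguous or flagged goes to human reviewers. The routing labels, thresholds, and `Submission` fields below are hypothetical, not taken from the paper:

```python
from dataclasses import dataclass


@dataclass
class Submission:
    title: str
    llm_score: float         # model rating of the anonymized manuscript
    ai_generated_flag: bool  # model's judgment that the text is AI-generated


def triage(sub: Submission, desk_check_below: float = 3.0,
           fast_track_above: float = 8.0) -> str:
    """Hypothetical hybrid routing: LLM output gates only the extremes,
    and all ambiguous or flagged cases go to human reviewers."""
    if sub.ai_generated_flag:
        return "human review (possible AI-generated text)"
    if sub.llm_score < desk_check_below:
        return "editor check before desk rejection"
    if sub.llm_score > fast_track_above:
        return "expedited human review"
    return "standard human review"


print(triage(Submission("Paper A", llm_score=2.1, ai_generated_flag=False)))
```

Keeping a human in the loop at every exit point reflects the paper's recommendation that AI handle volume while people retain final judgment.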

The theoretical model proposed delineates the complex interaction between AI and human biases in peer review. By leveraging AI's ability to rapidly process large volumes of textual information, editors could significantly accelerate the initial stages of paper evaluation. However, the biases intrinsic to single-blind review and the LLMs' sensitivity to author-identifying information highlight the necessity of deliberately calibrating AI-based screening. Suggested remedies include training AI models on anonymized datasets and applying bias corrections to AI scores, both of which can contribute to the development of more equitable AI systems in academic settings.
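The bias-correction idea can be sketched as subtracting calibrated identity premiums from raw model scores so that debiased ratings are comparable across treatments. The premium constants and function below are assumptions for illustration; in practice they would come from a calibration regression on papers scored both with and without identity cues:

```python
# Assumed calibration estimates (illustrative, not from the paper):
ELITE_PREMIUM = 0.12  # average rating bump from showing an elite affiliation
MALE_PREMIUM = 0.10   # average rating bump from showing a prominent male author


def debias_score(raw_score: float, elite_inst: bool,
                 male_prominent: bool) -> float:
    """Remove the estimated rating premium attributable to author identity."""
    adjusted = raw_score
    if elite_inst:
        adjusted -= ELITE_PREMIUM
    if male_prominent:
        adjusted -= MALE_PREMIUM
    return adjusted


# The same underlying paper scored with and without identity cues should
# land at roughly the same debiased rating:
print(debias_score(7.22, elite_inst=True, male_prominent=True))    # ~7.0
print(debias_score(7.00, elite_inst=False, male_prominent=False))  # 7.0
```

Anonymized screening, which the paper also recommends, avoids the need for such corrections entirely by withholding the identity cues in the first place.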

The implications of this research are expansive, touching both practical and theoretical realms. Practically, LLMs present a solution to the long-standing inefficiencies within the peer review process by expediting preliminary evaluations. Theoretically, the paper contributes to an enriched understanding of bias dynamics within AI systems, offering a platform for future research to explore similar integrations of AI within academic fields beyond economics. Moreover, the results encourage academic journals to explore AI's implementation while maintaining a firm commitment to transparency and fairness to safeguard the integrity and inclusivity of scholarly publishing.