Critical Evaluation of "Peer Reviews of Peer Reviews: A Randomized Controlled Trial and Other Experiments"
This paper investigates the evaluation of peer review quality, addressing significant practical questions within the academic community. Motivated by the goals of incentivizing high-quality reviewing and of assessing interventions in the peer-review process, the paper demonstrates the complexities involved in evaluating the quality of peer reviews themselves. The authors conducted a large-scale study involving over 7,000 participants at the NeurIPS 2022 conference, rigorously analyzing review-quality assessments from multiple perspectives and generating several insightful findings.
Key Findings
- Uselessly Elongated Review Bias: Through a randomized controlled trial, the researchers demonstrated a positive bias toward longer reviews even when the additional content adds no substantive value. Specifically, the artificially elongated reviews received mean scores nearly 0.5 points higher on a 7-point scale than their original counterparts, with a significant effect size (τ = 0.64, p < 0.0001). This suggests that evaluators may perceive longer reviews as more thorough without weighing whether the added content is informative (a sketch of how such a paired comparison could be analyzed appears after this list).
- Author-Outcome Bias: Analysis of the observational data indicated that authors rated reviews of their own papers more favorably when the review recommended acceptance: the mean score for "accept" reviews was 1.4 points higher than for "reject" reviews. This bias was statistically significant across all evaluation criteria, implying that author evaluations may not be objective when the review's recommendation directly affects the author's own outcome.
- Inter-evaluator (Dis)agreement: The paper measured disagreement rates between different evaluators of the same reviews and found them to range between 28% and 32%, comparable to the disagreement rates observed in peer reviews of papers themselves. This highlights consistency problems in reviewing peer reviews that mirror those encountered in paper evaluation; one way such disagreement could be quantified is sketched after this list.
- Miscalibration: The evaluation process exhibited miscalibration among evaluators, similar to the issues found in paper reviews. Evaluators showed idiosyncratic biases in how they used the rating scale, introducing variance in scoring that must be addressed to make review-quality assessments more reliable; the sketch after this list also illustrates estimating simple per-evaluator offsets.
- Subjectivity: Subjectivity was analyzed by examining how individual criteria map to overall scores, revealing variability comparable to that seen in paper review assessments. The paper shows that subjective differences in evaluator opinion persist when judging review quality, underscoring the need for standardized evaluation criteria.
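To make the elongation finding concrete, the following is a minimal sketch of how a paired comparison of original versus artificially elongated review scores could be analyzed. The data here are simulated, and the Wilcoxon signed-rank test is an assumed choice of statistic, not necessarily the analysis the authors performed.

```python
# Sketch: paired analysis of the "uselessly elongated review" experiment.
# The scores are simulated; this is an illustrative reconstruction, not the
# authors' actual data or analysis code.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated 7-point scores for matched original/elongated review pairs.
original = rng.integers(3, 7, size=200).astype(float)
elongated = np.clip(original + rng.choice([0, 0, 1, 1], size=200), 1, 7)

diff = elongated - original
print(f"Mean score increase for elongated reviews: {diff.mean():.2f}")

# Wilcoxon signed-rank test on the paired scores (an assumed test choice).
stat, p_value = stats.wilcoxon(elongated, original)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4g}")
```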
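The disagreement and miscalibration findings can be illustrated in a similar spirit. In the sketch below, the data layout, the pairwise-disagreement threshold, and the per-evaluator offset are all assumptions made for illustration; they are not the paper's definitions.

```python
# Sketch: quantifying inter-evaluator disagreement and per-evaluator
# miscalibration. Hypothetical input: {review_id: {evaluator_id: score}}
# for reviews that received two or more independent evaluations.
from collections import defaultdict
from itertools import combinations

scores = {
    "r1": {"e1": 5, "e2": 3, "e3": 5},
    "r2": {"e1": 6, "e4": 6},
    "r3": {"e2": 2, "e3": 4},
}

# Pairwise disagreement: fraction of evaluator pairs on the same review whose
# scores differ by more than a threshold (the threshold of 1 is an assumption).
pairs = disagreements = 0
for by_evaluator in scores.values():
    for (_, score_a), (_, score_b) in combinations(by_evaluator.items(), 2):
        pairs += 1
        disagreements += abs(score_a - score_b) > 1
print(f"Disagreement rate: {disagreements / pairs:.0%}")

# Per-evaluator offset: mean deviation from the per-review average, a crude
# proxy for the idiosyncratic leniency or harshness the paper describes.
offsets = defaultdict(list)
for by_evaluator in scores.values():
    review_mean = sum(by_evaluator.values()) / len(by_evaluator)
    for evaluator, score in by_evaluator.items():
        offsets[evaluator].append(score - review_mean)
for evaluator, deviations in sorted(offsets.items()):
    print(f"{evaluator}: mean offset {sum(deviations) / len(deviations):+.2f}")
```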
Implications and Future Directions
The implications of this paper are multifaceted. The demonstrated biases and inconsistencies call for improved methodologies for evaluating peer reviews. There is a clear need for automated or semi-automated mechanisms to support objective quality assessments. Exploring machine learning models or natural language processing tools could offer ways to evaluate reviews while minimizing subjectivity and human bias.
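As one illustration of what such tooling might look like, the sketch below fits a simple text-based regressor for review-quality scores using TF-IDF features and ridge regression in scikit-learn. The data, features, and model choice are placeholders; a real system would itself need to be audited for the length and outcome biases the paper documents in human evaluators.

```python
# Sketch: a simple NLP baseline for predicting review-quality scores.
# The data, features, and model are placeholders intended only to show the
# general shape of such a tool, not a validated or de-biased evaluator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical training data: review texts with human quality scores (1-7).
review_texts = [
    "The paper is clearly written and the experiments support the claims.",
    "Reject. Not novel.",
    "The analysis in Section 3 omits the bounded-variance assumption.",
]
quality_scores = [5, 2, 6]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    Ridge(alpha=1.0),
)
model.fit(review_texts, quality_scores)

# Predict a quality score for a new, unseen review.
print(model.predict(["Interesting idea, but the evaluation lacks baselines."]))
```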
Moreover, for policy interventions and incentive designs to be fair and effective, addressing these biases is crucial. For example, journal and conference policies that recognize top reviewers need to incorporate mechanisms that mitigate the bias introduced by review length and author-outcome preferences.
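One conceivable adjustment, sketched below purely as an assumption rather than a mechanism proposed in the paper, is to residualize evaluation scores against review length before ranking reviewers for recognition, so that a review is credited for quality beyond what its length alone predicts.

```python
# Sketch: length-adjusted ranking of reviews for "top reviewer" recognition.
# The data and the linear adjustment are illustrative assumptions only.
import numpy as np

# Hypothetical (length in words, mean evaluation score) pairs for one venue.
lengths = np.array([250.0, 400.0, 800.0, 1200.0, 600.0, 950.0])
scores = np.array([4.0, 4.5, 5.5, 6.0, 5.0, 5.5])

# Fit a linear trend of score on length and keep only the residual.
slope, intercept = np.polyfit(lengths, scores, deg=1)
adjusted = scores - (slope * lengths + intercept)

# Rank reviews by the length-adjusted score instead of the raw score.
ranking = np.argsort(-adjusted)
print("Length-adjusted ranking (review indices):", ranking)
```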
The reliance on human evaluations as a "gold standard" in peer-review experiments also warrants reevaluation. The biases identified here may skew the perceived efficacy of novel peer-review methodologies, such as the use of LLMs in reviewing tasks, potentially leading to inaccurate conclusions about their utility.
This paper provides critical insights that could propel efforts toward refining peer review processes in academia. Future research should focus on quantifying the impact of suggested improvements and exploring alternative methods of review quality assessment that enhance the integrity and reliability of peer review systems in scholarly publishing.