Reproducibility in NLP: What Have We Learned from the Checklist? (2306.09562v1)

Published 16 Jun 2023 in cs.CL

Abstract: Scientific progress in NLP rests on the reproducibility of researchers' claims. The *CL conferences created the NLP Reproducibility Checklist in 2020 to be completed by authors at submission to remind them of key information to include. We provide the first analysis of the Checklist by examining 10,405 anonymous responses to it. First, we find evidence of an increase in reporting of information on efficiency, validation performance, summary statistics, and hyperparameters after the Checklist's introduction. Further, we show acceptance rate grows for submissions with more Yes responses. We find that the 44% of submissions that gather new data are 5% less likely to be accepted than those that do not; the average reviewer-rated reproducibility of these submissions is also 2% lower relative to the rest. We find that only 46% of submissions claim to open-source their code, though submissions that do have an 8% higher reproducibility score relative to those that do not, the most for any item. We discuss what can be inferred about the state of reproducibility in NLP, and provide a set of recommendations for future conferences, including: a) allowing submitting code and appendices one week after the deadline, and b) measuring dataset reproducibility by a checklist of data collection practices.


Summary

  • The paper finds that higher checklist adherence is generally associated with higher acceptance rates, except when all items are marked 'YES', which may signal misreporting.
  • Detailed documentation of datasets and code substantially improves perceived reproducibility, as evidenced by an 8% higher score for submissions that open-source their code.
  • The study advocates for revised guidelines and delayed code submission deadlines to further improve research transparency and address gaps in checklist coverage.

Reproducibility in NLP: What Have We Learned from the Checklist?

Introduction

The paper "Reproducibility in NLP: What Have We Learned from the Checklist?" examines the impact of the NLP Reproducibility Checklist on improving transparency and reproducibility in NLP research. By analyzing response data from 10,405 checklist submissions at major NLP conferences, the authors explore trends in scientific reporting and acceptance rates. It emphasizes the vital role of reproducibility in scientific progress, establishing that the introduction of the checklist has influenced the quality and acceptance of submissions based on detailed and transparent reporting.

Analysis of Reproducibility Checklist Impact

The introduction of the NLP Reproducibility Checklist coincides with significant positive trends in reporting practices and acceptance rates. The correlation between the number of "YES" responses and acceptance rates indicates that submissions adhering to the Checklist's recommendations are more likely to be accepted (Figures 2 and 3). Submissions with complete documentation of datasets and code also tend to receive higher reviewer-rated reproducibility scores, which in turn are associated with higher acceptance rates (Figure 1). However, the paper also identifies a drop in acceptance rate when all checklist items are marked "YES," suggesting possible instances of inaccurate self-reporting. A minimal sketch of the binning analysis behind Figures 2 and 3 follows the figure captions below.

Figure 2: Submissions to EMNLP 2021 binned by count of YES responses to the NLP Reproducibility Checklist items. Papers with more YES responses are more likely to be accepted, except those with all YES responses, potentially indicating misreported information.

Figure 3: Acceptance rate among submissions binned by count of YES responses. The trend shows a positive correlation, highlighting the importance of detailed reporting.

Figure 1: Acceptance rates across quantiles for perceived reproducibility and overall recommendations, demonstrating a positive trend with acceptance.
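
As referenced above, the binning analysis behind Figures 2 and 3 can be outlined in a few lines of pandas. The snippet below is only a minimal sketch over an assumed schema (one row per submission, boolean checklist-item columns, and an accepted flag); the anonymous response data analyzed in the paper is not publicly released, so the file name and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical schema: one row per submission, boolean checklist-item columns
# named item_1 ... item_k, and a boolean `accepted` flag.
df = pd.read_csv("checklist_responses.csv")  # placeholder path, not a real release

item_cols = [c for c in df.columns if c.startswith("item_")]

# Count the YES responses for each submission.
df["yes_count"] = df[item_cols].sum(axis=1)

# Acceptance rate binned by YES count (cf. Figures 2 and 3).
acceptance_by_yes = (
    df.groupby("yes_count")["accepted"]
      .agg(rate="mean", n="size")
      .reset_index()
)
print(acceptance_by_yes)
```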

Data Collection and Open Sourcing Code

The paper finds that submissions involving new data collection have lower acceptance rates and perceived reproducibility scores. Although only 46% of submissions claim to open-source their code, doing so has a considerable effect: these submissions receive perceived reproducibility scores roughly 8% higher than those that do not. In light of the 5% lower acceptance rate for submissions that gather new data, the authors advocate enhanced guidelines on dataset documentation and dissemination practices to close identified gaps in checklist coverage.

Figure 4: Acceptance rates for submissions with various responses, showing that certain checklist items are not fully embraced as community norms yet.

Figure 5: Reviewer perceived reproducibility score for different responses, with LINKTOCODE associated with higher scores.

Code Availability and Impact

Although submissions that do not share code see only slightly reduced acceptance rates, code availability correlates strongly with perceived reproducibility scores. Comparisons with other venues, such as NeurIPS 2019, show similar rates of code availability at submission time, with potential discrepancies attributable to self-reporting inaccuracies. A sketch of how such a grouped comparison could be computed follows the Figure 6 caption below.

Figure 6: Efficiency response patterns when source code is unavailable, highlighting a gap in reporting efficiency measures.
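
As a rough companion to Figures 5 and 6, the grouped comparison described above could be computed along the following lines. This is again a sketch over an assumed table rather than the authors' actual pipeline: code_response (with values such as YES, NO, or LINKTOCODE), the numeric reviewer reproducibility score, and the accepted flag are all hypothetical column names.

```python
import pandas as pd

df = pd.read_csv("checklist_responses.csv")  # same hypothetical table as above

# Mean reviewer-rated reproducibility and acceptance rate, grouped by the
# response to the code-availability checklist item.
by_code = df.groupby("code_response").agg(
    mean_reproducibility=("reproducibility", "mean"),
    acceptance_rate=("accepted", "mean"),
    n=("accepted", "size"),
)
print(by_code)

# Relative reproducibility gap between code-sharing and non-sharing submissions
# (the paper reports a gap of roughly 8%).
shared = by_code.loc["YES", "mean_reproducibility"]
not_shared = by_code.loc["NO", "mean_reproducibility"]
print(f"relative gap: {(shared - not_shared) / not_shared:.1%}")
```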

Recommendations for Future Improvements

The authors propose several strategies to further strengthen reproducibility practices. Allowing code and appendices to be submitted up to one week after the paper deadline would let conferences encourage complete documentation without penalizing authors whose artifacts are not ready at submission time. They also recommend making checklist responses publicly accessible, which would create accountability and help direct reviewers to the critical sections of each paper.

Conclusion

The analysis of the NLP Reproducibility Checklist demonstrates a significant impact on improving research reporting practices and perceived reproducibility within the NLP community. To sustain and enhance these improvements, the paper suggests increased incentives for sharing code and datasets, along with revisions to checklist items to reflect evolving community standards. Future developments should consider incorporating detailed documentation practices for data-driven NLP projects and adapting checklist structures to maintain their relevance as research practices evolve.
