Reproducibility in NLP: What Have We Learned from the Checklist? (2306.09562v1)

Published 16 Jun 2023 in cs.CL

Abstract: Scientific progress in NLP rests on the reproducibility of researchers' claims. The *CL conferences created the NLP Reproducibility Checklist in 2020 to be completed by authors at submission to remind them of key information to include. We provide the first analysis of the Checklist by examining 10,405 anonymous responses to it. First, we find evidence of an increase in reporting of information on efficiency, validation performance, summary statistics, and hyperparameters after the Checklist's introduction. Further, we show acceptance rate grows for submissions with more Yes responses. We find that the 44% of submissions that gather new data are 5% less likely to be accepted than those that do not; the average reviewer-rated reproducibility of these submissions is also 2% lower relative to the rest. We find that only 46% of submissions claim to open-source their code, though submissions that do have an 8% higher reproducibility score relative to those that do not, the most for any item. We discuss what can be inferred about the state of reproducibility in NLP, and provide a set of recommendations for future conferences, including: a) allowing submitting code and appendices one week after the deadline, and b) measuring dataset reproducibility by a checklist of data collection practices.


Summary

  • The paper finds that higher checklist adherence is generally associated with higher acceptance rates, except when all items are marked 'YES', which may signal misreporting.
  • Detailed documentation of datasets and code substantially improves perceived reproducibility, as evidenced by an 8% higher score for submissions that open-source their code.
  • The study advocates for revised guidelines and delayed code submission deadlines to further improve research transparency and address gaps in checklist coverage.

Reproducibility in NLP: What Have We Learned from the Checklist?

Introduction

The paper "Reproducibility in NLP: What Have We Learned from the Checklist?" examines the impact of the NLP Reproducibility Checklist on improving transparency and reproducibility in NLP research. By analyzing response data from 10,405 checklist submissions at major NLP conferences, the authors explore trends in scientific reporting and acceptance rates. It emphasizes the vital role of reproducibility in scientific progress, establishing that the introduction of the checklist has influenced the quality and acceptance of submissions based on detailed and transparent reporting.

Analysis of Reproducibility Checklist Impact

The introduction of the NLP Reproducibility Checklist coincides with significant positive trends in reporting practices and acceptance rates. The correlation between the number of "YES" responses and acceptance rates indicates that submissions adhering to the Checklist's recommendations are more likely to be accepted (Figures 2 and 3). Submissions with complete documentation of datasets and code also tend to receive higher reviewer-rated reproducibility scores, which in turn are associated with higher acceptance rates (Figure 1). However, the paper also identifies a drop in acceptance rate when all checklist items are marked "YES," suggesting possible instances of inaccurate self-reporting. A minimal sketch of the binning analysis behind Figures 2 and 3 follows the figure captions below.

Figure 2: Submissions to EMNLP 2021 binned by count of YES responses to the NLP Reproducibility Checklist items. Papers with more YES responses are more likely to be accepted, except those with all YES responses, potentially indicating misreported information.

Figure 3: Acceptance rate among submissions binned by count of YES responses. The trend shows a positive correlation, highlighting the importance of detailed reporting.

Figure 1: Acceptance rates across quantiles for perceived reproducibility and overall recommendations, demonstrating a positive trend with acceptance.
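
As referenced above, the binning analysis behind Figures 2 and 3 can be outlined in a few lines of pandas. The snippet below is only a minimal sketch over an assumed schema (one row per submission, boolean checklist-item columns, and an accepted flag); the anonymous response data analyzed in the paper is not publicly released, so the file name and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical schema: one row per submission, boolean checklist-item columns
# named item_1 ... item_k, and a boolean `accepted` flag.
df = pd.read_csv("checklist_responses.csv")  # placeholder path, not a real release

item_cols = [c for c in df.columns if c.startswith("item_")]

# Count the YES responses for each submission.
df["yes_count"] = df[item_cols].sum(axis=1)

# Acceptance rate binned by YES count (cf. Figures 2 and 3).
acceptance_by_yes = (
    df.groupby("yes_count")["accepted"]
      .agg(rate="mean", n="size")
      .reset_index()
)
print(acceptance_by_yes)
```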

Data Collection and Open Sourcing Code

The paper finds that submissions involving new data collection have lower acceptance rates and perceived reproducibility scores. Although only 46% of submissions claim to open-source their code, doing so has a considerable effect: these submissions receive perceived reproducibility scores roughly 8% higher than those that do not. In light of the 5% lower acceptance rate for submissions that gather new data, the authors advocate enhanced guidelines on dataset documentation and dissemination practices to close identified gaps in checklist coverage.

Figure 4: Acceptance rates for submissions with various responses, showing that certain checklist items are not fully embraced as community norms yet.

Figure 5: Reviewer perceived reproducibility score for different responses, with LINKTOCODE associated with higher scores.

Code Availability and Impact

Although submissions that do not share code see only slightly reduced acceptance rates, code availability correlates strongly with perceived reproducibility scores. Comparisons with other venues, such as NeurIPS 2019, show similar rates of code availability at submission time, with potential discrepancies attributable to self-reporting inaccuracies. A sketch of how such a grouped comparison could be computed follows the Figure 6 caption below.

Figure 6: Efficiency response patterns when source code is unavailable, highlighting a gap in reporting efficiency measures.
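
As a rough companion to Figures 5 and 6, the grouped comparison described above could be computed along the following lines. This is again a sketch over an assumed table rather than the authors' actual pipeline: code_response (with values such as YES, NO, or LINKTOCODE), the numeric reviewer reproducibility score, and the accepted flag are all hypothetical column names.

```python
import pandas as pd

df = pd.read_csv("checklist_responses.csv")  # same hypothetical table as above

# Mean reviewer-rated reproducibility and acceptance rate, grouped by the
# response to the code-availability checklist item.
by_code = df.groupby("code_response").agg(
    mean_reproducibility=("reproducibility", "mean"),
    acceptance_rate=("accepted", "mean"),
    n=("accepted", "size"),
)
print(by_code)

# Relative reproducibility gap between code-sharing and non-sharing submissions
# (the paper reports a gap of roughly 8%).
shared = by_code.loc["YES", "mean_reproducibility"]
not_shared = by_code.loc["NO", "mean_reproducibility"]
print(f"relative gap: {(shared - not_shared) / not_shared:.1%}")
```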

Recommendations for Future Improvements

The authors propose several strategies to further strengthen reproducibility practices. Allowing code and appendices to be submitted up to one week after the paper deadline would let conferences encourage complete documentation without penalizing authors whose artifacts are not ready at submission time. They also recommend making checklist responses publicly accessible, which would create accountability and help direct reviewers to the critical sections of each paper.

Conclusion

The analysis of the NLP Reproducibility Checklist demonstrates a significant impact on improving research reporting practices and perceived reproducibility within the NLP community. To sustain and enhance these improvements, the paper suggests increased incentives for sharing code and datasets, along with revisions to checklist items to reflect evolving community standards. Future developments should consider incorporating detailed documentation practices for data-driven NLP projects and adapting checklist structures to maintain their relevance as research practices evolve.
