CASIMIR: A Corpus of Scientific Articles enhanced with Multiple Author-Integrated Revisions (2403.00241v2)
Abstract: Writing a scientific article is challenging because the genre is highly codified and specific; proficiency in written communication is therefore essential for effectively conveying research findings and ideas. In this article, we propose an original textual resource on the revision step of the writing process of scientific articles. This new dataset, called CASIMIR, contains the multiple revised versions of 15,646 scientific articles from OpenReview, along with their peer reviews. Pairs of consecutive versions of an article are aligned at the sentence level, and paragraph location information is kept as metadata to support future revision studies at the discourse level. Each pair of revised sentences is enriched with automatically extracted edits and their associated revision intentions. To assess the initial quality of the dataset, we conducted a qualitative study of several state-of-the-art text revision approaches and compared various evaluation metrics. Our experiments led us to question the relevance of current evaluation methods for the text revision task.
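To make the idea of "automatically extracted edits" between an aligned sentence pair concrete, here is a minimal sketch in Python. It is not the extraction method used to build CASIMIR, which the abstract does not describe. It assumes whitespace tokenization and uses the standard-library `difflib.SequenceMatcher` as a stand-in; the function name `extract_edits` and the example sentences are hypothetical.

```python
import difflib


def extract_edits(source: str, revised: str):
    """Return word-level edits between two aligned sentences.

    Each edit is a tuple (operation, old_words, new_words), where operation
    is one of "replace", "delete", or "insert".
    Assumption: tokens are obtained by naive whitespace splitting.
    """
    src_tokens = source.split()
    rev_tokens = revised.split()
    matcher = difflib.SequenceMatcher(a=src_tokens, b=rev_tokens, autojunk=False)

    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged spans are not edits
        edits.append((tag, " ".join(src_tokens[i1:i2]), " ".join(rev_tokens[j1:j2])))
    return edits


# Hypothetical pair of consecutive sentence versions.
old_version = "We propose a new dataset on revision ."
new_version = "We propose an original dataset on the revision step ."
for edit in extract_edits(old_version, new_version):
    print(edit)
```

Each extracted edit could then be passed to a separate classifier that assigns a revision intention; that step is outside the scope of this sketch.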
- Leane Jourdan
- Florian Boudin
- Nicolas Hernandez
- Richard Dufour