Don't mention it: An approach to assess challenges to using software mentions for citation and discoverability research (2402.14602v1)
Abstract: Datasets that collect software mentions from scholarly publications can potentially be used to study which software has been used in published research, and to study the practice of software citation. New software mention datasets with different characteristics have recently been published. We present an approach to assess the usability of such datasets for research on research software. Our approach comprises sampling and data preparation, manual annotation for quality and mention characteristics, and annotation analysis. We applied it to two software mention datasets for an evaluation based on qualitative observation. In doing so, we identified challenges to using the selected datasets for research. The main issues concern the structure of the dataset, the quality of the extracted mentions (54% and 23% of mentions, respectively, do not refer to software), and software accessibility. While one dataset does not provide links to mentioned software at all, the other does so in a way that can impede quantitative research: (1) Links may come from different sources, and each may point to different software for the same mention. (2) The quality of the automatically retrieved links is generally poor (in our sample, 65.4% link to the wrong software). (3) Links exist only for a small subset of mentions (in our sample, 20.5%), which may lead to skewed or disproportionate samples. However, the greatest challenge and underlying issue in working with software mention datasets is the still suboptimal practice of software citation: software should not be mentioned, it should be cited following the software citation principles.
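The annotation analysis step reports three kinds of proportions: mentions that do not refer to software, mentions that carry a link, and links that point to the wrong software. A minimal sketch of how such figures could be computed from a manually annotated sample is given below. The record fields (`is_software`, `link`, `link_correct`), the helper names, and the choice of denominators are assumptions made for illustration, not the authors' actual schema or notebooks, and the four sample records are invented toy data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedMention:
    """One manually annotated mention (hypothetical schema)."""
    text: str
    is_software: bool                    # annotator judgment: does the mention refer to software?
    link: Optional[str] = None           # link supplied by the dataset, if any
    link_correct: Optional[bool] = None  # annotator judgment on the link, if one exists


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal (0.0 for an empty whole)."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def summarize(sample: list[AnnotatedMention]) -> dict[str, float]:
    """Compute the three proportions for an annotated sample."""
    total = len(sample)
    not_software = sum(1 for m in sample if not m.is_software)
    linked = [m for m in sample if m.link is not None]
    wrong_links = sum(1 for m in linked if m.link_correct is False)
    return {
        # Assumed denominator: all sampled mentions.
        "not_software_pct": percentage(not_software, total),
        # Assumed denominator: all sampled mentions.
        "linked_pct": percentage(len(linked), total),
        # Assumed denominator: only mentions that have a link.
        "wrong_link_pct": percentage(wrong_links, len(linked)),
    }


# Invented toy sample, not data from either dataset.
sample = [
    AnnotatedMention("NumPy", True, "https://pypi.org/project/numpy/", True),
    AnnotatedMention("SPSS", True),
    AnnotatedMention("ANOVA", False),
    AnnotatedMention("Fig", False, "https://example.org/fig", False),
]

summary = summarize(sample)
print(summary)
```

Keeping the denominators explicit matters here: a share of wrong links computed over linked mentions only is not comparable to one computed over all mentions.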