
Don't mention it: An approach to assess challenges to using software mentions for citation and discoverability research (2402.14602v1)

Published 22 Feb 2024 in cs.SE

Abstract: Datasets that collect software mentions from scholarly publications can potentially be used for research into the software used in published research, as well as into the practice of software citation. Recently, new software mention datasets with different characteristics have been published. We present an approach to assess the usability of such datasets for research on research software. Our approach includes sampling and data preparation, manual annotation for quality and mention characteristics, and annotation analysis. We applied it to two software mention datasets for evaluation based on qualitative observation. In doing so, we identified challenges to using the selected datasets for research. The main issues concern the structure of the dataset, the quality of the extracted mentions (54% and 23% of mentions, respectively, do not refer to software), and software accessibility. While one dataset does not provide links to mentioned software at all, the other does so in a way that can impede quantitative research endeavors: (1) Links may come from different sources, and each may point to different software for the same mention. (2) The quality of the automatically retrieved links is generally poor (in our sample, 65.4% link to the wrong software). (3) Links exist only for a small subset of mentions (in our sample, 20.5%), which may lead to skewed or disproportionate samples. However, the greatest challenge and underlying issue in working with software mention datasets is the still suboptimal practice of software citation: Software should not be mentioned; it should be cited following the software citation principles.
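
The percentages quoted in the abstract are shares computed over a manually annotated sample. The following is a minimal sketch of that kind of calculation, not the authors' notebooks: the record fields (`is_software`, `link`, `link_correct`), the function names, and the toy records are all hypothetical and only illustrate how sampling and annotation analysis could fit together.

```python
import random


def draw_sample(mentions, k, seed=42):
    """Draw a reproducible simple random sample of at most k mentions."""
    rng = random.Random(seed)
    return rng.sample(mentions, min(k, len(mentions)))


def percent(part, whole):
    """Return part/whole as a percentage, or 0.0 for an empty denominator."""
    return 100.0 * part / whole if whole else 0.0


def summarize(annotated):
    """Summarize manual annotations of a sample of extracted mentions.

    Each record is a dict with hypothetical keys:
      is_software  - annotator judged the mention to refer to software
      link         - link supplied by the dataset, or None
      link_correct - annotator judged the link to point to the right software
    """
    total = len(annotated)
    not_software = sum(1 for m in annotated if not m["is_software"])
    linked = [m for m in annotated if m["link"] is not None]
    wrong = sum(1 for m in linked if not m["link_correct"])
    return {
        "pct_not_software": percent(not_software, total),
        "pct_with_link": percent(len(linked), total),
        # Share of wrong links is taken over linked mentions only.
        "pct_wrong_link": percent(wrong, len(linked)),
    }


# Toy annotated records (invented for illustration, not data from the paper).
toy = [
    {"mention": "NumPy", "is_software": True,
     "link": "https://pypi.org/project/numpy/", "link_correct": True},
    {"mention": "SPSS", "is_software": True, "link": None, "link_correct": None},
    {"mention": "Table 2", "is_software": False, "link": None, "link_correct": None},
    {"mention": "R", "is_software": True,
     "link": "https://example.org/unrelated", "link_correct": False},
]

summary = summarize(toy)
print(summary)
```

Fixing the random seed keeps the sample reproducible, so that annotations made by hand can be matched back to the same records later.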
