Does the Use of Unusual Combinations of Datasets Contribute to Greater Scientific Impact? (2402.05024v4)
Abstract: Scientific datasets play a crucial role in contemporary data-driven research, as they allow for the progress of science by facilitating the discovery of new patterns and phenomena. This mounting demand for empirical research raises important questions about how strategic data utilization in research projects can stimulate scientific advancement. In this study, we examine a hypothesis inspired by recombination theory, which suggests that innovative combinations of existing knowledge, including the use of unusual combinations of datasets, can lead to high-impact discoveries. Focusing on social science, we investigate the scientific outcomes of such atypical data combinations in more than 30,000 publications that leverage over 5,000 datasets curated within one of the largest social science databases, ICPSR. This study offers four important insights. First, combining datasets, particularly those infrequently paired, significantly contributes to both scientific and broader impacts (e.g., dissemination to the general public). Second, infrequently paired datasets maintain a strong association with citation counts even after controlling for the atypicality of dataset topics. In contrast, the atypicality of dataset topics has a much smaller positive association with citation counts. Third, smaller and less experienced research teams tend to use atypical combinations of datasets more frequently than their larger and more experienced counterparts. Lastly, despite the benefits of data combination, papers that combine datasets remain infrequent. This finding suggests that the unconventional combination of datasets is an under-utilized but powerful strategy correlated with the scientific impact and broader dissemination of scientific discoveries.