
Computational Reproducibility in Computational Social Science

Published 4 Jul 2023 in cs.CY (arXiv:2307.01918v4)

Abstract: Replication crises have shaken the scientific landscape during the last decade. As potential solutions, open science practices have been heavily discussed and implemented with varying success across disciplines. We argue that computational-X disciplines, such as computational social science, are also susceptible to the symptoms of these crises, but in terms of reproducibility. We expand the binary definition of reproducibility into a tier system that allows increasing levels of reproducibility based on external verifiability, to counteract the practice of open-washing. We provide solutions to barriers in computational social science that hinder researchers from attaining the highest level of reproducibility, including the use of alternative data sources and considering reproducibility proactively.


Summary

  • The paper introduces a tiered system categorizing computational reproducibility levels to provide clearer verification standards and counteract open-washing.
  • The paper details practical measures such as employing Docker virtualization, open-source software, and data donations to mitigate economic and access barriers.
  • The paper highlights challenges from external dependencies, including APIs and proprietary tools, which complicate maintaining reproducible research outcomes.

Introduction

The paper "Computational Reproducibility in Computational Social Science" delineates the challenges associated with ensuring reproducibility in computational social science (CSS). It critiques the binary conception of computational reproducibility and introduces a nuanced tier system that categorizes different levels of reproducibility based on external verifiability. This classification is proposed in response to the practices termed as "open-washing," where research may deceptively appear reproducible. Through this tier system, the paper addresses barriers inhibiting the achievement of high reproducibility standards, proposing both practical and precautionary measures to ameliorate these obstacles.

Computational Reproducibility: Definitions and Challenges

Reproducibility in computational contexts is distinguished from replicability: reproducibility hinges on rerunning the original analysis with the original data and the original computational environment, whereas replicability involves obtaining consistent results with new data. Traditional, binary definitions fail to capture two dimensions critical to computational reproducibility: the agent who can perform the reproducibility check, and the computational environment in which the check is run. As a baseline, the paper defines first-order computational reproducibility (1° CR), in which verification can be performed by the original investigators or by third parties under specified conditions.

Tier System of Computational Reproducibility

The paper introduces a tiered system that categorizes reproducibility into three orders. First-order (1° CR) assumes verification by the authors or designated third parties with access to non-restrictive materials. Second-order (2° CR) involves verification by trusted third-party agents, often necessary when dealing with sensitive data. Third-order (3° CR) extends to a general reproducibility standard accessible to all under suitable conditions. The tier system is designed to replace the ambiguity and vague assurances around reproducibility in scholarly publications with verifiable claims.

Figure 1: A declarative description of a computational environment (Dockerfile) generated by rang.
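
As an illustration of the kind of declarative environment description shown in Figure 1, here is a minimal sketch using the rang R package's documented resolve/dockerize workflow; the package choice and snapshot date are hypothetical placeholders, not taken from the paper.

```r
# Sketch: declare a computational environment and emit a Dockerfile with rang.
# The package ("quanteda") and snapshot date below are illustrative only.
library(rang)

# Resolve the full dependency graph of the analysis packages as they
# existed on CRAN at the chosen snapshot date.
graph <- resolve(pkgs = "quanteda", snapshot_date = "2023-01-01")

# Write a Dockerfile (and supporting installation material) that can
# reconstruct this environment, so the analysis can be rerun later.
dockerize(graph, output_dir = "docker")
```

Building the resulting image freezes both R and the package versions, which is what makes the environment externally verifiable rather than merely described in prose.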

Barriers to Reproducibility: External Dependencies and Opacity

The paper identifies external dependencies, particularly on APIs and proprietary systems, as significant barriers to reproducibility. Dependencies on opaque services such as the Twitter API or ChatGPT are problematic because these systems evolve without transparency. Sudden changes in accessibility, such as the alterations to the Twitter API's access policies, can render prior research irreproducible. Reliance on proprietary software additionally imposes economic burdens on researchers pursuing reproducibility, so that restrictions on access to materials translate directly into computational obstacles.
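
One way to limit such dependencies is to record API responses at collection time so that later runs replay the stored exchange instead of contacting the live service. Below is a minimal sketch using the vcr R package, which exists for exactly this purpose; the endpoint and cassette name are illustrative assumptions, not taken from the paper.

```r
# Sketch: record an HTTP exchange once, replay it on all later runs.
# The endpoint (api.example.org) and cassette name are illustrative.
library(vcr)
library(crul)

# Store recorded request/response pairs ("cassettes") with the project.
vcr_configure(dir = "fixtures")

# First run: hits the network and records the exchange to disk.
# Later runs: replay the cassette, with no live API dependency.
use_cassette("example_api_call", {
  resp <- crul::HttpClient$new("https://api.example.org")$get("status")
})
```

Archiving such cassettes alongside the code turns a volatile external dependency into a fixed, shareable research artifact.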

Practical Solutions and Recommendations

To counteract these barriers, the paper suggests alternatives such as data donations and industry-research collaborations to secure free access to data. It recommends open-source software in place of proprietary systems to ease the economic constraints on reproducibility, and it emphasizes archiving data outputs as well as educational initiatives that improve coding practices. Finally, it proposes embracing virtualization systems such as Docker to encapsulate complete computational environments as a comprehensive reproducibility strategy (Figure 2).

Figure 2: Recommended practices to achieve the maximum degree of computational reproducibility in different scenarios.

Conclusion

The paper argues that ensuring reproducibility in CSS requires a fundamental rethinking of data sharing and computational practices. By adopting the proposed tier system, researchers can communicate the reproducibility of their work more precisely and counteract open-washing. The paper further advocates incentivizing proactive reproducibility strategies, such as the standardized sharing of computational resources, as norms of research practice. Such structural changes are essential to fostering a reproducibility-centered scientific culture.
