From Model Performance to Claim: How a Change of Focus in Machine Learning Replicability Can Help Bridge the Responsibility Gap (2404.13131v1)
Abstract: Two goals, improving the replicability and the accountability of Machine Learning research, have attracted much attention from the AI ethics and Machine Learning communities. Although both rely on measures that improve transparency, the two goals are discussed in different registers: replicability in the register of scientific reasoning, accountability in the register of ethical reasoning. Given the existing challenge of the Responsibility Gap, that is, the difficulty of holding Machine Learning scientists accountable for Machine Learning harms because they are far removed from the sites of application, this paper posits that reconceptualizing replicability can help bridge the gap. Through a shift from model performance replicability to claim replicability, Machine Learning scientists can be held accountable for producing non-replicable claims that are prone to eliciting harm through misuse and misinterpretation. In this paper, I make the following contributions. First, I define and distinguish two forms of replicability for ML research that can aid constructive conversations around replicability. Second, I formulate an argument for claim replicability's advantage over model performance replicability in justifying the assignment of accountability to Machine Learning scientists for producing non-replicable claims, and I show how it enacts a sense of responsibility that is actionable. In addition, I characterize the implementation of claim replicability as more of a social project than a technical one by discussing its competing epistemological principles and its practical implications for Circulating Reference, Interpretative Labor, and research communication.