SciOps: Achieving Productivity and Reliability in Data-Intensive Research (2401.00077v2)
Abstract: Scientists are increasingly leveraging advances in instruments, automation, and collaborative tools to scale up their experiments and research goals, leading to new bursts of discovery. Various scientific disciplines, including neuroscience, have adopted key technologies to enhance collaboration, reproducibility, and automation. Drawing inspiration from advancements in the software industry, we present a roadmap to enhance the reliability and scalability of scientific operations for diverse research teams tackling large and complex projects. We introduce a five-level Capability Maturity Model describing the principles of rigorous scientific operations in projects ranging from small-scale exploratory studies to large-scale, multidisciplinary research endeavors. Achieving higher levels of operational maturity requires adopting new, technology-enabled methodologies, which we refer to as SciOps, a concept derived from the DevOps methodologies that have revolutionized the software industry. SciOps involves digital research environments that seamlessly integrate computational, automation, and AI-driven efforts throughout the research cycle, from experimental design and data collection to analysis and dissemination, ultimately leading to closed-loop discovery. This maturity model offers a framework for assessing and improving operational practices in multidisciplinary research teams, guiding them towards greater efficiency and effectiveness in scientific inquiry.