
Dataversifying Natural Sciences: Pioneering a Data Lake Architecture for Curated Data-Centric Experiments in Life & Earth Sciences (2403.20063v1)

Published 29 Mar 2024 in cs.DB

Abstract: This vision paper introduces a pioneering data lake architecture designed to meet Life & Earth sciences' burgeoning data management needs. As the data landscape evolves, the imperative to navigate and maximize scientific opportunities has never been greater. Our vision paper outlines a strategic approach to unify and integrate diverse datasets, aiming to cultivate a collaborative space conducive to scientific discovery. The core of the design and construction of a data lake is the development of formal and semi-automatic tools, enabling the meticulous curation of quantitative and qualitative data from experiments. Our unique "research-in-the-loop" methodology ensures that scientists across various disciplines are integrally involved in the curation process, combining automated, mathematical, and manual tasks to address complex problems, from seismic detection to biodiversity studies. By fostering reproducibility and applicability of research, our approach enhances the integrity and impact of scientific experiments. This initiative is set to improve data management practices, strengthening the capacity of Life & Earth sciences to solve some of our time's most critical environmental and biological challenges.

Authors (9)
  1. Genoveva Vargas-Solar
  2. Jérôme Darmont
  3. Alejandro Adorjan
  4. Javier A. Espinosa-Oviedo
  5. Carmem Hara
  6. Sabine Loudcher
  7. Regina Motz
  8. Martin Musicante
  9. José-Luis Zechinelli-Martini
