
ProvDeploy: Provenance-oriented Containerization of High Performance Computing Scientific Workflows (2403.15324v2)

Published 22 Mar 2024 in cs.DC and cs.DB

Abstract: Many existing scientific workflows require High Performance Computing (HPC) environments to produce results in a timely manner. These workflows depend on several software libraries and run in different environments, which makes deploying and executing the software stack nontrivial. The complexity increases when the user needs to add provenance data capture services to the workflow. This manuscript introduces ProvDeploy, which assists the user in configuring containers for scientific workflows with integrated provenance data capture. ProvDeploy was evaluated with a Scientific Machine Learning workflow, exploring provenance-focused containerization strategies in two distinct HPC environments.
