High Throughput Training of Deep Surrogates from Large Ensemble Runs (2309.16743v1)

Published 28 Sep 2023 in cs.LG, cs.AI, and cs.DC

Abstract: Recent years have seen a surge in deep learning approaches to accelerate numerical solvers, which provide faithful but computationally intensive simulations of the physical world. These deep surrogates are generally trained in a supervised manner from limited amounts of data slowly generated by the same solver they intend to accelerate. We propose an open-source framework that enables the online training of these models from a large ensemble run of simulations. It leverages multiple levels of parallelism to generate rich datasets. The framework avoids I/O bottlenecks and storage issues by directly streaming the generated data. A training reservoir mitigates the inherent bias of streaming while maximizing GPU throughput. Experiments on training a fully connected network as a surrogate for the heat equation show that the proposed approach enables training on 8 TB of data in 2 hours, improving accuracy by 47% and increasing batch throughput by a factor of 13 compared to a traditional offline procedure.
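
The training reservoir is the piece that reconciles streaming with stochastic optimization: simulation members push states as they are produced, while the trainer draws randomized mini-batches from a bounded buffer so the GPU never idles on I/O. As a rough illustration only, the sketch below implements such a buffer with classic reservoir sampling (Algorithm R); the class name `TrainingReservoir`, the capacity, and the toy data stream are assumptions made for this example, not the paper's actual implementation.

```python
import random
import numpy as np


class TrainingReservoir:
    """Bounded buffer fed by a stream of simulation samples.

    Classic reservoir sampling (Algorithm R): after n pushes, every
    sample seen so far sits in the buffer with probability capacity/n,
    which counters the temporal bias of a first-in-first-out stream.
    Illustrative sketch only, not the paper's implementation.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def push(self, sample):
        """Offer one (input, target) pair arriving from a simulation run."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            # Evict a uniformly chosen slot with probability capacity/seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = sample

    def sample_batch(self, batch_size):
        """Draw a random mini-batch; the trainer never blocks on the stream."""
        idx = [self.rng.randrange(len(self.buffer)) for _ in range(batch_size)]
        xs, ys = zip(*(self.buffer[i] for i in idx))
        return np.stack(xs), np.stack(ys)


# Toy usage: stream synthetic (state, next_state) pairs standing in for
# solver output, and periodically draw a batch for a training step.
reservoir = TrainingReservoir(capacity=4096)
for step in range(10_000):
    x = np.random.rand(32).astype(np.float32)  # stand-in simulation state
    y = np.roll(x, 1)                          # stand-in solver update
    reservoir.push((x, y))
    if step % 1000 == 999 and len(reservoir.buffer) >= 64:
        batch_x, batch_y = reservoir.sample_batch(64)
        # surrogate.train_step(batch_x, batch_y) would go here
```

With uniform eviction, old and new samples are equally represented in expectation; a weighted retention policy (e.g., favoring rarely seen simulation regimes) would slot into `push` without changing the trainer-facing API.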
