Hierarchical Autoencoder-based Lossy Compression for Large-scale High-resolution Scientific Data (2307.04216v2)
Abstract: Lossy compression has become an important technique for reducing data size in many domains. It is especially valuable for large-scale scientific data, whose size can reach several petabytes. Although autoencoder-based models have been successfully leveraged to compress images and videos, such neural networks have not gained widespread attention in the scientific data domain. Our work presents a neural network that not only significantly compresses large-scale scientific data but also maintains high reconstruction quality. The proposed model is tested on publicly available scientific benchmark data and applied to a large-scale, high-resolution climate modeling data set. Our model achieves a compression ratio of 140 on several benchmark data sets without compromising reconstruction quality. 2D simulation data from the High-Resolution Community Earth System Model (CESM) Version 1.3, spanning 500 years, is also compressed at a ratio of 200 while the reconstruction error remains negligible for scientific analysis.
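To make the general approach concrete, the sketch below shows a minimal convolutional autoencoder for 2D scientific fields and how a nominal compression ratio is computed from the input and quantized-latent sizes. This is an illustrative assumption, not the paper's architecture: the class name `SciAutoencoder`, the layer widths, the stride-2 downsampling, and the one-byte latent storage are all hypothetical choices.

```python
# Minimal sketch of autoencoder-based lossy compression for 2D scientific
# fields. Illustrative only -- layer widths, strides, and the one-byte
# latent quantization are assumptions, not the paper's design.
import torch
import torch.nn as nn

class SciAutoencoder(nn.Module):
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        # Encoder: three stride-2 convolutions shrink H and W by 8x each axis.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, latent_channels, 3, stride=2, padding=1),
        )
        # Decoder mirrors the encoder with transposed convolutions.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)
        # Straight-through quantization: round in the forward pass but let
        # gradients pass through unchanged (Bengio et al., 2013).
        z_q = z + (torch.round(z) - z).detach()
        return self.decoder(z_q), z_q

model = SciAutoencoder()
x = torch.randn(1, 1, 256, 256)        # one float32 field, 256 x 256
x_hat, z_q = model(x)

# Nominal compression ratio if the rounded latent is stored at one byte
# per symbol; entropy coding the latents would raise the ratio further.
ratio = x.numel() * 4 / z_q.numel()    # float32 bytes vs. 1 byte/symbol
print(f"latent shape {tuple(z_q.shape)}, nominal ratio ~{ratio:.0f}x")
```

In a hierarchical variant of this idea, several latents at different resolutions would be quantized and entropy coded together; the straight-through rounding above is one common way to keep the quantization step differentiable during training.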