Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization (1706.03791v1)

Published 12 Jun 2017 in cs.IT and math.IT

Abstract: Today's HPC applications are producing extremely large amounts of data, such that data storage and analysis are becoming more challenging for scientific research. In this work, we design a new error-controlled lossy compression algorithm for large-scale scientific data. Our key contribution is significantly improving the prediction hitting rate (or prediction accuracy) for each data point based on its nearby data values along multiple dimensions. We derive a series of multilayer prediction formulas and their unified formula in the context of data compression. One serious challenge is that the data prediction has to be performed based on the preceding decompressed values during the compression in order to guarantee the error bounds, which may degrade the prediction accuracy in turn. We explore the best layer for the prediction by considering the impact of compression errors on the prediction accuracy. Moreover, we propose an adaptive error-controlled quantization encoder, which can further improve the prediction hitting rate considerably. The data size can be reduced significantly after performing the variable-length encoding because of the uneven distribution produced by our quantization encoder. We evaluate the new compressor on production scientific data sets and compare it with many other state-of-the-art compressors: GZIP, FPZIP, ZFP, SZ-1.1, and ISABELA. Experiments show that our compressor is the best in class, especially with regard to compression factors (or bit-rates) and compression errors (including RMSE, NRMSE, and PSNR). Our solution is better than the second-best solution by more than a 2x increase in the compression factor and 3.8x reduction in the normalized root mean squared error on average, with reasonable error bounds and user-desired bit-rates.

Citations (236)

Summary

  • The paper presents a multidimensional prediction model that significantly improves prediction accuracy compared to traditional single-dimensional methods.
  • The paper develops error-controlled quantization to adapt precision while strictly adhering to user-defined error bounds.
  • Empirical results show SZ-1.4 achieves over double the compression ratio and nearly fourfold reduction in error versus leading techniques.

Improving Lossy Compression for Scientific Data Sets

The paper entitled "Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization" by Dingwen Tao et al. presents a novel lossy compression algorithm designed to enhance the handling of large-scale scientific data. The research introduces an error-controlled method, specifically tailored to cope with the immense volume and variability of data generated by high-performance computing (HPC) applications. By leveraging a combination of multidimensional prediction and adaptive quantization, the proposed algorithm, SZ-1.4, is positioned as a superior solution relative to extant techniques such as GZIP, FPZIP, ZFP, SZ-1.1, and ISABELA.

Core Contributions

The principal contributions of this work can be outlined as follows:

  1. Multidimensional Prediction Model: The authors introduce a generalized prediction model that extends beyond single-dimensional analysis, significantly improving the prediction accuracy for each data point. Prior methods relied primarily on curve fitting or single-dimensional interpolation, which falter on sharply varying data sets. The multidimensional approach, coupled with an optimal selection of the prediction layer (how many preceding data points along each dimension are used), yields markedly better compression.
  2. Error-Controlled Quantization: An adaptive quantization method is developed that increases precision in handling data irregularities without violating user-defined error bounds. This differs fundamentally from traditional non-uniform vector quantization in that every quantization interval is bound to a fixed error tolerance. A minimal sketch of the combined prediction and quantization steps follows this list.
  3. Empirical Validation: Comprehensive experiments on real-world scientific data encompassing climate simulations, X-ray research, and hurricane simulations underscore SZ-1.4's efficiency. Notably, the algorithm offers over a twofold improvement in compression factor and a near fourfold reduction in normalized root mean square error (NRMSE) compared to the next best solution.
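
To make the first two contributions concrete, the following is a minimal Python sketch, not the authors' C implementation, of a one-layer 2D predictor combined with linear-scaling, error-controlled quantization. The function name, the interval count, and the handling of unpredictable points are illustrative assumptions; the actual compressor also supports 1D and 3D data and chooses the prediction layer adaptively.

```python
import numpy as np

def compress_2d(data, error_bound, num_intervals=256):
    """Sketch: one-layer 2D prediction + error-controlled quantization.

    Each point is predicted from its previously *decompressed* neighbors
    (left + above - upper-left), and the prediction error is mapped to an
    integer code whose step is 2 * error_bound. Assumes a float 2D array.
    """
    ny, nx = data.shape
    recon = np.zeros_like(data)            # decompressed values seen so far
    codes = np.zeros(data.shape, dtype=np.int32)
    unpredictable = {}                     # points whose code falls outside the range
    half = num_intervals // 2

    for i in range(ny):
        for j in range(nx):
            left = recon[i, j - 1] if j > 0 else 0.0
            above = recon[i - 1, j] if i > 0 else 0.0
            diag = recon[i - 1, j - 1] if (i > 0 and j > 0) else 0.0
            pred = left + above - diag     # one-layer 2D predictor

            code = int(np.rint((data[i, j] - pred) / (2 * error_bound)))
            if abs(code) < half:
                codes[i, j] = code + half  # shift to a non-negative symbol
                recon[i, j] = pred + 2 * code * error_bound
            else:
                codes[i, j] = 0            # reserved symbol: unpredictable point
                unpredictable[(i, j)] = data[i, j]  # stored separately in practice
                recon[i, j] = data[i, j]

    return codes, unpredictable
```

Because the predictor consumes previously decompressed values rather than the original data, the decompressor can reproduce identical predictions, which is what makes the pointwise error bound enforceable.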

Numerical Performance and Methodological Justification

The paper provides strong empirical support for its claims: SZ-1.4 outperforms its competitors in compression factor while achieving lower compression error (lower RMSE and NRMSE, higher PSNR) across a range of error bounds. The paper also introduces the "prediction hitting rate", the proportion of data points that can be predicted to within the error bound, as a useful additional metric for evaluating compression performance.
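
For reference, a short sketch of these error metrics, using the definitions common in the lossy-compression literature (NRMSE normalizes RMSE by the data's value range, and PSNR is expressed relative to the same range); the paper is assumed to follow the same conventions.

```python
import numpy as np

def error_metrics(original, decompressed):
    """Compute RMSE, NRMSE, and PSNR between original and decompressed data."""
    diff = original - decompressed
    rmse = np.sqrt(np.mean(diff ** 2))
    value_range = original.max() - original.min()
    nrmse = rmse / value_range
    psnr = 20 * np.log10(value_range / rmse)  # higher is better
    return rmse, nrmse, psnr
```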

By employing variable-length encoding, notably a Huffman code tuned to the adaptive number of quantization intervals, the authors further reduce the data size while maintaining the integrity of critical information; a minimal sketch of this encoding stage follows.
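
As an illustration of this stage, here is a minimal sketch that derives Huffman code lengths from the frequencies of the quantization symbols; the function and its tie-breaking scheme are illustrative rather than the paper's implementation, which also serializes the tree and the bit stream.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return the Huffman code length (in bits) assigned to each symbol."""
    freq = Counter(symbols)
    if len(freq) == 1:                    # degenerate case: a single symbol
        return {next(iter(freq)): 1}

    # Heap entries: (frequency, tie_breaker, {symbol: code_length_so_far})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]
```

Because most predictions hit, the quantization symbols cluster heavily around the central interval (the uneven distribution noted in the abstract), so the resulting code lengths are strongly skewed and the encoded stream shrinks accordingly, e.g. `huffman_code_lengths(codes.ravel().tolist())` on the output of the earlier sketch.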

Implications and Future Directions

From a practical perspective, the development of this algorithm could significantly impact fields that rely heavily on large-scale data analysis and simulation, such as climate research and cosmology, where the ability to reduce data sizes without loss of significant information is crucial. Theoretically, this work contributes to the broader discourse on lossy compression by proposing methodologies that circumvent limitations related to data smoothness and range constraints, prevalent in existing compressors like ZFP.

Future developments of this work may explore optimization of the algorithm for diverse computational architectures, improvements in compression and decompression speed, and better control of error autocorrelation, especially for data sets compressed at high compression factors. These advancements could broaden the applicability of lossy compression in settings where scientific precision and data manageability are both paramount.

This paper stands as a substantial step forward in adaptive lossy compression strategies, providing a robust framework adaptable for various scientific computing needs, and extending the potential of data compression well beyond traditional functionalities.