
SHRINK: Data Compression by Semantic Extraction and Residuals Encoding (2410.06713v2)

Published 9 Oct 2024 in cs.DC

Abstract: The distributed data infrastructure in Internet of Things (IoT) ecosystems requires efficient data-series compression methods, along with the ability to meet different accuracy demands. However, the compression performance of existing methods degrades sharply when ultra-accurate data recovery is required. In this paper, we introduce SHRINK, a novel highly accurate data compression method that offers a higher compression ratio and lower runtime than prior compressors. SHRINK extracts data semantics in the form of linear segments to construct a compact knowledge base, using a dynamic error threshold that it adapts to data characteristics. Then, it captures the remaining data details as residuals to support lossy compression at diverse resolutions as well as lossless compression. As SHRINK identifies repeated semantics, its compression ratio increases with data size. Our experimental evaluation demonstrates that SHRINK outperforms state-of-the-art methods with up to a threefold improvement in compression ratio.
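
The two-stage idea in the abstract (linear segments first, residuals second) can be illustrated with a minimal Python sketch. This is not SHRINK's actual algorithm: it assumes a fixed error threshold `eps` instead of the dynamic one the abstract describes, it does not merge repeated segments into a knowledge base, and it does not entropy-code the residuals. All function names and the sample data are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical segment layout: (start index, start value, slope).
Segment = Tuple[int, float, float]


def extract_segments(values: List[float], eps: float) -> List[Segment]:
    """Greedy piecewise linear approximation with a FIXED max error eps.

    Each segment is anchored at its first value; the range of feasible
    slopes shrinks as points are added, and a new segment starts once
    no slope keeps every point within eps of the line.
    """
    segments: List[Segment] = []
    n = len(values)
    start = 0
    while start < n:
        origin = values[start]
        lo, hi = float("-inf"), float("inf")
        end = start + 1
        while end < n:
            dt = end - start
            new_lo = max(lo, (values[end] - eps - origin) / dt)
            new_hi = min(hi, (values[end] + eps - origin) / dt)
            if new_lo > new_hi:  # no single slope fits all points anymore
                break
            lo, hi = new_lo, new_hi
            end += 1
        # A one-point segment has no slope constraint; use 0.0 there.
        slope = 0.0 if lo == float("-inf") else (lo + hi) / 2
        segments.append((start, origin, slope))
        start = end
    return segments


def reconstruct_base(segments: List[Segment], n: int) -> List[float]:
    """Evaluate the linear segments at every index 0..n-1."""
    out: List[float] = []
    for k, (start, origin, slope) in enumerate(segments):
        stop = segments[k + 1][0] if k + 1 < len(segments) else n
        out.extend(origin + slope * (t - start) for t in range(start, stop))
    return out


def encode_residuals(values: List[float], base: List[float], step: float) -> List[int]:
    """Quantize residuals (value minus base) to integer multiples of step."""
    return [round((v - b) / step) for v, b in zip(values, base)]


def decode(base: List[float], codes: List[int], step: float) -> List[float]:
    """Rebuild values from the base plus dequantized residuals."""
    return [b + c * step for b, c in zip(base, codes)]


# Made-up readings with two decimal places.
data = [20.10, 20.35, 20.61, 20.84, 21.12, 23.50, 23.10, 22.71, 22.30, 21.88]
eps = 0.05
segments = extract_segments(data, eps)
base = reconstruct_base(segments, len(data))

# Coarse resolution: a step of 0.02 bounds the error by about 0.01.
coarse = decode(base, encode_residuals(data, base, 0.02), 0.02)

# Fine resolution: a step of 0.005 lets rounding to two decimals
# recover the two-decimal inputs.
fine = decode(base, encode_residuals(data, base, 0.005), 0.005)
restored = [round(x, 2) for x in fine]

print(len(segments), "segments for", len(data), "points")
print("exact recovery:", restored == data)
```

In this sketch the same base segments serve every resolution; only the quantization step of the residuals changes. Rounding to nearest bounds the reconstruction error by half the step, so a smaller step gives a tighter bound at the cost of larger residual codes.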

