- The paper introduces DeepZip, a framework that uses RNNs to predict probability distributions for effective lossless compression.
- It integrates recurrent neural networks with arithmetic coding, achieving competitive compression rates on diverse sequential data types.
- Experimental results show that DeepZip outperforms standard compressors like Gzip and rivals specialized methods for text and genomic data.
DeepZip: Lossless Data Compression using Recurrent Neural Networks
The paper "DeepZip: Lossless Data Compression using Recurrent Neural Networks" presents an innovative approach to the compression of sequential data utilizing the capabilities of Recurrent Neural Networks (RNNs). The authors, Mohit Goyal et al., propose DeepZip, a framework integrating RNN-based probability predictors with arithmetic coding to achieve lossless compression on diverse data types, including synthetic, text, and genomic sequences.
Summary of the Approach
The key insight underlying DeepZip is that RNNs can serve as effective predictors for compressing sequential data. The approach leverages the universal approximation capability of neural networks to estimate the conditional probability of each symbol given its history, enabling efficient arithmetic coding of the data stream. The compressor comprises two main components: a probability predictor and an arithmetic coder. The probability predictor, implemented with architectures such as fully connected networks, GRUs, and LSTMs, outputs a conditional probability distribution over the next symbol in the sequence. These estimates are fed to an arithmetic encoder, which converts them into a compressed bitstream; decompression runs the same predictor in lockstep so the decoder reproduces the identical probabilities.
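To make the two-stage pipeline concrete, the sketch below pairs a small GRU predictor with a toy arithmetic coder. This is an illustrative assumption rather than the authors' implementation: the `RNNPredictor` class, its hyperparameters, and the floating-point `arithmetic_encode` helper are invented for exposition, and a practical coder would use finite-precision integer intervals with incremental bit output.

```python
# Minimal sketch (not the authors' code) of a DeepZip-style pipeline:
# an RNN predicts P(next symbol | history) and a toy arithmetic coder
# narrows an interval according to those predictions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNPredictor(nn.Module):
    """GRU-based predictor of P(next symbol | history)."""
    def __init__(self, vocab_size: int, hidden_size: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, symbols, state=None):
        # symbols: (batch, seq_len) tensor of integer symbol ids
        x = self.embed(symbols)
        h, state = self.rnn(x, state)
        probs = F.softmax(self.out(h), dim=-1)   # (batch, seq_len, vocab)
        return probs, state

def arithmetic_encode(symbols, probs):
    """Toy arithmetic coder: shrink [low, high) by the predicted
    probability mass of each observed symbol (floating point, no renormalisation)."""
    low, high = 0.0, 1.0
    for t, sym in enumerate(symbols):
        p = probs[t]                              # distribution over the alphabet at step t
        cum = float(p[:sym].sum())                # cumulative mass below the symbol
        width = high - low
        high = low + width * (cum + float(p[sym]))
        low = low + width * cum
    return low, high                              # any number in [low, high) identifies the sequence

# Usage on a toy sequence: the first symbol would be coded with a fixed
# prior; here we encode symbols 2..n using the model's one-step-ahead predictions.
vocab_size = 4
seq = torch.tensor([[0, 1, 2, 3, 1, 0]])
model = RNNPredictor(vocab_size)
with torch.no_grad():
    dist, _ = model(seq[:, :-1])                  # dist[:, t] approximates P(x_{t+1} | x_1..x_t)
low, high = arithmetic_encode(seq[0, 1:].tolist(), dist[0])
print(f"interval width {high - low:.3e} -> ~{-math.log2(high - low):.2f} bits")
```

The final interval width equals the product of the predicted probabilities of the encoded symbols, so the ideal code length is roughly the negative base-2 logarithm of that width: the better the predictor, the shorter the code.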
Experimental Evaluation
DeepZip is evaluated on both real-world datasets and synthetic sequences to demonstrate the breadth of its applicability. The real-world datasets span genomic sequences such as human chromosome 1 and C. elegans DNA, text corpora, and quality score data from sequencing reads. DeepZip consistently outperforms the general-purpose compressor Gzip across these datasets. It also achieves compression rates competitive with stronger compressors such as BSC and with specialized tools such as GeCo for genomic data and ZPAQ for text.
On synthetic datasets with known entropy rates, DeepZip identifies and exploits redundancy that classical compressors like Gzip fail to capture, achieving near-optimal compression, particularly for sequences with long-range dependencies or low entropy. This performance underlines the model's ability to learn statistical structure across diverse sequential patterns.
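As context for what "known entropy rates" means in such benchmarks, the hedged sketch below constructs a synthetic source whose entropy rate can be computed in closed form. The binary first-order Markov chain and the parameter values are illustrative assumptions, not the paper's actual generators; the point is only to show how the theoretical optimum against which a compressor is judged can be derived.

```python
# Illustrative synthetic benchmark: a source with a known entropy rate.
import math
import random

def gen_markov(n: int, p_stay: float = 0.95, seed: int = 0) -> list[int]:
    """Binary first-order Markov source: repeat the previous bit with
    probability p_stay, flip it otherwise."""
    rng = random.Random(seed)
    bits, prev = [], 0
    for _ in range(n):
        prev = prev if rng.random() < p_stay else 1 - prev
        bits.append(prev)
    return bits

def entropy_rate(p_stay: float) -> float:
    """Entropy rate in bits/symbol of the chain above; its stationary
    distribution is uniform, so the rate is the binary entropy of p_stay."""
    p = p_stay
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

n = 100_000
data = gen_markov(n)
h = entropy_rate(0.95)
print(f"theoretical optimum: {h:.3f} bits/symbol ({h * n / 8 / 1024:.1f} KiB total)")
# A predictor that captures the one-step dependency can approach this bound,
# whereas a coder that treats symbols as independent and equiprobable
# cannot do better than 1 bit/symbol on this source.
```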
Implications and Future Work
From a theoretical perspective, the results prompt a reconsideration of traditional context-based compression methods and position RNNs as a viable alternative. This substitution not only improves performance but also reduces reliance on hand-crafted context models, replacing them with data-driven learning of sequence statistics. Furthermore, the ability of the RNNs within DeepZip to capture long-term dependencies shows their utility beyond typical natural language processing tasks, extending into data compression.
In practice, the results have substantial implications for storage and transmission efficiency in domains where large sequential datasets are prevalent, such as bioinformatics and text processing. The authors suggest that future enhancements could include attention mechanisms to refine probability predictions and allow the model to adapt to non-stationary sequences. Such improvements could further close the gap between achieved compression ratios and the theoretical entropy limit.
In conclusion, DeepZip demonstrates the potential of neural network-based models in lossless data compression frameworks. The approach not only outperforms traditional general-purpose methods but also opens opportunities for new architectures and techniques for compressing complex sequential data. As the field develops, such models are likely to play a central role in improving both the performance and the flexibility of data compression systems.