- The paper introduces a method that achieves the optimal compression rate for multisets using bits-back coding, with computational complexity independent of the alphabet size.
- It repurposes sequence compression algorithms built on asymmetric numeral systems (ANS), converting the multiset into a proxy sequence and recovering the bits spent on symbol order, which lowers the average message length.
- Empirical results show the approach compresses multisets to near their theoretical information content, including multisets of images and JSON files drawn from very large alphabets.
An Expert Overview of "Compressing Multisets with Large Alphabets using Bits-Back Coding"
In data compression, traditional methods for handling multisets tend to become computationally prohibitive as the alphabet grows. The paper "Compressing Multisets with Large Alphabets using Bits-Back Coding" addresses this challenge by proposing a method that achieves the optimal compression rate for multisets of exchangeable symbols, with computational complexity independent of the alphabet size.
Core Contributions and Methodology
The authors show how to repurpose sequence compression algorithms for multisets by applying bits-back coding, an existing idea used here in a new way. The crux of the method is to convert the multiset into a proxy sequence: the next element to compress is chosen by sampling without replacement from the remaining multiset, and the bits consumed by that choice are later returned to the compressed message. This is made possible by asymmetric numeral systems (ANS), whose stack-like behaviour lets the coder interleave decode steps (to sample an element) with encode steps (to compress it) during compression, and invert the process exactly during decompression, as sketched below.
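To make the mechanics concrete, here is a minimal sketch of that encode/decode loop using a toy arbitrary-precision rANS state in Python. The `pop`/`push` interface, the fixed i.i.d. symbol model, and all names are illustrative assumptions chosen for exposition, not the authors' implementation.

```python
from collections import Counter

def pop(state, freqs):
    """Decode one symbol from the ANS state under `freqs` (dict: symbol -> count)."""
    total = sum(freqs.values())
    r = state % total
    cum = 0
    for sym, f in sorted(freqs.items()):
        if cum + f > r:
            return sym, f * (state // total) + (r - cum)
        cum += f
    raise AssertionError("unreachable")

def push(state, sym, freqs):
    """Encode one symbol onto the ANS state under `freqs` (exact inverse of pop)."""
    total = sum(freqs.values())
    cum = sum(f for s, f in freqs.items() if s < sym)
    f = freqs[sym]
    return (state // f) * total + cum + (state % f)

def encode_multiset(multiset, model, state):
    """Encode `multiset` (a Counter) given an i.i.d. symbol `model` (symbol -> freq)."""
    remaining = Counter(multiset)
    for _ in range(sum(remaining.values())):
        # Bits-back step: *decode* a symbol from the state, with probability
        # proportional to its remaining count (sampling without replacement).
        sym, state = pop(state, remaining)
        remaining[sym] -= 1
        if remaining[sym] == 0:
            del remaining[sym]
        # Then encode the sampled symbol under the symbol model.
        state = push(state, sym, model)
    return state

def decode_multiset(state, model, n):
    """Invert encode_multiset: recover the multiset and the encoder's initial state."""
    recovered = Counter()
    for _ in range(n):
        # Decode the symbol under the model (inverts the encoder's last push).
        sym, state = pop(state, model)
        recovered[sym] += 1
        # Return the 'borrowed' bits: re-encode the sampling choice under the
        # multiset reconstructed so far, which matches the encoder's remaining counts.
        state = push(state, sym, recovered)
    return recovered, state

if __name__ == "__main__":
    model = {"a": 5, "b": 3, "c": 2}      # toy i.i.d. symbol frequencies
    multiset = Counter("aabac")           # the multiset to compress
    initial_state = 1 << 64               # stand-in for the 'initial bits' of the message
    compressed = encode_multiset(multiset, model, initial_state)
    recovered, state = decode_multiset(compressed, model, sum(multiset.values()))
    assert recovered == multiset and state == initial_state
    # Net cost in bits: roughly the sequence cost minus the ordering bits recovered.
    print(compressed.bit_length() - initial_state.bit_length())
```

The decoder recovers not only the multiset but also the encoder's initial state, which is exactly the bits-back property: the bits spent choosing an ordering are returned at decode time.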
The authors then show how this enables compression of the multiset by encoding a sequence containing the same elements: the bits-back step reduces the expected message length by the information needed to specify an ordering, roughly log2(n! / (m_1! ⋯ m_k!)) bits for a multiset of size n with symbol multiplicities m_1, …, m_k. As a result, the method is not only computationally efficient but also achieves, on average, the optimal compression rate theoretically possible for multisets; a worked example of the saving appears below.
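As a concrete check on the size of that saving, the snippet below evaluates log2(n! / (m_1! ⋯ m_k!)), the number of bits needed to distinguish the orderings of a multiset. This is a standard combinatorial identity rather than a result quoted from the paper.

```python
from collections import Counter
from math import lgamma, log

def permutation_bits(multiset):
    """Bits needed to distinguish the orderings of a multiset: log2(n! / prod_x m_x!)."""
    counts = Counter(multiset)
    n = sum(counts.values())
    nats = lgamma(n + 1) - sum(lgamma(m + 1) for m in counts.values())
    return nats / log(2)

print(permutation_bits("aabac"))   # ~4.32 bits: log2 of the 20 distinct orderings of 'aabac'
```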
Numerical Results and Computational Complexity
Experiments demonstrate the feasibility and efficiency of the proposed method on large alphabets, in particular for multisets of images and of JSON files. The numerical results indicate that the method compresses multisets to close to their theoretical information content across a range of multiset and alphabet sizes. Importantly, both the expected and the worst-case time complexity are independent of the alphabet size: encoding and decoding scale with the size of the multiset and with the cost of the underlying entropy coder, as summarized in the paper's complexity analysis. A toy illustration of the alphabet-independence point follows.
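The alphabet-independence claim can be seen structurally: the coder's working state only ever tracks the symbols actually present in the multiset, never the full alphabet. The snippet below makes the point with a plain `Counter`; it is a stand-in for, not a description of, the data structure used in the paper.

```python
from collections import Counter

# The alphabet here is effectively "all byte strings" (astronomically large), but the
# coder only needs per-symbol counts for the symbols that actually occur, so memory
# and per-step work are bounded by the multiset size, not the alphabet size.
multiset = Counter([b"img_0042.png", b"img_0042.png", b"notes.json"])
print(len(multiset))   # 2 distinct symbols tracked, regardless of alphabet size
```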
Implications and Future Directions
The implications of this work are significant for fields where handling and compressing large datasets with inherent symmetries is a regular occurrence, such as in databases, machine learning, and data analysis applications. This methodology paves the way for future exploration into adaptive entropy coding schemes that could incorporate the proposed framework, potentially reducing the rate further by accounting for dependencies between symbols.
A particularly promising area for further research is exploring the utility of this methodology in scenarios with non-i.i.d. symbols by adapting the ANS to dynamically accommodate changing symbol distributions. Such adaptations would enhance the applicability of the technique across diverse data structures, like nested multisets encountered in hierarchical datasets or complex tree structures.
Conclusion
This paper offers a comprehensive and innovative approach to multiset compression by effectively leveraging bits-back coding. By decoupling computational complexity from alphabet size while maintaining an optimal compression rate, the authors present a robust framework for further work on efficient data compression. The insights could influence a broad range of applications in which exchangeable data must be stored or transmitted efficiently.