Understanding Entropy Coding With Asymmetric Numeral Systems (ANS): a Statistician's Perspective (2201.01741v2)

Published 5 Jan 2022 in stat.ML, cs.IT, cs.LG, and math.IT

Abstract: Entropy coding is the backbone of data compression. Novel machine-learning based compression methods often use a new entropy coder called Asymmetric Numeral Systems (ANS) [Duda et al., 2015], which provides very close to optimal bitrates and simplifies [Townsend et al., 2019] advanced compression techniques such as bits-back coding. However, researchers with a background in machine learning often struggle to understand how ANS works, which prevents them from exploiting its full versatility. This paper is meant as an educational resource to make ANS more approachable by presenting it from a new perspective of latent variable models and the so-called bits-back trick. We guide the reader step by step to a complete implementation of ANS in the Python programming language, which we then generalize for more advanced use cases. We also present and empirically evaluate an open-source library of various entropy coders designed for both research and production use. Related teaching videos and problem sets are available online.

Authors (1)
  1. Robert Bamler (33 papers)
Citations (10)

Summary

  • The paper demonstrates that ANS achieves near-optimal compression by leveraging latent variable models and the bits-back trick.
  • The methodology explains a transition from a trivial stream code to a practical streaming ANS algorithm using bulk and head data structures.
  • The study highlights the efficiency of the open-source 'constriction' library, offering fast, low-overhead implementations for modern machine learning compression.

Entropy coding is a fundamental component of data compression, particularly gaining prominence in modern machine learning-based compression methods. Asymmetric Numeral Systems (ANS) has emerged as a preferred entropy coder in this domain due to its near-optimal compression ratios and suitability for advanced techniques like bits-back coding. However, its algorithmic intricacies can be challenging for researchers from a machine learning or statistics background. This paper provides an educational perspective on ANS, framing it in terms of latent variable models and the bits-back trick, and guides the reader toward a practical implementation.

The theoretical foundation of lossless compression, the source coding theorem, states that the minimum achievable expected bitrate for a message is its entropy $H_P[\mathcal{M}]$, where $P$ is the probabilistic model of the data source. Practical entropy coders aim to approach this bound. Two main types exist: symbol codes and stream codes. Symbol codes, like Huffman coding, assign a whole number of bits to each symbol, leading to an overhead of up to 1 bit per symbol. While suitable for high-entropy-per-symbol data (e.g., text in gzip), this per-symbol overhead is prohibitive for low-entropy-per-symbol data generated by many modern machine learning models. Stream codes, such as Arithmetic Coding (AC), Range Coding (RC), and ANS, amortize this overhead over multiple symbols, achieving bitrates very close to the entropy bound. Unlike AC and RC, which operate as queues (first-in-first-out), ANS operates as a stack (last-in-first-out), which simplifies implementations involving latent variable models.
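To make the per-symbol overhead concrete, the short snippet below (illustrative only; the distribution is made up) compares the entropy bound of a low-entropy-per-symbol distribution with the at-least-one-bit-per-symbol floor of any symbol code.

```python
# Illustrative only: entropy bound vs. the >= 1 bit/symbol floor of symbol codes.
import math

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A highly skewed distribution, typical of symbols emitted by ML-based models.
probs = [0.96, 0.02, 0.01, 0.01]
print(f"entropy bound:     {entropy(probs):.2f} bits/symbol")  # ~0.30
print("symbol-code floor: >= 1.00 bits/symbol")  # >3x overhead per symbol here
```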

The paper introduces ANS by first examining a trivial stream code based on positional numeral systems (PNS). For a sequence of symbols drawn from a uniform distribution over an alphabet $\mathcal{A}$, interpreting the symbols as digits of a base-$|\mathcal{A}|$ number provides an optimal code. The encoding involves treating the sequence $(x_1, \dots, x_k)$ as the number $\sum_{i=1}^k x_i |\mathcal{A}|^{k-i}$. The simplest implementation has stack semantics: symbols are encoded from most significant to least significant digit but decoded from least significant to most significant (Listing 1). This code also demonstrates amortization, where a single bit in the compressed output depends on multiple symbols, and it can handle sequences with varying alphabet sizes by changing the base dynamically (Listing 2).
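A minimal Python sketch of such a PNS-based coder with stack semantics and a dynamically changing base is shown below. It is written in the spirit of the paper's Listings 1 and 2; the method names are chosen for illustration, and the compressed data lives in a single arbitrary-precision Python integer.

```python
class UniformCoder:
    """Positional-numeral-system coder with stack (last-in-first-out) semantics.

    The compressed data is a single arbitrary-precision integer. Each push
    appends one digit in the given base; pop removes the most recently pushed
    digit, so symbols come back out in reverse order.
    """

    def __init__(self, compressed=0):
        self.compressed = compressed  # Python ints are unbounded

    def push(self, symbol, base):
        assert 0 <= symbol < base
        self.compressed = self.compressed * base + symbol

    def pop(self, base):
        symbol = self.compressed % base
        self.compressed //= base
        return symbol

# The base may differ from symbol to symbol (varying alphabet sizes):
coder = UniformCoder()
coder.push(3, base=10)
coder.push(4, base=5)
assert coder.pop(base=5) == 4 and coder.pop(base=10) == 3
```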

To handle arbitrary non-uniform probability distributions $P_i(x_i)$ for each symbol $x_i$, ANS uses a fixed-point approximation $Q_i(x_i) = m_i(x_i)/n$, where $n = 2^{\text{precision}}$ and the $m_i(x_i)$ are integers summing to $n$ over the alphabet (Figure 2). This approximation can be viewed through a latent variable model perspective: sampling $x_i$ from $Q_i$ is equivalent to sampling a latent integer $z_i$ uniformly from $\{0, \dots, n-1\}$ and then mapping $z_i$ to $x_i$ based on which of the disjoint subranges $\mathcal{S}_i(x_i)$, each of size $m_i(x_i)$, it falls into. A naive approach would be to encode the sequence of $z_i$ values using the PNS UniformCoder. However, this is inefficient because the encoder's choice of $z_i$ within $\mathcal{S}_i(x_i)$ contains $\log_2 m_i(x_i)$ bits of information that are discarded by the decoder (which only needs to know which $\mathcal{S}_i(x_i)$ contains $z_i$).
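The sketch below illustrates this fixed-point approximation and the latent-variable mapping: it turns model probabilities into integer weights $m(x)$ that sum to $n = 2^{\text{precision}}$ and recovers the symbol whose subrange contains a given latent $z$. The crude rounding here is for illustration only; real implementations quantize more carefully.

```python
from itertools import accumulate

PRECISION = 12
N = 1 << PRECISION  # n = 2**precision

def quantize(probs):
    """Crude fixed-point approximation m(x) ~ probs[x] * n with sum(m) == n.

    Illustrative only: production coders distribute the rounding error more
    carefully and guarantee m(x) > 0 for every symbol in the alphabet.
    """
    m = [max(1, round(p * N)) for p in probs]
    m[-1] += N - sum(m)  # absorb the rounding error in the last symbol
    assert all(w > 0 for w in m) and sum(m) == N
    return m

def symbol_from_latent(z, m):
    """Return (x, start of S(x)) for the subrange S(x) that contains z."""
    cdf = list(accumulate(m, initial=0))  # subrange boundaries
    x = next(i for i in range(len(m)) if cdf[i] <= z < cdf[i + 1])
    return x, cdf[x]

m = quantize([0.7, 0.2, 0.1])           # -> [2867, 819, 410], sums to 4096
x, start = symbol_from_latent(3000, m)  # z = 3000 lies in S(1), so x == 1
```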

The bits-back trick addresses this by utilizing this "wasted" information. When encoding symbol $x_i$, instead of picking an arbitrary $z_i \in \mathcal{S}_i(x_i)$, the encoder decodes $\log_2 m_i(x_i)$ bits from the current state of the compressed data, interpreting them as a number $z'_i \in \{0, \dots, m_i(x_i)-1\}$. The actual value encoded is $z_i = z'_i + \sum_{x_i' < x_i} m_i(x_i')$, which is guaranteed to fall within $\mathcal{S}_i(x_i)$. This $z_i$ is then encoded onto the state using the uniform distribution over $\{0, \dots, n-1\}$, costing $\log_2 n$ bits. The net contribution is $\log_2 n - \log_2 m_i(x_i) = -\log_2 Q_i(x_i)$, the desired information content (Figure 2). The SlowAnsCoder (Listing 3, inlined in Listing 4) implements this logic using the UniformCoder as an internal state. Decoding inverts this process: it decodes $z_i$ using the uniform distribution over $\{0, \dots, n-1\}$, finds $x_i$ such that $z_i \in \mathcal{S}_i(x_i)$, calculates $z'_i = z_i - \sum_{x_i' < x_i} m_i(x_i')$, and then encodes $z'_i$ back onto the state using the uniform distribution over $\{0, \dots, m_i(x_i)-1\}$. This effectively restores the state as if the initial $\log_2 m_i(x_i)$ bits had never been decoded. The SlowAnsCoder achieves near-optimal compression but is computationally expensive because it represents the entire compressed stream as a single, potentially very large integer, leading to $O(k^2)$ runtime for a message of length $k$.
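A compact sketch of this construction, built on the UniformCoder and the quantized weights from the sketches above (in the spirit of the paper's Listings 3 and 4, not a verbatim copy), might look as follows; here `cdf[x]` denotes $\sum_{x' < x} m(x')$, the start of $\mathcal{S}(x)$.

```python
class SlowAnsCoder:
    """Bits-back ANS on a single arbitrary-precision integer state.

    Near-optimal bitrates, but O(k^2) runtime for k symbols because every
    operation touches the entire big-integer state.
    """

    def __init__(self, precision=PRECISION):
        self.n = 1 << precision
        self.state = UniformCoder()  # from the earlier sketch

    def encode(self, x, m, cdf):
        z_prime = self.state.pop(base=m[x])             # get log2 m(x) bits back
        self.state.push(cdf[x] + z_prime, base=self.n)  # costs log2 n bits

    def decode(self, m, cdf):
        z = self.state.pop(base=self.n)
        x = next(i for i in range(len(m)) if cdf[i] <= z < cdf[i + 1])
        self.state.push(z - cdf[x], base=m[x])          # give the borrowed bits back
        return x

# Because ANS is a stack, encoding in reverse order lets the decoder
# recover the message in forward order:
cdf = list(accumulate(m, initial=0))
coder = SlowAnsCoder()
for x in reversed([0, 2, 1, 0]):
    coder.encode(x, m, cdf)
assert [coder.decode(m, cdf) for _ in range(4)] == [0, 2, 1, 0]
```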

The practical streaming ANS algorithm (Listing 5) improves efficiency by splitting the compressed data into a bulk (a vector of fixed-size words) and a small, fixed-capacity head (Figure 3). Arithmetic operations primarily occur on the head, which has a bounded size (e.g., 64 bits). When the head overflows or underflows the thresholds dictated by the invariants ($\text{head} < 2^{2 \times \text{precision}}$, and $\text{head} \geq 2^{\text{precision}}$ whenever the bulk is not empty), $\text{precision}$ bits are transferred between head and bulk. This allows for constant (amortized) time operations on the bulk using bit shifts and masks (leveraging $n = 2^{\text{precision}}$) while keeping the head arithmetic fast. This streaming approach introduces a small, usually negligible, overhead compared to the theoretical optimum due to Benford's Law effects on the bits transferred between head and bulk. The implementation (Listing 5) uses bitwise operations to avoid slow integer division during decoding.
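The following simplified sketch (assuming word_size equals precision, so each renormalization moves exactly one word; the paper's Listing 5 and the constriction library are more general) shows how the bulk/head split keeps all arithmetic on bounded integers. It reuses `m`, `cdf`, and `PRECISION` from the sketches above.

```python
class StreamingAnsCoder:
    """Simplified streaming ANS with word_size == precision.

    Requires sum(m) == 2**precision. Invariants between operations:
      head < 2**(2 * precision), and
      head >= 2**precision whenever bulk is non-empty.
    """

    def __init__(self, precision=PRECISION):
        self.precision = precision
        self.n = 1 << precision
        self.mask = self.n - 1
        self.bulk = []   # growable vector of fixed-size words
        self.head = 0    # always fits into 2 * precision bits

    def encode(self, x, m, cdf):
        if self.head >= m[x] << self.precision:
            # Flush one word so the update below cannot overflow the head.
            self.bulk.append(self.head & self.mask)
            self.head >>= self.precision
        # Same update as in SlowAnsCoder, but on a bounded integer:
        self.head = (
            ((self.head // m[x]) << self.precision) + cdf[x] + self.head % m[x]
        )

    def decode(self, m, cdf):
        z = self.head & self.mask
        x = next(i for i in range(len(m)) if cdf[i] <= z < cdf[i + 1])
        self.head = m[x] * (self.head >> self.precision) + (z - cdf[x])
        if self.head < self.n and self.bulk:
            # Refill one word to restore the lower invariant.
            self.head = (self.head << self.precision) | self.bulk.pop()
        return x
```

Round-tripping the same toy message as above works unchanged, since encode and decode are exact inverses of each other; only the representation of the compressed data differs.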

The paper discusses variations on the basic streaming ANS. Generalized streaming allows for different word_size and head_capacity configurations to tune performance and memory usage (Section 4.1). Random-access decoding (seeking) is simpler in ANS than in AC/RC because the encoder and decoder share the same state. Checkpoints of the bulk length and head value can be saved during encoding to allow seeking later during decoding (Listing 6). A more advanced variation, the ChainCoder (sketched in Listing 8), is proposed to address a non-local effect in standard ANS where changing the model for one symbol can affect the decoding of subsequent symbols. By using separate stacks for the data decoded from the latent variable ($z_i$) and the data encoded back ($z'_i$), this ripple effect is prevented, which could be beneficial for optimizing probabilistic models end-to-end through the coder.
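As a hedged illustration of the checkpointing idea (building on the simplified streaming coder above; the paper's Listing 6 differs in its details), seeking only needs the full word vector produced by the encoder plus a saved (bulk length, head) pair.

```python
def seek(compressed_words, checkpoint, precision=PRECISION):
    """Return a decoder positioned at a checkpoint. Illustrative sketch only.

    `compressed_words` is the full bulk produced by the encoder;
    `checkpoint` is a (bulk_length, head) pair that was recorded
    between two encode() calls.
    """
    bulk_len, head = checkpoint
    coder = StreamingAnsCoder(precision)
    coder.bulk = list(compressed_words[:bulk_len])
    coder.head = head
    return coder

# During encoding, record checkpoints at the positions you may want to seek to:
# checkpoint = (len(coder.bulk), coder.head)
```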

Finally, the paper presents constriction, an open-source library (Section 5.1) providing efficient implementations of ANS, RC, and AC, with bindings for both Rust (for performance) and Python (for research). The library aims to bridge the gap between machine learning and systems communities in compression research. Empirical benchmarks (Section 5.2, Table 1, Figures 4 & 5) on real-world data show that both ANS and RC in constriction achieve negligible bitrate overhead (e.g., $<0.1\%$ for the default configuration) while being considerably faster than a standard Arithmetic Coding implementation. ANS typically offers faster decoding, while RC is slightly faster at encoding. The choice between ANS and RC in practice depends on the model architecture (stack for latent variables/bits-back vs. queue for autoregressive models). Constriction provides default configurations optimized for performance across various entropy regimes and options for tuning using lookup tables for specific needs.
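For orientation, an encode/decode round trip with the library's Python bindings can look roughly like the sketch below. It follows the publicly documented constriction API at the time of writing; exact module paths, signatures, and defaults may differ between versions, so the library's documentation should be treated as authoritative.

```python
# Hedged sketch of constriction's Python API; check the library's docs for
# the authoritative names and signatures.
import numpy as np
import constriction

message = np.array([6, 10, -4, 2], dtype=np.int32)

# An i.i.d. entropy model: a Gaussian quantized to integer symbols in [-50, 50].
model = constriction.stream.model.QuantizedGaussian(-50, 50, 3.2, 9.6)

encoder = constriction.stream.stack.AnsCoder()  # ANS operates as a stack,
encoder.encode_reverse(message, model)          # so symbols are encoded in reverse.
compressed = encoder.get_compressed()           # array of fixed-size words

decoder = constriction.stream.stack.AnsCoder(compressed)
assert np.all(decoder.decode(model, 4) == message)
```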