Understanding Entropy Coding With Asymmetric Numeral Systems (ANS): a Statistician's Perspective (2201.01741v2)

Published 5 Jan 2022 in stat.ML, cs.IT, cs.LG, and math.IT

Abstract: Entropy coding is the backbone of data compression. Novel machine-learning based compression methods often use a new entropy coder called Asymmetric Numeral Systems (ANS) [Duda et al., 2015], which provides very close to optimal bitrates and simplifies [Townsend et al., 2019] advanced compression techniques such as bits-back coding. However, researchers with a background in machine learning often struggle to understand how ANS works, which prevents them from exploiting its full versatility. This paper is meant as an educational resource to make ANS more approachable by presenting it from a new perspective of latent variable models and the so-called bits-back trick. We guide the reader step by step to a complete implementation of ANS in the Python programming language, which we then generalize for more advanced use cases. We also present and empirically evaluate an open-source library of various entropy coders designed for both research and production use. Related teaching videos and problem sets are available online.

Authors (1)
  1. Robert Bamler (33 papers)
Citations (10)

Summary

  • The paper demonstrates that ANS achieves near-optimal compression by leveraging latent variable models and the bits-back trick.
  • The methodology explains a transition from a trivial stream code to a practical streaming ANS algorithm using bulk and head data structures.
  • The study highlights the efficiency of the open-source 'constriction' library, offering fast, low-overhead implementations for modern machine learning compression.

Entropy coding is a fundamental component of data compression, particularly gaining prominence in modern machine learning-based compression methods. Asymmetric Numeral Systems (ANS) has emerged as a preferred entropy coder in this domain due to its near-optimal compression ratios and suitability for advanced techniques like bits-back coding. However, its algorithmic intricacies can be challenging for researchers from a machine learning or statistics background. This paper provides an educational perspective on ANS, framing it in terms of latent variable models and the bits-back trick, and guides the reader toward a practical implementation.

The theoretical foundation of lossless compression, the source coding theorem, states that the minimum achievable expected bitrate for a message is its entropy $H_P[\mathcal{M}]$, where $P$ is the probabilistic model of the data source. Practical entropy coders aim to approach this bound. Two main types exist: symbol codes and stream codes. Symbol codes, like Huffman coding, assign a whole number of bits to each symbol, leading to an overhead of up to 1 bit per symbol. While suitable for high-entropy-per-symbol data (e.g., text in gzip), this per-symbol overhead is prohibitive for low-entropy-per-symbol data generated by many modern machine learning models. Stream codes, such as Arithmetic Coding (AC), Range Coding (RC), and ANS, amortize this overhead over multiple symbols, achieving bitrates very close to the entropy bound. Unlike AC and RC, which operate as queues (first-in-first-out), ANS operates as a stack (last-in-first-out), which simplifies implementations involving latent variable models.
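To make the per-symbol overhead concrete, the short snippet below (illustrative only; the distribution is made up) compares the entropy bound of a low-entropy-per-symbol distribution with the at-least-one-bit-per-symbol floor of any symbol code.

```python
# Illustrative only: entropy bound vs. the >= 1 bit/symbol floor of symbol codes.
import math

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A highly skewed distribution, typical of symbols emitted by ML-based models.
probs = [0.96, 0.02, 0.01, 0.01]
print(f"entropy bound:     {entropy(probs):.2f} bits/symbol")  # ~0.30
print("symbol-code floor: >= 1.00 bits/symbol")  # >3x overhead per symbol here
```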

The paper introduces ANS by first examining a trivial stream code based on positional numeral systems (PNS). For a sequence of symbols drawn from a uniform distribution over an alphabet $\mathcal{A}$, interpreting the symbols as digits of a base-$|\mathcal{A}|$ number provides an optimal code. The encoding involves treating the sequence $(x_1, \dots, x_k)$ as the number $\sum_{i=1}^k x_i |\mathcal{A}|^{k-i}$. The simplest implementation has stack semantics: symbols are encoded from most significant to least significant digit but decoded from least significant to most significant (Listing 1). This code also demonstrates amortization, where a single bit in the compressed output depends on multiple symbols, and it can handle sequences with varying alphabet sizes by changing the base dynamically (Listing 2).
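A minimal Python sketch of such a PNS-based coder with stack semantics and a dynamically changing base is shown below. It is written in the spirit of the paper's Listings 1 and 2; the method names are chosen for illustration, and the compressed data lives in a single arbitrary-precision Python integer.

```python
class UniformCoder:
    """Positional-numeral-system coder with stack (last-in-first-out) semantics.

    The compressed data is a single arbitrary-precision integer. Each push
    appends one digit in the given base; pop removes the most recently pushed
    digit, so symbols come back out in reverse order.
    """

    def __init__(self, compressed=0):
        self.compressed = compressed  # Python ints are unbounded

    def push(self, symbol, base):
        assert 0 <= symbol < base
        self.compressed = self.compressed * base + symbol

    def pop(self, base):
        symbol = self.compressed % base
        self.compressed //= base
        return symbol

# The base may differ from symbol to symbol (varying alphabet sizes):
coder = UniformCoder()
coder.push(3, base=10)
coder.push(4, base=5)
assert coder.pop(base=5) == 4 and coder.pop(base=10) == 3
```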

To handle arbitrary non-uniform probability distributions $P_i(x_i)$ for each symbol $x_i$, ANS uses a fixed-point approximation $Q_i(x_i) = m_i(x_i)/n$, where $n = 2^{\text{precision}}$ and the $m_i(x_i)$ are integers summing to $n$ over the alphabet (Figure 2). This approximation can be viewed through a latent variable model perspective: sampling $x_i$ from $Q_i$ is equivalent to sampling a latent integer $z_i$ uniformly from $\{0, \dots, n-1\}$ and then mapping $z_i$ to $x_i$ based on which of the disjoint subranges $\mathcal{S}_i(x_i)$, each of size $m_i(x_i)$, it falls into. A naive approach would be to encode the sequence of $z_i$ values using the PNS UniformCoder. However, this is inefficient because the encoder's choice of $z_i$ within $\mathcal{S}_i(x_i)$ contains $\log_2 m_i(x_i)$ bits of information that are discarded by the decoder (which only needs to know which $\mathcal{S}_i(x_i)$ contains $z_i$).
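The sketch below illustrates this fixed-point approximation and the latent-variable mapping: it turns model probabilities into integer weights $m(x)$ that sum to $n = 2^{\text{precision}}$ and recovers the symbol whose subrange contains a given latent $z$. The crude rounding here is for illustration only; real implementations quantize more carefully.

```python
from itertools import accumulate

PRECISION = 12
N = 1 << PRECISION  # n = 2**precision

def quantize(probs):
    """Crude fixed-point approximation m(x) ~ probs[x] * n with sum(m) == n.

    Illustrative only: production coders distribute the rounding error more
    carefully and guarantee m(x) > 0 for every symbol in the alphabet.
    """
    m = [max(1, round(p * N)) for p in probs]
    m[-1] += N - sum(m)  # absorb the rounding error in the last symbol
    assert all(w > 0 for w in m) and sum(m) == N
    return m

def symbol_from_latent(z, m):
    """Return (x, start of S(x)) for the subrange S(x) that contains z."""
    cdf = list(accumulate(m, initial=0))  # subrange boundaries
    x = next(i for i in range(len(m)) if cdf[i] <= z < cdf[i + 1])
    return x, cdf[x]

m = quantize([0.7, 0.2, 0.1])           # -> [2867, 819, 410], sums to 4096
x, start = symbol_from_latent(3000, m)  # z = 3000 lies in S(1), so x == 1
```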

The bits-back trick addresses this by utilizing this "wasted" information. When encoding symbol $x_i$, instead of picking an arbitrary $z_i \in \mathcal{S}_i(x_i)$, the encoder decodes $\log_2 m_i(x_i)$ bits from the current state of the compressed data, interpreting them as a number $z'_i \in \{0, \dots, m_i(x_i)-1\}$. The actual value encoded is $z_i = z'_i + \sum_{x_i' < x_i} m_i(x_i')$, which is guaranteed to fall within $\mathcal{S}_i(x_i)$. This $z_i$ is then encoded onto the state using the uniform distribution over $\{0, \dots, n-1\}$, costing $\log_2 n$ bits. The net contribution is $\log_2 n - \log_2 m_i(x_i) = -\log_2 Q_i(x_i)$, the desired information content (Figure 2). The SlowAnsCoder (Listing 3, inlined in Listing 4) implements this logic using the UniformCoder as an internal state. Decoding inverts this process: it decodes $z_i$ using the uniform distribution over $\{0, \dots, n-1\}$, finds $x_i$ such that $z_i \in \mathcal{S}_i(x_i)$, calculates $z'_i = z_i - \sum_{x_i' < x_i} m_i(x_i')$, and then encodes $z'_i$ back onto the state using the uniform distribution over $\{0, \dots, m_i(x_i)-1\}$. This effectively restores the state as if the initial $\log_2 m_i(x_i)$ bits had never been decoded. The SlowAnsCoder achieves near-optimal compression but is computationally expensive because it represents the entire compressed stream as a single, potentially very large integer, leading to $O(k^2)$ runtime for a message of length $k$.
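A compact sketch of this construction, built on the UniformCoder and the quantized weights from the sketches above (in the spirit of the paper's Listings 3 and 4, not a verbatim copy), might look as follows; here `cdf[x]` denotes $\sum_{x' < x} m(x')$, the start of $\mathcal{S}(x)$.

```python
class SlowAnsCoder:
    """Bits-back ANS on a single arbitrary-precision integer state.

    Near-optimal bitrates, but O(k^2) runtime for k symbols because every
    operation touches the entire big-integer state.
    """

    def __init__(self, precision=PRECISION):
        self.n = 1 << precision
        self.state = UniformCoder()  # from the earlier sketch

    def encode(self, x, m, cdf):
        z_prime = self.state.pop(base=m[x])             # get log2 m(x) bits back
        self.state.push(cdf[x] + z_prime, base=self.n)  # costs log2 n bits

    def decode(self, m, cdf):
        z = self.state.pop(base=self.n)
        x = next(i for i in range(len(m)) if cdf[i] <= z < cdf[i + 1])
        self.state.push(z - cdf[x], base=m[x])          # give the borrowed bits back
        return x

# Because ANS is a stack, encoding in reverse order lets the decoder
# recover the message in forward order:
cdf = list(accumulate(m, initial=0))
coder = SlowAnsCoder()
for x in reversed([0, 2, 1, 0]):
    coder.encode(x, m, cdf)
assert [coder.decode(m, cdf) for _ in range(4)] == [0, 2, 1, 0]
```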

The practical streaming ANS algorithm (Listing 5) improves efficiency by splitting the compressed data into a bulk (a vector of fixed-size words) and a small, fixed-capacity head (Figure 3). Arithmetic operations primarily occur on the head, which has a bounded size (e.g., 64 bits). When the head overflows or underflows the thresholds dictated by the invariants ($\text{head} < 2^{2 \times \text{precision}}$, and $\text{head} \geq 2^{\text{precision}}$ whenever the bulk is not empty), $\text{precision}$ bits are transferred between head and bulk. This allows for constant (amortized) time operations on the bulk using bit shifts and masks (leveraging $n = 2^{\text{precision}}$) while keeping the head arithmetic fast. This streaming approach introduces a small, usually negligible, overhead compared to the theoretical optimum due to Benford's Law effects on the bits transferred between head and bulk. The implementation (Listing 5) uses bitwise operations to avoid slow integer division during decoding.
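The following simplified sketch (assuming word_size equals precision, so each renormalization moves exactly one word; the paper's Listing 5 and the constriction library are more general) shows how the bulk/head split keeps all arithmetic on bounded integers. It reuses `m`, `cdf`, and `PRECISION` from the sketches above.

```python
class StreamingAnsCoder:
    """Simplified streaming ANS with word_size == precision.

    Requires sum(m) == 2**precision. Invariants between operations:
      head < 2**(2 * precision), and
      head >= 2**precision whenever bulk is non-empty.
    """

    def __init__(self, precision=PRECISION):
        self.precision = precision
        self.n = 1 << precision
        self.mask = self.n - 1
        self.bulk = []   # growable vector of fixed-size words
        self.head = 0    # always fits into 2 * precision bits

    def encode(self, x, m, cdf):
        if self.head >= m[x] << self.precision:
            # Flush one word so the update below cannot overflow the head.
            self.bulk.append(self.head & self.mask)
            self.head >>= self.precision
        # Same update as in SlowAnsCoder, but on a bounded integer:
        self.head = (
            ((self.head // m[x]) << self.precision) + cdf[x] + self.head % m[x]
        )

    def decode(self, m, cdf):
        z = self.head & self.mask
        x = next(i for i in range(len(m)) if cdf[i] <= z < cdf[i + 1])
        self.head = m[x] * (self.head >> self.precision) + (z - cdf[x])
        if self.head < self.n and self.bulk:
            # Refill one word to restore the lower invariant.
            self.head = (self.head << self.precision) | self.bulk.pop()
        return x
```

Round-tripping the same toy message as above works unchanged, since encode and decode are exact inverses of each other; only the representation of the compressed data differs.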

The paper discusses variations on the basic streaming ANS. Generalized streaming allows for different word_size and head_capacity configurations to tune performance and memory usage (Section 4.1). Random-access decoding (seeking) is simpler in ANS than in AC/RC because the encoder and decoder share the same state. Checkpoints of the bulk length and head value can be saved during encoding to allow seeking later during decoding (Listing 6). A more advanced variation, the ChainCoder (sketched in Listing 8), is proposed to address a non-local effect in standard ANS where changing the model for one symbol can affect the decoding of subsequent symbols. By using separate stacks for the data decoded from the latent variable ($z_i$) and the data encoded back ($z'_i$), this ripple effect is prevented, which could be beneficial for optimizing probabilistic models end-to-end through the coder.
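As a hedged illustration of the checkpointing idea (building on the simplified streaming coder above; the paper's Listing 6 differs in its details), seeking only needs the full word vector produced by the encoder plus a saved (bulk length, head) pair.

```python
def seek(compressed_words, checkpoint, precision=PRECISION):
    """Return a decoder positioned at a checkpoint. Illustrative sketch only.

    `compressed_words` is the full bulk produced by the encoder;
    `checkpoint` is a (bulk_length, head) pair that was recorded
    between two encode() calls.
    """
    bulk_len, head = checkpoint
    coder = StreamingAnsCoder(precision)
    coder.bulk = list(compressed_words[:bulk_len])
    coder.head = head
    return coder

# During encoding, record checkpoints at the positions you may want to seek to:
# checkpoint = (len(coder.bulk), coder.head)
```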

Finally, the paper presents constriction, an open-source library (Section 5.1) providing efficient implementations of ANS, RC, and AC, with bindings for both Rust (for performance) and Python (for research). The library aims to bridge the gap between machine learning and systems communities in compression research. Empirical benchmarks (Section 5.2, Table 1, Figures 4 & 5) on real-world data show that both ANS and RC in constriction achieve negligible bitrate overhead (e.g., $<0.1\%$ for the default configuration) while being considerably faster than a standard Arithmetic Coding implementation. ANS typically offers faster decoding, while RC is slightly faster at encoding. The choice between ANS and RC in practice depends on the model architecture (stack for latent variables/bits-back vs. queue for autoregressive models). Constriction provides default configurations optimized for performance across various entropy regimes and options for tuning using lookup tables for specific needs.
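For orientation, an encode/decode round trip with the library's Python bindings can look roughly like the sketch below. It follows the publicly documented constriction API at the time of writing; exact module paths, signatures, and defaults may differ between versions, so the library's documentation should be treated as authoritative.

```python
# Hedged sketch of constriction's Python API; check the library's docs for
# the authoritative names and signatures.
import numpy as np
import constriction

message = np.array([6, 10, -4, 2], dtype=np.int32)

# An i.i.d. entropy model: a Gaussian quantized to integer symbols in [-50, 50].
model = constriction.stream.model.QuantizedGaussian(-50, 50, 3.2, 9.6)

encoder = constriction.stream.stack.AnsCoder()  # ANS operates as a stack,
encoder.encode_reverse(message, model)          # so symbols are encoded in reverse.
compressed = encoder.get_compressed()           # array of fixed-size words

decoder = constriction.stream.stack.AnsCoder(compressed)
assert np.all(decoder.decode(model, 4) == message)
```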