- The paper demonstrates that ANS achieves near-optimal compression by leveraging latent variable models and the bits-back trick.
- The methodology traces the transition from a trivial stream code to a practical streaming ANS algorithm built on bulk and head data structures.
- The study highlights the efficiency of the open-source 'constriction' library, offering fast, low-overhead implementations for modern machine learning compression.
Entropy coding is a fundamental component of data compression, particularly gaining prominence in modern machine learning-based compression methods. Asymmetric Numeral Systems (ANS) has emerged as a preferred entropy coder in this domain due to its near-optimal compression ratios and suitability for advanced techniques like bits-back coding. However, its algorithmic intricacies can be challenging for researchers from a machine learning or statistics background. This paper provides an educational perspective on ANS, framing it in terms of latent variable models and the bits-back trick, and guides the reader toward a practical implementation.
The theoretical foundation of lossless compression, the source coding theorem, states that the minimum achievable expected bitrate for a message M is its entropy H_P[M], where P is the probabilistic model of the data source. Practical entropy coders aim to approach this bound. Two main types exist: symbol codes and stream codes. Symbol codes, like Huffman coding, assign each symbol its own integer number of bits, incurring an overhead of up to 1 bit per symbol. While acceptable for data with high entropy per symbol (e.g., text in gzip), this per-symbol overhead is prohibitive for the low-entropy-per-symbol data produced by many modern machine learning models. Stream codes, such as Arithmetic Coding (AC), Range Coding (RC), and ANS, amortize this overhead over multiple symbols, achieving bitrates very close to the entropy bound. Unlike AC and RC, which operate as queues (first-in-first-out), ANS operates as a stack (last-in-first-out), which simplifies implementations involving latent variable models.
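To see why an overhead of up to 1 bit per symbol matters in this regime, consider a hypothetical, highly skewed binary source (the numbers below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical low-entropy-per-symbol source: a highly skewed binary symbol.
p = np.array([0.96, 0.04])
entropy = -(p * np.log2(p)).sum()   # ~0.24 bits per symbol
symbol_code_rate = 1.0              # a symbol code spends at least 1 bit per symbol
print(f"entropy: {entropy:.3f} bits/symbol, "
      f"symbol-code overhead: ~{symbol_code_rate / entropy:.1f}x")
```

A symbol code would inflate this source by roughly a factor of four, whereas a stream code can approach the 0.24 bits/symbol bound.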
The paper introduces ANS by first examining a trivial stream code based on positional numeral systems (PNS). For a sequence of symbols from a uniform distribution, interpreting the symbols as digits in a base-|A| number provides an optimal code: the sequence (x_1, …, x_k) is encoded as the number x_1·|A|^(k−1) + x_2·|A|^(k−2) + … + x_k. The simplest implementation has stack semantics: symbols are encoded from most significant to least significant digits but decoded from least significant to most significant (Listing 1). This code also demonstrates amortization, where a single bit in the compressed output depends on multiple symbols, and it can handle sequences with varying alphabet sizes by changing the base dynamically (Listing 2).
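A minimal Python sketch of such a uniform coder with stack semantics (the push/pop names and the toy example are mine, not the paper's listings) could look like this:

```python
class UniformCoder:
    """Sketch of a positional-numeral-system stream code with stack semantics.

    The entire compressed data is a single (arbitrarily large) Python int.
    Each push appends a digit in the given base; pops return digits in
    reverse (last-in-first-out) order.
    """

    def __init__(self, compressed=0):
        self.compressed = compressed

    def push(self, symbol, base):   # encode one symbol under a uniform model
        assert 0 <= symbol < base
        self.compressed = self.compressed * base + symbol

    def pop(self, base):            # decode one symbol (inverse of push)
        symbol = self.compressed % base
        self.compressed //= base
        return symbol


coder = UniformCoder()
for s in [3, 1, 4, 1]:              # digits from the alphabet {0, ..., 9}
    coder.push(s, base=10)          # compressed int is now 3141
print([coder.pop(base=10) for _ in range(4)])   # -> [1, 4, 1, 3] (reversed)
```

The base passed to push/pop may change from symbol to symbol, which is how the varying-alphabet case is handled.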
To handle arbitrary non-uniform probability distributions P_i(x_i) for each symbol x_i, ANS uses an approximation Q_i(x_i) = m_i(x_i)/n, where n = 2^precision and the m_i(x_i) are integers that sum to n over the alphabet (Figure 2). This approximation can be viewed through a latent variable model perspective: sampling x_i from Q_i is equivalent to sampling a latent integer z_i uniformly from {0, …, n−1} and then mapping z_i to x_i according to which of the disjoint subranges S_i(x_i), each of size m_i(x_i), it falls into. A naive approach would be to encode the sequence of z_i values using the PNS UniformCoder. However, this is inefficient because the encoder's choice of z_i within S_i(x_i) contains log₂ m_i(x_i) bits of information that are discarded by the decoder (which only needs to know which S_i(x_i) contains z_i).
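To make the subrange construction concrete, here is a hypothetical quantized model with precision = 4 (all weights are illustrative):

```python
import numpy as np

# Hypothetical quantized model with precision = 4, i.e., n = 2**4 = 16.
precision = 4
n = 1 << precision
m = np.array([7, 3, 5, 1])                   # weights m(x) for alphabet {0,1,2,3}; sum == n
cdf = np.concatenate(([0], np.cumsum(m)))    # subrange boundaries: S(x) = [cdf[x], cdf[x+1])

def symbol_from_latent(z):
    """Map a latent z in {0, ..., n-1} to the symbol x whose subrange S(x) contains z."""
    return int(np.searchsorted(cdf, z, side="right")) - 1

assert symbol_from_latent(0) == 0    # z in [0, 7)   -> symbol 0 (Q(0) = 7/16)
assert symbol_from_latent(9) == 1    # z in [7, 10)  -> symbol 1 (Q(1) = 3/16)
assert symbol_from_latent(15) == 3   # z in [15, 16) -> symbol 3 (Q(3) = 1/16)
```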
The bits-back trick addresses this by putting the "wasted" information to use. When encoding symbol x_i, instead of picking an arbitrary z_i ∈ S_i(x_i), the encoder decodes log₂ m_i(x_i) bits from the current state of the compressed data, interpreting them as a number z_i′ in {0, …, m_i(x_i)−1}. The value actually encoded is z_i = z_i′ + ∑_{x′ < x_i} m_i(x′), which is guaranteed to fall within S_i(x_i). This z_i is then encoded onto the state using the uniform distribution over {0, …, n−1}, costing log₂ n bits. The net contribution is log₂ n − log₂ m_i(x_i) = −log₂ Q_i(x_i), exactly the information content of x_i (Figure 2). The SlowAnsCoder (Listing 3, inlined in Listing 4) implements this logic using the UniformCoder as an internal state. Decoding inverts the process: it decodes z_i using the uniform distribution over {0, …, n−1}, finds the x_i such that z_i ∈ S_i(x_i), calculates z_i′ = z_i − ∑_{x′ < x_i} m_i(x′), and then encodes z_i′ back onto the state using the uniform distribution over {0, …, m_i(x_i)−1}. This effectively restores the state as if the initial log₂ m_i(x_i) bits had never been decoded. The SlowAnsCoder achieves near-optimal compression but is computationally expensive because it represents the entire compressed stream as a single, potentially very large integer, leading to O(k²) runtime for a message of length k.
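A self-contained sketch of this construction (not the paper's Listings 3/4; the single big-integer state is what makes it quadratic) is shown below, reusing the toy cdf from the previous example:

```python
class SlowAnsCoder:
    """Sketch of bits-back coding on a single big integer (quadratic runtime).

    `cdf` holds the boundaries of the quantized model Q(x) = m(x)/n,
    i.e., S(x) = [cdf[x], cdf[x+1]) and cdf[-1] == n == 2**precision.
    """

    def __init__(self, precision):
        self.n = 1 << precision
        self.compressed = 0   # entire compressed stream as one big integer

    def encode(self, symbol, cdf):
        m = cdf[symbol + 1] - cdf[symbol]
        z_prime = self.compressed % m                     # "decode" ~log2(m) bits from the state
        self.compressed //= m
        z = cdf[symbol] + z_prime                         # lands inside S(symbol) by construction
        self.compressed = self.compressed * self.n + z    # encode z under the uniform model;
        # net cost: log2(n) - log2(m) = -log2 Q(symbol) bits

    def decode(self, cdf):
        z = self.compressed % self.n
        self.compressed //= self.n
        symbol = next(x for x in range(len(cdf) - 1) if cdf[x] <= z < cdf[x + 1])
        m = cdf[symbol + 1] - cdf[symbol]
        self.compressed = self.compressed * m + (z - cdf[symbol])   # give the borrowed bits back
        return symbol


# Stack semantics: decoding returns the symbols in reverse order of encoding.
coder = SlowAnsCoder(precision=4)
cdf = [0, 7, 10, 15, 16]                          # same toy model as above (n = 16)
for s in [2, 0, 3, 0]:
    coder.encode(s, cdf)
print([coder.decode(cdf) for _ in range(4)])      # -> [0, 3, 0, 2]
```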
The practical streaming ANS algorithm (Listing 5) improves efficiency by splitting the compressed data into a bulk (a growable vector of fixed-size words) and a small, fixed-capacity head (Figure 3). Arithmetic operations primarily occur on the head, which has a bounded size (e.g., 64 bits). When the head overflows or underflows the thresholds dictated by the invariants (head < 2^(2·precision), and head ≥ 2^precision whenever the bulk is non-empty), precision bits are transferred between the head and the bulk. This allows for constant (amortized) time operations on the bulk using bit shifts and masks (leveraging n = 2^precision) while keeping the head arithmetic fast. The streaming approach introduces a small, usually negligible, overhead compared to the theoretical optimum due to Benford's law effects on the bits transferred between head and bulk. The implementation (Listing 5) uses bitwise operations to avoid slow integer division during decoding.
The paper discusses variations on the basic streaming ANS. Generalized streaming allows different word_size and head_capacity configurations to tune performance and memory usage (Section 4.1). Random-access decoding (seeking) is simpler in ANS than in AC/RC because the encoder and decoder share the same state: checkpoints of the bulk length and head value can be saved during encoding and used later to seek during decoding (Listing 6); a minimal illustration follows below. A more advanced variation, the ChainCoder (sketched in Listing 8), is proposed to address a non-local effect in standard ANS where changing the model for one symbol can affect the decoding of subsequent symbols. By using separate stacks for the data decoded for the latent variable (z_i) and the data encoded back (z_i′), this ripple effect is prevented, which could be beneficial for optimizing probabilistic models end-to-end through the coder.
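As an illustration of the checkpointing idea (using the hypothetical StreamingAnsCoder sketch above, not the paper's Listing 6):

```python
# A checkpoint is just the pair (len(bulk), head) recorded during encoding.
message = [2, 0, 3, 0, 1, 1]
cdf = [0, 7, 10, 15, 16]
coder = StreamingAnsCoder(precision=4)

checkpoints = {}
for i in reversed(range(len(message))):    # encode back-to-front (stack semantics)
    coder.encode(message[i], cdf)
    checkpoints[i] = (len(coder.bulk), coder.head)

# Seek to symbol index 3 and decode the remaining symbols from there.
bulk_len, head = checkpoints[3]
coder.bulk, coder.head = coder.bulk[:bulk_len], head
print([coder.decode(cdf) for _ in range(len(message) - 3)])   # -> [0, 1, 1]
```

Truncating the bulk works because encoding only ever appends words to it, so the prefix recorded at checkpoint time is still intact when seeking.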
Finally, the paper presents constriction, an open-source library (Section 5.1) providing efficient implementations of ANS, RC, and AC, with bindings for both Rust (for performance) and Python (for research). The library aims to bridge the gap between the machine learning and systems communities in compression research. Empirical benchmarks on real-world data (Section 5.2, Table 1, Figures 4 and 5) show that both ANS and RC in constriction achieve negligible bitrate overhead (e.g., <0.1% for the default configuration) while being considerably faster than a standard Arithmetic Coding implementation. ANS typically decodes faster, while RC encodes slightly faster. The choice between ANS and RC in practice depends on the model architecture (stack semantics for latent variable models and bits-back coding vs. queue semantics for autoregressive models). Constriction provides default configurations optimized for performance across various entropy regimes, as well as tuning options such as lookup tables for specific needs.
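For orientation, a round-trip with constriction's Python bindings looks roughly like this (the calls follow the library's documented API, but exact signatures may differ between versions):

```python
import constriction
import numpy as np

message = np.array([6, 10, -4, 2, 5, 2, 1, 0, 2], dtype=np.int32)

# Entropy model: a Gaussian quantized to integer symbols in the range [-50, 50].
model = constriction.stream.model.QuantizedGaussian(-50, 50, 3.2, 9.6)

encoder = constriction.stream.stack.AnsCoder()
encoder.encode_reverse(message, model)     # ANS is a stack, so symbols are encoded in reverse
compressed = encoder.get_compressed()      # numpy array of fixed-size words

decoder = constriction.stream.stack.AnsCoder(compressed)
reconstructed = decoder.decode(model, 9)   # decode 9 symbols, front to back
assert np.all(reconstructed == message)
```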