Entropy-Normalized Serialization
- Entropy-normalized serialization is a lossless compression technique that represents a message by its lexicographic index among all symbol permutations matching the observed frequencies.
- It employs Combinatorial Entropy Encoding (CEE), an integer-based method that achieves near-optimal compression within a logarithmic overhead without relying on fractional arithmetic.
- The method is practical for text, genomic, and packet header data, though it requires transmitting symbol counts and precomputing multinomial tables for efficient encoding and decoding.
Entropy-normalized serialization refers to a class of lossless data compression techniques in which a complete message is represented by its lexicographic index among all possible symbol permutations with fixed symbol counts, yielding a bitstream whose length approaches the message's empirical Shannon entropy. Combinatorial Entropy Encoding (CEE) is the canonical methodology underpinning entropy-normalized serialization: it is purely integer-based, achieves optimal compression up to a logarithmic term, and eliminates the need for fractional arithmetic or explicit source models. In CEE, the compressed output consists of the lexicographic index (under the multinomial enumeration) plus a vector of symbol frequencies, enabling unique, lossless reconstruction of the original message (Siddique, 2017).
1. Formalism and Mathematical Foundations
Let the alphabet be $\Sigma = \{s_1, \dots, s_k\}$, with message $x = x_1 x_2 \cdots x_n$ containing $n_i$ copies of $s_i$ such that $\sum_{i=1}^{k} n_i = n$. The total number of distinct permutations is given by the multinomial coefficient:

$$M = \binom{n}{n_1, n_2, \dots, n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!}.$$

Assigning $x$ to its zero-based lexicographic rank $I$ among all such permutations ($0 \le I < M$), one obtains the compact index for entropy-normalized serialization. The value is computed as

$$I = \sum_{j=1}^{n} \; \sum_{\substack{s < x_j \\ c_{j}(s) > 0}} \frac{(n-j)!}{\prod_{i=1}^{k} c'_{j,i}!},$$

with $c_j = (c_{j,1}, \dots, c_{j,k})$ representing the symbol counts remaining before placing $x_j$, and $c'_j$ equal to $c_j$ with the count of $s$ decremented by one.
In the binary case ($k = 2$), the calculation simplifies to the combinatorial number system:

$$I = \sum_{t=1}^{m} \binom{c_t}{t},$$

where $b_i$ is the $i$th bit, indexed from the least significant position, and $c_1 < c_2 < \cdots < c_m$ are the positions $i$ of the $m$ one-bits ($b_i = 1$).
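The binary-case formula translates directly into code. The sketch below (with the illustrative helper name `binary_rank`, not from the source) ranks a bit-string, given MSB-first, among all strings of the same length and weight:

```python
from math import comb

def binary_rank(bits: str) -> int:
    """Rank of a bit-string among equal-length, equal-weight strings,
    via the combinatorial number system: sum of comb(c_t, t) over the
    one-bit positions c_1 < c_2 < ... (position 0 = least significant)."""
    index, ones = 0, 0
    for pos, b in enumerate(reversed(bits)):  # walk from the LSB upward
        if b == "1":
            ones += 1                # t, the running count of one-bits seen
            index += comb(pos, ones)  # contribution of this one-bit
    return index
```

Ranks assigned this way agree with ordering the strings by their integer value, which for fixed length coincides with lexicographic order.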
2. Encoding and Decoding Algorithms
CEE relies on iterative computation of the lexicographic index for encoding, and the inversion of this process for decoding.
Encoding
- Initialize $I = 0$ and the counter array $c = (n_1, \dots, n_k)$.
- For each symbol $x_j$, $j = 1, \dots, n$, add $(n-j)!/\prod_i c'_i!$ to $I$ for every $s$ with $s < x_j$ and $c_s > 0$, where $c'$ is $c$ with $c_s$ decremented by one.
- Decrement $c_{x_j}$.
- Return $I$.
Decoding
- Initialize $c = (n_1, \dots, n_k)$ as above, and set $I$ to the received index.
- For each output position $j$, iterate over the symbols $s$ with $c_s > 0$ in lexicographic order: subtract the multinomial term $W_s$ from $I$ as long as $I \ge W_s$, where $W_s = (n-j)!/\prod_i c'_i!$ is the term defined as above.
- Once $I < W_s$, set $x_j = s$, decrement $c_s$, and retain the remaining $I$.
- Continue until all symbols are decoded.
With precomputed integer multinomial tables, this process reduces to additions, subtractions, and table lookups, involving no multiplications at encode time.
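The encode/decode pair above can be sketched in Python as follows. This is a minimal illustration (function names are ours): for clarity each multinomial is recomputed from factorials rather than read from the precomputed tables the text assumes.

```python
from math import factorial

def multinomial(counts: dict) -> int:
    """Number of distinct permutations of the multiset with these counts."""
    m = factorial(sum(counts.values()))
    for c in counts.values():
        m //= factorial(c)
    return m

def cee_encode(msg: str) -> int:
    """Zero-based lexicographic rank of msg among permutations of its multiset."""
    counts = {}
    for s in msg:
        counts[s] = counts.get(s, 0) + 1
    index = 0
    for x in msg:
        for s in sorted(counts):
            if s >= x:
                break
            if counts[s] > 0:
                counts[s] -= 1
                index += multinomial(counts)  # permutations had s been placed here
                counts[s] += 1
        counts[x] -= 1  # consume the actual symbol
    return index

def cee_decode(index: int, counts: dict) -> str:
    """Invert cee_encode given the index and the symbol-count vector."""
    counts = dict(counts)
    out = []
    for _ in range(sum(counts.values())):
        for s in sorted(counts):
            if counts[s] == 0:
                continue
            counts[s] -= 1
            block = multinomial(counts)  # permutations beginning with s
            if index < block:            # target lies in this block: emit s
                out.append(s)
                break
            index -= block               # skip past the block and try next symbol
            counts[s] += 1
    return "".join(out)
```

A production implementation would replace `multinomial` with table lookups so that the encoder performs only additions, as the text notes.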
3. Efficiency, Computational Complexity, and Operational Trade-Offs
CEE requires $O(nk)$ lookups per message of length $n$ over an alphabet of size $k$, reducing to $O(n)$ in the binary case. Per-symbol operations are integer additions, subtractions, and array indexing. Space complexity is dominated by storage for factorial or multinomial tables, with precomputation in $O(n)$ (or $O(n^2)$ for Pascal's triangle in the binary case).
| Operation | Encoding/Decoding Complexity | Memory (Precomputed Tables) |
|---|---|---|
| Arbitrary alphabet (size $k$) | $O(nk)$ | $O(n)$ factorial table |
| Binary alphabet | $O(n)$ | $O(n^2)$ Pascal's triangle or $O(n)$ factorials |
No explicit entropy model is required, and the side-information cost ($O(k \log n)$ bits for the count vector $(n_1, \dots, n_k)$) becomes negligible as $n \to \infty$. Large alphabet sizes increase decoding cost, because the decoder branches over up to $k$ candidate symbols per output position.
4. Compression Bound and Relation to Shannon Entropy
CEE achieves a theoretical code length of $\lceil \log_2 M \rceil$ bits for message $x$. By combinatorial analysis (Stirling's approximation),

$$\log_2 M = n\,H(\hat p) - O(k \log n),$$

where $\hat p_i = n_i / n$ and $H(\hat p) = -\sum_{i=1}^{k} \hat p_i \log_2 \hat p_i$. The redundancy per symbol vanishes as $n \to \infty$, guaranteeing asymptotic optimality matching the Shannon entropy. Since $M \le 2^{n H(\hat p)}$, CEE uses fewer than $n\,H(\hat p) + 1$ bits for the index even at finite $n$, outperforming Huffman and fixed-length codes in such settings.
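The bound can be checked numerically in the binary case. The sketch below (helper names are ours) compares the empirical entropy $n H(\hat p)$ with the exact index length $\log_2 \binom{n}{n_1}$; the gap stays logarithmic in $n$, so the per-symbol redundancy vanishes:

```python
from math import comb, log2

def index_bits(n: int, n1: int) -> float:
    """Exact bits needed for the binary-case index: log2 C(n, n1)."""
    return log2(comb(n, n1))

def shannon_bits(n: int, n1: int) -> float:
    """n * H(n1/n) for the empirical symbol distribution."""
    p = n1 / n
    if p in (0.0, 1.0):
        return 0.0
    return n * (-(p * log2(p) + (1 - p) * log2(1 - p)))

for n in (10, 100, 1000, 10000):
    n1 = n // 4
    gap = shannon_bits(n, n1) - index_bits(n, n1)
    print(n, round(gap, 2))  # gap grows only like O(log n), never linearly
```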
5. Detailed Worked Examples
Binary Case: For $x = 11010$ with $n = 5$, $m = 3$ ones, $M = \binom{5}{3} = 10$, and bits $b_4$ to $b_0$ indexed from the right:

- $b_0 = 0$: no addition, $I = 0$.
- $b_1 = 1$: add $\binom{1}{1} = 1$, $I = 1$.
- $b_2 = 0$: no addition, $I = 1$.
- $b_3 = 1$: add $\binom{3}{2} = 3$, $I = 4$.
- $b_4 = 1$: add $\binom{4}{3} = 4$, $I = 8$.

Total $I = 8$. The bitstream “1000” ($\lceil \log_2 10 \rceil = 4$ bits) and the count $m = 3$ suffice to reconstruct $x$.
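Reconstruction inverts these sums greedily from the most significant position. The sketch below (with the illustrative helper name `binary_unrank`, under the same conventions, LSB = position 0) sets a one-bit wherever the rank still covers that bit's contribution:

```python
from math import comb

def binary_unrank(index: int, n: int, ones: int) -> str:
    """Recover the n-bit string with the given number of ones from its
    rank, inverting the combinatorial number system from the MSB down."""
    bits = []
    for pos in range(n - 1, -1, -1):
        # A one-bit at this position would contribute comb(pos, ones).
        if ones > 0 and index >= comb(pos, ones):
            index -= comb(pos, ones)
            bits.append("1")
            ones -= 1
        else:
            bits.append("0")
    return "".join(bits)
```

Applied to the example, the rank from the worked encoding together with $n = 5$ and $m = 3$ recovers the original bit-string.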
Non-binary Case: The message “BANANA” ($n = 6$) over $\Sigma = \{A, B, N\}$ with counts $(n_A, n_B, n_N) = (3, 1, 2)$ yields a lexicographic index of $I = 34$ among the $M = 6!/(3!\,1!\,2!) = 60$ possible permutations. Serialized, this requires $\lceil \log_2 60 \rceil = 6$ bits (“100010”) plus the symbol counts.
6. Comparative Analysis with Huffman and Arithmetic Coding
CEE maps the entire message to a single integer using purely integer operations, contrasting with:
- Huffman Coding: Assigns static, integer-length codewords; suffers inefficiency for non-dyadic distributions and cannot utilize fractional bits.
- Arithmetic Coding: Maps to a fractional interval via repeated multiplication and renormalization, requiring real arithmetic, high precision, or scaled integer arithmetic.
CEE, by contrast, performs only additions and avoids multiplication at encode time. It does not require prior knowledge or estimation of probabilities but operates on observed symbol frequencies per block.
7. Applications, Limitations, and Open Questions
Applications include lossless compression scenarios for small alphabets (such as textual, genomic, or packet header data), embedded or hardware systems lacking floating-point units, and any situation favoring efficient block-adaptive coding.
Limitations are:
- Requirement to transmit symbol counts per block (amortized for large $n$).
- Memory overhead for binomial/multinomial tables when $n$ is large.
- Linear cost in the alphabet size during decoding, impacting scalability for large $k$.
- Not suited to streaming contexts without buffering, as symbol counts must be known per block.
Open problems include unifying the side-information and index streams to remove header redundancy, and adaptation for Markov or context-based sources beyond IID models, suggesting the need for hybrid or hierarchical models atop CEE (Siddique, 2017).