Context Tree Switching (CTS) Algorithm
- Context Tree Switching is a universal coding and prediction algorithm that extends the CTW framework by dynamically switching between expert predictors.
- It employs a modified model-mixing mechanism at each node to combine memoryless and split-context predictors, achieving improved empirical compression on benchmarks like the Calgary Corpus.
- The algorithm maintains O(nD) complexity with explicit redundancy bounds for PST-representable binary sources, making it suitable for modern high-throughput sequence modeling and online compression.
Context Tree Switching (CTS) is a universal coding and prediction algorithm that extends the Context Tree Weighting (CTW) technique for binary, stationary, $D$-Markov sources. CTS modifies the model-mixing mechanism of CTW to encompass a larger model class with no additional asymptotic time or space complexity, and provably maintains efficient redundancy rates for coding. CTS achieves improved empirical compression performance relative to CTW, particularly on natural data streams such as those from the Calgary Corpus, and advances the state of the art for universal, provably optimal sequence prediction and compression in binary domains (Veness et al., 2011).
1. Model Assumptions and Formal Setup
CTS assumes the data sequence $x_{1:n} = x_1 x_2 \cdots x_n$, where each $x_i \in \{0, 1\}$, is generated by a stationary $D$-Markov source. There exists an (unknown) Markov order $D \geq 0$ such that the conditional source probability is local:

$$\Pr(x_t \mid x_{1:t-1}) = \Pr(x_t \mid x_{t-D:t-1}).$$
This model is represented by a Prediction Suffix Tree (PST) $(S, \Theta)$ with leaf parameters $\Theta = \{\theta_s \in [0,1] : s \in L(S)\}$. For any $t$, where $s \in L(S)$ is the unique suffix of $x_{t-D:t-1}$ among the leaves, the leaf probability is $\Pr(x_t = 1 \mid x_{1:t-1}) = \theta_s$.
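As a minimal illustration of the PST prediction rule, the following sketch looks up the unique leaf whose context is a suffix of the history; the helper name and the leaf set are illustrative, not from the paper:

```python
def pst_predict(leaves, history):
    """Return theta_s for the unique leaf s that is a suffix of the history.

    `leaves` maps context strings (most recent symbol last) to theta_s,
    the probability that the next symbol is 1. Assumes a proper, complete
    leaf set: exactly one leaf suffixes any sufficiently long history.
    """
    for s, theta in leaves.items():
        if history.endswith(s):
            return theta
    raise ValueError("no leaf matches this history")

# Illustrative depth-2 PST with leaves {"0", "01", "11"}:
leaves = {"0": 0.3, "01": 0.8, "11": 0.6}
print(pst_predict(leaves, "110"))   # history ends in "0"  -> 0.3
print(pst_predict(leaves, "101"))   # history ends in "01" -> 0.8
```

Note that the leaf set partitions all histories: every history ending in 0 maps to leaf "0", while histories ending in 1 are split by the preceding symbol.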
Universal coders such as CTS are evaluated by code-length $\ell(x_{1:n}) = -\log_2 \rho(x_{1:n})$, where $\rho$ is the model probability, and redundancy $R_n = \ell(x_{1:n}) + \log_2 \mu(x_{1:n})$ relative to the true source $\mu$. Achieving $R_n = o(n)$ uniformly over the source class is the target for universal codes.
2. Context Tree Weighting vs. Context Tree Switching
CTW recursively combines local models at each context: for a context string $s$ of length at most $D$, define $x^s_{1:t}$ as the subsequence of $x_{1:t}$ consisting of symbols whose immediately preceding symbols have $s$ as their most recent suffix. The Krichevsky–Trofimov (KT) estimate for such a context is $P_{kt}(x^s_{1:t}) = \int_0^1 \theta^{a}(1-\theta)^{b}\, w_{1/2}(\theta)\, d\theta$, a Bayesian mixture over Bernoulli parameters, where $w_{1/2}$ is the Jeffreys $\mathrm{Beta}(1/2, 1/2)$ prior and $a$, $b$ count the ones and zeros in $x^s_{1:t}$.
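In sequential form, the KT mixture reduces to the closed-form one-step rule $(a_x + 1/2)/(a + b + 1)$, so $P_{kt}$ can be accumulated as a running product; a minimal sketch (class name illustrative):

```python
class KTEstimator:
    """Krichevsky-Trofimov estimator for a binary memoryless source.

    Computes P_kt(x_{1:t}) as a product of one-step predictive
    probabilities (count(bit) + 1/2) / (t + 1), which equals the
    Bayesian mixture over Bernoulli parameters under Beta(1/2, 1/2).
    """

    def __init__(self):
        self.counts = [0, 0]  # counts of 0s and 1s seen so far

    def predict(self, bit):
        """One-step predictive probability that the next symbol is `bit`."""
        return (self.counts[bit] + 0.5) / (self.counts[0] + self.counts[1] + 1)

    def update(self, bit):
        self.counts[bit] += 1

kt = KTEstimator()
p = 1.0
for x in [1, 1, 0, 1]:
    p *= kt.predict(x)
    kt.update(x)
# p now equals P_kt(1101) = (1/2)(3/4)(1/6)(5/8) = 5/128 = 0.0390625
```

Because the KT estimate is exchangeable, the final probability depends only on the counts $(a, b)$, not on the order of the bits.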
The standard CTW recursion:

$$P^s_w(x_{1:t}) = \begin{cases} P_{kt}(x^s_{1:t}) & \text{if } |s| = D, \\[2pt] \tfrac{1}{2}\, P_{kt}(x^s_{1:t}) + \tfrac{1}{2}\, P^{0s}_w(x_{1:t})\, P^{1s}_w(x_{1:t}) & \text{otherwise.} \end{cases}$$
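The recursion can be evaluated naively, bottom-up, for short sequences. This exponential-time sketch (helper names illustrative) recomputes every context subsequence from scratch purely to make the definition concrete; practical CTW updates only the $D+1$ nodes on the current context path:

```python
def kt(bits):
    """P_kt of a bit list via the product of one-step KT predictions."""
    p, counts = 1.0, [0, 0]
    for b in bits:
        p *= (counts[b] + 0.5) / (counts[0] + counts[1] + 1)
        counts[b] += 1
    return p

def ctw(x, D, past, s=()):
    """P_w^s: KT estimate at maximum depth, else the 1/2-1/2 mixture of
    the KT estimate and the product of the two child probabilities.

    `past` supplies D symbols of context preceding x (e.g. zero padding);
    `s` is the context suffix in chronological order (oldest first).
    """
    full = list(past) + list(x)
    k = len(past)
    # Symbols of x whose immediately preceding symbols end with s:
    xs = [x[i] for i in range(len(x))
          if tuple(full[k + i - len(s):k + i]) == s]
    if len(s) == D:
        return kt(xs)
    return 0.5 * kt(xs) + 0.5 * ctw(x, D, past, (0,) + s) * ctw(x, D, past, (1,) + s)

p = ctw([1, 0], D=1, past=[0])
# p = 1/2 * P_kt(10) + 1/2 * P_kt(1) * P_kt(0)
#   = 0.5 * 0.125 + 0.5 * 0.25 = 0.1875
```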
CTS generalizes this by switching between two "experts" at each node, using a time-varying mixture rather than CTW's fixed 1/2-1/2 weighting:
- Expert 0: the memoryless KT predictor $P_{kt}(x^s_{1:t})$
- Expert 1: the split-context predictor $P^{0s}_w(x_{1:t})\, P^{1s}_w(x_{1:t})$, whose one-step predictive gain is $r^s(x_{t+1}) = \dfrac{P^{0s}_w(x_{1:t+1})\, P^{1s}_w(x_{1:t+1})}{P^{0s}_w(x_{1:t})\, P^{1s}_w(x_{1:t})}$
CTS maintains two nonnegative weights $w^s_0$ and $w^s_1$ per context $s$, summing to $P^s(x_{1:t})$, and a global or per-node sequence of switch rates, e.g., $\alpha_t = 1/t$ for all nodes.
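The resulting per-node computation is a two-expert fixed-share update; the sketch below assumes the paper's switching scheme takes this standard form (function name illustrative):

```python
def switch_step(w0, w1, p_kt, p_split, alpha):
    """Two-expert fixed-share update at one context node.

    w0, w1: weights of the KT expert and the split-context expert;
            their sum is the node's current probability P^s(x_{1:t}).
    p_kt, p_split: each expert's one-step predictive probability of x_{t+1}.
    alpha: switch rate for this step (e.g. 1/(t+1)).
    Returns updated weights; their sum is the new P^s(x_{1:t+1}).
    """
    new_w0 = (1 - alpha) * w0 * p_kt + alpha * w1 * p_split
    new_w1 = alpha * w0 * p_kt + (1 - alpha) * w1 * p_split
    return new_w0, new_w1

# With equal one-step predictions of 0.5, the node's total probability
# halves regardless of alpha:
w0, w1 = switch_step(0.5, 0.5, 0.5, 0.5, alpha=0.5)
# w0 + w1 == 0.5 (was 1.0)
```

Setting `alpha = 0` recovers a fixed Bayesian mixture of the two experts; a decaying rate such as $1/t$ lets the mixture commit to one expert while retaining the ability to switch later.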
3. Algorithmic Recursion and Update Scheme
At each symbol $x_t$, CTS traverses the context tree from the root to the depth-$D$ node for the context $x_{t-D:t-1}$. The update scheme is as follows:
- At each internal node $s$ on this path, let $p_0$ denote the KT expert's one-step prediction of $x_t$ and $p_1$ the split expert's one-step prediction (the factor by which $P^{0s}_w \cdot P^{1s}_w$ grows on observing $x_t$). The fixed-share weight update is $w^s_0 \leftarrow (1-\alpha_t)\, w^s_0\, p_0 + \alpha_t\, w^s_1\, p_1$ and $w^s_1 \leftarrow \alpha_t\, w^s_0\, p_0 + (1-\alpha_t)\, w^s_1\, p_1$, after which $P^s(x_{1:t}) = w^s_0 + w^s_1$.
- Base cases: for $|s| = D$, $P^s(x_{1:t}) = P_{kt}(x^s_{1:t})$; new nodes are initialized with $w^s_0 = w^s_1 = 1/2$.
CTS thus implements an online mixture over "stay-with-expert" or "switch-expert" decision paths at each node, allowing the effective model to dynamically adapt context depth and composition. The per-symbol time complexity is $O(D)$; storage is $O(nD)$ for fixed $D$.
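Putting the pieces together, a minimal $O(D)$-per-symbol CTS predictor might look as follows. This is a sketch assuming a fixed-share switching update with switch rate $1/(t+1)$ and a zero-padded initial context; class and method names are illustrative, and a production coder would work in log space to avoid underflow:

```python
class Node:
    """Per-context state: KT counts, the two expert weights, and P^s."""
    def __init__(self):
        self.counts = [0, 0]   # KT counts of 0s and 1s seen in this context
        self.w = [0.5, 0.5]    # weights of (KT expert, split expert)
        self.p = 1.0           # P^s(x_{1:t}); equals w[0] + w[1] internally
        self.children = {}

class CTS:
    def __init__(self, depth):
        self.depth, self.root, self.t = depth, Node(), 0

    def _path(self, history):
        """Nodes from the root down to the depth-D context node."""
        nodes, node = [self.root], self.root
        for i in range(1, self.depth + 1):
            bit = history[-i] if i <= len(history) else 0  # zero-padded past
            node = node.children.setdefault(bit, Node())
            nodes.append(node)
        return nodes

    def update(self, history, x):
        """Observe symbol x after `history`; return P(x | history)."""
        self.t += 1
        alpha = 1.0 / (self.t + 1)              # switch rate for this step
        nodes = self._path(history)
        gain = None                             # child's one-step probability
        for d in range(self.depth, -1, -1):     # bottom-up along the path
            node, old_p = nodes[d], nodes[d].p
            p_kt = (node.counts[x] + 0.5) / (sum(node.counts) + 1)
            if d == self.depth:                 # deepest node: KT expert only
                node.p = old_p * p_kt
            else:                               # internal: fixed-share switch
                w0, w1 = node.w
                node.w = [(1 - alpha) * w0 * p_kt + alpha * w1 * gain,
                          alpha * w0 * p_kt + (1 - alpha) * w1 * gain]
                node.p = node.w[0] + node.w[1]
            node.counts[x] += 1
            gain = node.p / old_p               # one-step probability at s
        return gain                             # root's P(x | history)

model, history = CTS(depth=2), []
p1 = model.update(history, 1); history.append(1)  # p1 == 0.5 (uniform start)
p2 = model.update(history, 1); history.append(1)  # p2 > 0.5 after seeing a 1
```

Only one child of each internal node changes per symbol, so the split expert's one-step prediction is exactly the on-path child's probability ratio, which is what makes the bottom-up pass $O(D)$.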
4. Theoretical Properties and Redundancy Bounds
CTS matches the computational complexity of CTW, at $O(nD)$ total update time and $O(nD)$ memory. It provides rigorous redundancy bounds for any binary stationary source representable as a PST of depth at most $D$.
Let $(S, \Theta)$ be a PST with structure $S$ of depth at most $D$, leaf set $L(S)$, and leaf parameters $\Theta$. The cumulative code-length satisfies:

$$-\log_2 P_{cts}(x_{1:n}) \;\leq\; \Gamma_D(S) \;+\; |L(S)|\left(\tfrac{3}{2}\log_2 n + O(1)\right) \;-\; \log_2 P_{S,\Theta}(x_{1:n})$$

where:
- $\Gamma_D(S)$: CTW structure penalty (the code length of describing the tree structure $S$),
- $|L(S)|\left(\tfrac{3}{2}\log_2 n + O(1)\right)$: the combined parameter cost ($\tfrac{1}{2}\log_2 n$ per leaf) and switching cost (at most $\log_2 n$ per leaf),
- $-\log_2 P_{S,\Theta}(x_{1:n})$: log-likelihood for tree $S$ with parameters $\Theta$.
Thus, wrapped in an arithmetic coder, the total redundancy is

$$R_n \;=\; O\!\left(\Gamma_D(S) + |L(S)| \log_2 n\right),$$

matching CTW up to an additive $O(|L(S)| \log_2 n)$ switching term: $O(\log n)$ redundancy per model parameter for fixed $D$.
5. Empirical Evaluation and Comparative Performance
CTS and CTW were evaluated on the 14-file Calgary Corpus, with compression measured in average bits/byte using binary arithmetic coding. Several algorithmic variants were tested, with and without enhancements such as count-halving, binary decomposition, and zero-redundancy estimators.
| Corpus | CTW (D=48) | CTS (D=48) | CTW* (D=48) | CTS* (D=48) |
|---|---|---|---|---|
| bib | 2.25 | 2.23 | 1.83 | 1.79 |
| book1 | 2.31 | 2.32 | 2.18 | 2.19 |
| geo | 5.01 | 5.05 | 4.53 | 4.18 |
- Unenhanced CTS achieves up to 7% smaller compressed size than CTW, with no case more than 1% worse. Enhanced CTS* at $D=48$ provides up to 8% better compression than CTW*. Increasing the tree depth beyond $D=48$ in the enhanced variants further reduces bits/byte (e.g., 1.31 on "trans").
- Weighted averages over the corpus (bits/byte): PPM* (2.09), CTW* (1.99), PPMZ (1.93), CTS* (1.93), Deplump PPM variant (1.89). CTS* thereby matches PPMZ and closely approaches the leading Deplump variant, while maintaining universality and provable complexity (Veness et al., 2011).
6. Context and Significance
CTS broadens the model class over which effective mixture prediction is performed relative to CTW, incorporating arbitrary switching between context segmentation and memoryless prediction at each node. By preserving computational and asymptotic efficiency, and providing explicit redundancy guarantees, CTS is a robust universal coding method for binary, stationary sources. Its empirical advantage is consistently demonstrated on standard benchmarks, and its theoretical regime includes all $D$-Markov sources representable by PSTs of bounded depth. This suggests CTS is well-suited for modern applications where adaptivity, universality, and provable performance are critical, such as high-throughput sequence modeling and online compression (Veness et al., 2011).