
Context Tree Switching (CTS) Algorithm

Updated 9 February 2026
  • Context Tree Switching is a universal coding and prediction algorithm that extends the CTW framework by dynamically switching between expert predictors.
  • It employs a modified model-mixing mechanism at each node to combine memoryless and split-context predictors, achieving improved empirical compression on benchmarks like the Calgary Corpus.
  • The algorithm maintains O(nD) complexity with explicit redundancy bounds for PST-representable binary sources, making it suitable for modern high-throughput sequence modeling and online compression.

Context Tree Switching (CTS) is a universal coding and prediction algorithm that extends the Context Tree Weighting (CTW) technique for binary, stationary, $n$-Markov sources. CTS modifies the model-mixing mechanism of CTW to encompass a larger model class with no additional asymptotic time or space complexity, and provably maintains efficient redundancy rates for coding. CTS achieves improved empirical compression performance relative to CTW, particularly on natural data streams such as those from the Calgary Corpus, and advances the state of the art for universal, provably optimal sequence prediction and compression in binary domains (Veness et al., 2011).

1. Model Assumptions and Formal Setup

CTS assumes the data sequence $x_{1:n}$, with $x_i \in \mathsf{X} := \{0,1\}$, is generated by a stationary $n$-Markov source. There exists an (unknown) Markov order $D$ such that the conditional source probability is local:

$$\mu(x_t \mid x_{1:t-1}) = \mu(x_t \mid x_{t-D:t-1})$$

This model is represented by a Prediction Suffix Tree (PST) $S \subseteq \mathsf{X}^{\leq D}$ with leaf parameters $\Theta_S \in [0,1]^{|S|}$. If $s \in S$ is the unique suffix of the context $x_{t-D:t-1}$ lying in $S$, the leaf probability is $\mu(x_t = 1 \mid x_{1:t-1}) = \theta_s$.
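
As a concrete illustration of PST prediction (not code from the paper), here is a minimal Python sketch; the encoding of contexts as tuples with the most recent bit first, and the `leaves` mapping from suffixes to their $\theta_s$ values, are assumptions for the example:

```python
def pst_predict(leaves, context):
    """Return theta_s, the probability that the next bit is 1, for the
    unique leaf suffix s of `context` (most recent bit first)."""
    for d in range(len(context) + 1):
        s = tuple(context[:d])
        if s in leaves:
            return leaves[s]
    raise KeyError("context matches no leaf of the PST")

# A hypothetical depth-2 PST with leaf set {0, 10, 11}:
leaves = {(0,): 0.9, (1, 0): 0.2, (1, 1): 0.5}
pst_predict(leaves, (0, 1))  # last bit was 0 -> leaf (0,) -> theta = 0.9
```

Because $S$ is a suffix set, exactly one leaf matches any sufficiently long context, so the scan over increasing suffix lengths terminates at a unique $\theta_s$.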

Universal coders such as CTS are evaluated by code length $l(x_{1:n}) = -\log_2 \rho(x_{1:n}) + O(1)$, where $\rho$ is the model probability, and by redundancy $r(x_{1:n}) = l(x_{1:n}) + \log_2 \mu(x_{1:n})$. Achieving $r = O(\log n)$ is the target for universal codes.
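
Concretely, with toy numbers (hypothetical probabilities, not values from the paper):

```python
import math

# Suppose the model assigns rho(x_{1:n}) = 2**-10 while the true source
# assigns mu(x_{1:n}) = 2**-8 (made-up values for illustration).
code_len = math.ceil(-math.log2(2 ** -10))   # idealized arithmetic-code length: 10 bits
redundancy = code_len + math.log2(2 ** -8)   # r = l + log2(mu) = 10 - 8 = 2 bits
```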

2. Context Tree Weighting vs. Context Tree Switching

CTW recursively combines local models at each context: for a context $c$ of length $\leq D$, define $x^c_{1:n}$ as the subsequence of $x_{1:n}$ consisting of those symbols whose preceding context has suffix $c$. The Krichevsky–Trofimov (KT) estimate for such a context is $\xi_{KT}(x^c_{1:k})$, a Bayesian mixture over Bernoulli parameters.
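
In sequential form, the KT estimator predicts a 1 with probability $(b + 1/2)/(a + b + 1)$ after seeing $a$ zeros and $b$ ones, and $\xi_{KT}$ is the product of these one-step predictions. A minimal illustrative sketch in Python, using exact rational arithmetic:

```python
from fractions import Fraction

def kt_probability(bits):
    """KT block probability: the product of sequential KT predictions,
    where the next symbol is 1 w.p. (ones + 1/2) / (zeros + ones + 1)."""
    prob, zeros, ones = Fraction(1), 0, 0
    for b in bits:
        p_one = Fraction(2 * ones + 1, 2 * (zeros + ones + 1))
        prob *= p_one if b == 1 else 1 - p_one
        ones += b
        zeros += 1 - b
    return prob
```

For instance, `kt_probability([0, 1])` is 1/8; since the estimate depends only on the counts, any permutation of the bits yields the same probability.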

The standard CTW recursion:

$$\mathrm{ctw}_D^c(x_{1:n}) = \frac{1}{2}\, \xi_{KT}(x^c_{1:n}) + \frac{1}{2}\, \mathrm{ctw}_{D-1}^{0c}(x_{1:n}) \cdot \mathrm{ctw}_{D-1}^{1c}(x_{1:n})$$

CTS generalizes this by switching between two "experts" at each node, using a time-varying mixture rather than a fixed weight scheme:

  • Expert 0: the memoryless KT predictor $\xi_{KT}$
  • Expert 1: the split-context predictor, whose one-step predictive gain is:

$$z_D^c(x_n \mid x_{<n}) = \frac{\mathrm{cts}_{D-1}^{0c}(x_{1:n})}{\mathrm{cts}_{D-1}^{0c}(x_{<n})} \cdot \frac{\mathrm{cts}_{D-1}^{1c}(x_{1:n})}{\mathrm{cts}_{D-1}^{1c}(x_{<n})}$$

CTS maintains two nonnegative weights $k_c$ and $s_c$ per context $c$, summing to $\mathrm{cts}_D^c(x_{1:n})$, and a global or local sequence of switch rates, e.g., $\alpha_n^c = 1/n$ for all nodes.

3. Algorithmic Recursion and Update Scheme

At each symbol $x_n$, CTS traverses the context tree from root to leaf along the context $c = \phi_D(x_{<n})$. The update scheme is as follows:

$$\begin{aligned} \mathrm{cts}_D^c(x_{1:n}) &\gets k_c \cdot \xi_{KT}(x^c_n \mid x^c_{<n}) + s_c \cdot z_D^c(x_n \mid x_{<n}) \\ k_c &\gets \alpha_{n+1}^c\, \mathrm{cts}_D^c(x_{1:n}) + (1 - 2\alpha_{n+1}^c)\, k_c\, \xi_{KT}(x^c_n \mid x^c_{<n}) \\ s_c &\gets \alpha_{n+1}^c\, \mathrm{cts}_D^c(x_{1:n}) + (1 - 2\alpha_{n+1}^c)\, s_c\, z_D^c(x_n \mid x_{<n}) \end{aligned}$$

  • Base cases: for $D = 0$, $\mathrm{cts}_0^c(x_{1:n}) \equiv \xi_{KT}(x^c_{1:n})$; new nodes are initialized with $k_c = s_c = 1/2$.

CTS thus implements an online mixture over "stay-with-expert" or "switch-expert" decision paths at each node, allowing the effective model to adapt context depth and composition dynamically. The per-symbol time cost is $O(D)$, with at most $O(D)$ new storage per symbol.

4. Theoretical Properties and Redundancy Bounds

CTS matches the computational complexity of CTW: $O(nD)$ total update time and $O(nD)$ worst-case space. It provides rigorous redundancy bounds for any binary stationary source representable as a PST of depth at most $D$.

Let $(S, \Theta_S)$ be a PST of depth at most $D$, with $|S|$ leaves and maximum context length $d(S)$. The cumulative code length satisfies:

$$-\log_2 \mathrm{cts}_D(x_{1:n}) \leq \Gamma_D(S) + [d(S)+1] \log_2 n + |S|\, \gamma(n/|S|) - \log_2 \Pr(x_{1:n} \mid S, \Theta_S)$$

where:

  • $\Gamma_D(S)$ is the CTW structure penalty,
  • $\gamma(k) = \frac{1}{2}\log_2 k + 1$,
  • $-\log_2 \Pr(x_{1:n} \mid S, \Theta_S)$ is the log-loss of tree $S$ with parameters $\Theta_S$.

Thus, wrapped in an arithmetic coder, the total redundancy is

$$\Gamma_D(S) + [d(S)+1] \log_2 n + |S|\, \gamma(n/|S|) + 2$$

matching CTW up to an additive $[d(S)+1] \log_2 n$ term, i.e., $O(\log n)$ redundancy for fixed $D$.
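
To get a feel for the scale of this bound, one can plug in numbers. The sketch below assumes the standard CTW structure-cost bound $\Gamma_D(S) \leq 2|S| - 1$; the specific PST is hypothetical.

```python
import math

def gamma(k):
    """gamma(k) = (1/2) log2 k + 1, the per-leaf parameter cost."""
    return 0.5 * math.log2(k) + 1

def redundancy_bound(n, leaves, d_s, structure_bits):
    """Evaluate Gamma_D(S) + [d(S)+1] log2 n + |S| gamma(n/|S|) + 2, in bits."""
    return structure_bits + (d_s + 1) * math.log2(n) + leaves * gamma(n / leaves) + 2

# A hypothetical PST with |S| = 8 leaves and d(S) = 3, over n = 2**20 bits,
# taking Gamma_D(S) = 2*8 - 1 = 15 (assumed structure-cost bound):
bits = redundancy_bound(2 ** 20, 8, 3, 15)   # 15 + 80 + 76 + 2 = 173 bits
```

At this length the bound works out to under 0.0002 bits of redundancy per symbol, illustrating the $O(\log n)$ behavior.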

5. Empirical Evaluation and Comparative Performance

CTS and CTW were evaluated on the 14-file Calgary Corpus, with compression measured in average bits/byte using binary arithmetic coding. Several algorithmic variants were tested, with and without enhancements such as count-halving, binary decomposition, and zero-redundancy estimators.

Corpus    CTW (D=48)   CTS (D=48)   CTW* (D=48)   CTS* (D=48)
bib          2.25         2.23         1.83          1.79
book1        2.31         2.32         2.18          2.19
geo          5.01         5.05         4.53          4.18
  • Unenhanced CTS achieves up to 7% smaller compressed size than CTW, and is never more than 1% worse. Enhanced CTS* with $D = 48$ provides up to 8% better compression than CTW*. Increasing the tree depth to $D = 160$ with enhancements reduces bits/byte further (e.g., 1.31 on "trans").
  • Weighted averages: PPM* (2.09), CTW* (1.99), PPMZ (1.93), CTS* (1.93), Deplump PPM variant (1.89). CTS* thereby matches PPMZ and closely approaches the leading Deplump variant, while retaining universality and provable $O(nD)$ complexity (Veness et al., 2011).

6. Context and Significance

CTS broadens the model class over which effective mixture prediction is performed relative to CTW, incorporating arbitrary switching between context segmentation and memoryless prediction at each node. By preserving computational and asymptotic efficiency, and by providing explicit redundancy guarantees, CTS is a robust universal coding method for binary, stationary sources. Its empirical advantage is consistently demonstrated on standard benchmarks, and its theoretical regime includes all $D$-Markov sources representable by PSTs of bounded depth. This suggests CTS is well suited for modern applications where adaptivity, universality, and provable performance are critical, such as high-throughput sequence modeling and online compression (Veness et al., 2011).

References

  • Veness, J., Ng, K. S., Hutter, M., & Bowling, M. (2011). Context Tree Switching.