
Context Tree Switching (CTS) Algorithm

Updated 9 February 2026
  • Context Tree Switching is a universal coding and prediction algorithm that extends the CTW framework by dynamically switching between expert predictors.
  • It employs a modified model-mixing mechanism at each node to combine memoryless and split-context predictors, achieving improved empirical compression on benchmarks like the Calgary Corpus.
  • The algorithm maintains O(nD) complexity with explicit redundancy bounds for PST-representable binary sources, making it suitable for modern high-throughput sequence modeling and online compression.

Context Tree Switching (CTS) is a universal coding and prediction algorithm that extends the Context Tree Weighting (CTW) technique for binary, stationary, $n$-Markov sources. CTS modifies the model-mixing mechanism of CTW to encompass a larger model class with no additional asymptotic time or space complexity, and provably maintains efficient redundancy rates for coding. CTS achieves improved empirical compression performance relative to CTW, particularly on natural data streams such as those from the Calgary Corpus, and advances the state of the art for universal, provably optimal sequence prediction and compression in binary domains (Veness et al., 2011).

1. Model Assumptions and Formal Setup

CTS assumes the data sequence $x_{1:n}$, with $x_i \in \mathsf{X} := \{0,1\}$, is generated by a stationary $n$-Markov source. There exists an (unknown) Markov order $D$ such that the conditional source probability is local:

$$\mu(x_t \mid x_{1:t-1}) = \mu(x_t \mid x_{t-D:t-1})$$

This model is represented by a Prediction Suffix Tree (PST) $S \subseteq \mathsf{X}^{\leq D}$ with leaf parameters $\Theta_S \in [0,1]^{|S|}$. If $s \in S$ is the unique suffix of the context $x_{t-D:t-1}$ lying in $S$, the leaf probability is $\mu(x_t = 1 \mid x_{1:t-1}) = \theta_s$.
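
As a concrete illustration of PST prediction (not code from the paper), here is a minimal Python sketch; the encoding of contexts as tuples with the most recent bit first, and the `leaves` mapping from suffixes to their $\theta_s$ values, are assumptions for the example:

```python
def pst_predict(leaves, context):
    """Return theta_s, the probability that the next bit is 1, for the
    unique leaf suffix s of `context` (most recent bit first)."""
    for d in range(len(context) + 1):
        s = tuple(context[:d])
        if s in leaves:
            return leaves[s]
    raise KeyError("context matches no leaf of the PST")

# A hypothetical depth-2 PST with leaf set {0, 10, 11}:
leaves = {(0,): 0.9, (1, 0): 0.2, (1, 1): 0.5}
pst_predict(leaves, (0, 1))  # last bit was 0 -> leaf (0,) -> theta = 0.9
```

Because $S$ is a suffix set, exactly one leaf matches any sufficiently long context, so the scan over increasing suffix lengths terminates at a unique $\theta_s$.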

Universal coders such as CTS are evaluated by code length $l(x_{1:n}) = -\log_2 \rho(x_{1:n}) + O(1)$, where $\rho$ is the model probability, and by redundancy $r(x_{1:n}) = l(x_{1:n}) + \log_2 \mu(x_{1:n})$. Achieving $r = O(\log n)$ is the target for universal codes.
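
Concretely, with toy numbers (hypothetical probabilities, not values from the paper):

```python
import math

# Suppose the model assigns rho(x_{1:n}) = 2**-10 while the true source
# assigns mu(x_{1:n}) = 2**-8 (made-up values for illustration).
code_len = math.ceil(-math.log2(2 ** -10))   # idealized arithmetic-code length: 10 bits
redundancy = code_len + math.log2(2 ** -8)   # r = l + log2(mu) = 10 - 8 = 2 bits
```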

2. Context Tree Weighting vs. Context Tree Switching

CTW recursively combines local models at each context: for a context $c$ of length $\leq D$, define $x^c_{1:n}$ as the subsequence of $x_{1:n}$ consisting of those symbols whose preceding context has suffix $c$. The Krichevsky–Trofimov (KT) estimate for such a context is $\xi_{KT}(x^c_{1:k})$, a Bayesian mixture over Bernoulli parameters.
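
In sequential form, the KT estimator predicts a 1 with probability $(b + 1/2)/(a + b + 1)$ after seeing $a$ zeros and $b$ ones, and $\xi_{KT}$ is the product of these one-step predictions. A minimal illustrative sketch in Python, using exact rational arithmetic:

```python
from fractions import Fraction

def kt_probability(bits):
    """KT block probability: the product of sequential KT predictions,
    where the next symbol is 1 w.p. (ones + 1/2) / (zeros + ones + 1)."""
    prob, zeros, ones = Fraction(1), 0, 0
    for b in bits:
        p_one = Fraction(2 * ones + 1, 2 * (zeros + ones + 1))
        prob *= p_one if b == 1 else 1 - p_one
        ones += b
        zeros += 1 - b
    return prob
```

For instance, `kt_probability([0, 1])` is 1/8; since the estimate depends only on the counts, any permutation of the bits yields the same probability.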

The standard CTW recursion:

$$\mathrm{ctw}_D^c(x_{1:n}) = \frac{1}{2}\, \xi_{KT}(x^c_{1:n}) + \frac{1}{2}\, \mathrm{ctw}_{D-1}^{0c}(x_{1:n}) \cdot \mathrm{ctw}_{D-1}^{1c}(x_{1:n})$$

CTS generalizes this by switching between two "experts" at each node, using a time-varying mixture rather than a fixed weight scheme:

  • Expert 0: the memoryless KT predictor $\xi_{KT}$
  • Expert 1: the split-context predictor, whose one-step predictive gain is:

$$z_D^c(x_n \mid x_{<n}) = \frac{\mathrm{cts}_{D-1}^{0c}(x_{1:n})}{\mathrm{cts}_{D-1}^{0c}(x_{<n})} \cdot \frac{\mathrm{cts}_{D-1}^{1c}(x_{1:n})}{\mathrm{cts}_{D-1}^{1c}(x_{<n})}$$

CTS maintains two nonnegative weights $k_c$ and $s_c$ per context $c$, summing to $\mathrm{cts}_D^c(x_{1:n})$, and a global or local sequence of switch rates, e.g., $\alpha_n^c = 1/n$ for all nodes.

3. Algorithmic Recursion and Update Scheme

At each symbol $x_n$, CTS traverses the context tree from root to leaf along the context $c = \phi_D(x_{<n})$. The update scheme is as follows:

$$\begin{aligned} \mathrm{cts}_D^c(x_{1:n}) &\gets k_c \cdot \xi_{KT}(x^c_n \mid x^c_{<n}) + s_c \cdot z_D^c(x_n \mid x_{<n}) \\ k_c &\gets \alpha_{n+1}^c\, \mathrm{cts}_D^c(x_{1:n}) + (1 - 2\alpha_{n+1}^c)\, k_c\, \xi_{KT}(x^c_n \mid x^c_{<n}) \\ s_c &\gets \alpha_{n+1}^c\, \mathrm{cts}_D^c(x_{1:n}) + (1 - 2\alpha_{n+1}^c)\, s_c\, z_D^c(x_n \mid x_{<n}) \end{aligned}$$

  • Base cases: for $D = 0$, $\mathrm{cts}_0^c(x_{1:n}) \equiv \xi_{KT}(x^c_{1:n})$; new nodes are initialized with $k_c = s_c = 1/2$.

CTS thus implements an online mixture over "stay-with-expert" or "switch-expert" decision paths at each node, allowing the effective model to adapt context depth and composition dynamically. The per-symbol time cost is $O(D)$, with at most $O(D)$ new storage per symbol.

4. Theoretical Properties and Redundancy Bounds

CTS matches the computational complexity of CTW: $O(nD)$ total update time and $O(nD)$ worst-case space. It provides rigorous redundancy bounds for any binary stationary source representable as a PST of depth at most $D$.

Let $(S, \Theta_S)$ be a PST of depth at most $D$, with $|S|$ leaves and maximum context length $d(S)$. The cumulative code length satisfies:

$$-\log_2 \mathrm{cts}_D(x_{1:n}) \leq \Gamma_D(S) + [d(S)+1] \log_2 n + |S|\, \gamma(n/|S|) - \log_2 \Pr(x_{1:n} \mid S, \Theta_S)$$

where:

  • $\Gamma_D(S)$ is the CTW structure penalty,
  • $\gamma(k) = \frac{1}{2}\log_2 k + 1$,
  • $-\log_2 \Pr(x_{1:n} \mid S, \Theta_S)$ is the log-loss of tree $S$ with parameters $\Theta_S$.

Thus, wrapped in an arithmetic coder, the total redundancy is

$$\Gamma_D(S) + [d(S)+1] \log_2 n + |S|\, \gamma(n/|S|) + 2$$

matching CTW up to an additive $[d(S)+1] \log_2 n$ term, i.e., $O(\log n)$ redundancy for fixed $D$.
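
To get a feel for the scale of this bound, one can plug in numbers. The sketch below assumes the standard CTW structure-cost bound $\Gamma_D(S) \leq 2|S| - 1$; the specific PST is hypothetical.

```python
import math

def gamma(k):
    """gamma(k) = (1/2) log2 k + 1, the per-leaf parameter cost."""
    return 0.5 * math.log2(k) + 1

def redundancy_bound(n, leaves, d_s, structure_bits):
    """Evaluate Gamma_D(S) + [d(S)+1] log2 n + |S| gamma(n/|S|) + 2, in bits."""
    return structure_bits + (d_s + 1) * math.log2(n) + leaves * gamma(n / leaves) + 2

# A hypothetical PST with |S| = 8 leaves and d(S) = 3, over n = 2**20 bits,
# taking Gamma_D(S) = 2*8 - 1 = 15 (assumed structure-cost bound):
bits = redundancy_bound(2 ** 20, 8, 3, 15)   # 15 + 80 + 76 + 2 = 173 bits
```

At this length the bound works out to under 0.0002 bits of redundancy per symbol, illustrating the $O(\log n)$ behavior.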

5. Empirical Evaluation and Comparative Performance

CTS and CTW were evaluated on the 14-file Calgary Corpus, with compression measured in average bits/byte using binary arithmetic coding. Several algorithmic variants were tested, with and without enhancements such as count-halving, binary decomposition, and zero-redundancy estimators.

Corpus    CTW (D=48)   CTS (D=48)   CTW* (D=48)   CTS* (D=48)
bib          2.25         2.23         1.83          1.79
book1        2.31         2.32         2.18          2.19
geo          5.01         5.05         4.53          4.18
  • Unenhanced CTS achieves up to 7% smaller compressed size than CTW, and is never more than 1% worse. Enhanced CTS* with $D = 48$ provides up to 8% better compression than CTW*. Increasing the tree depth to $D = 160$ with enhancements reduces bits/byte further (e.g., 1.31 on "trans").
  • Weighted averages: PPM* (2.09), CTW* (1.99), PPMZ (1.93), CTS* (1.93), Deplump PPM variant (1.89). CTS* thereby matches PPMZ and closely approaches the leading Deplump variant, while retaining universality and provable $O(nD)$ complexity (Veness et al., 2011).

6. Context and Significance

CTS broadens the model class over which effective mixture prediction is performed relative to CTW, incorporating arbitrary switching between context segmentation and memoryless prediction at each node. By preserving computational and asymptotic efficiency, and by providing explicit redundancy guarantees, CTS is a robust universal coding method for binary, stationary sources. Its empirical advantage is consistently demonstrated on standard benchmarks, and its theoretical regime includes all $D$-Markov sources representable by PSTs of bounded depth. This suggests CTS is well suited for modern applications where adaptivity, universality, and provable performance are critical, such as high-throughput sequence modeling and online compression (Veness et al., 2011).

References

  • Veness, J., Ng, K. S., Hutter, M., & Bowling, M. (2011). Context Tree Switching.