
MC-AIXI-CTW: A Practical AIXI Approximation

Updated 7 April 2026
  • The paper introduces a practical approximation to AIXI by combining ρ-UCT Monte Carlo planning with action-conditional CTW for Bayesian sequence prediction.
  • MC-AIXI-CTW is defined over bounded-depth prediction suffix trees, enabling near Bayes-optimal decision-making in partially observable and non-Markovian settings.
  • Empirical benchmarks demonstrate its robust performance across diverse domains, including POMDP mazes, adversarial games, and sequential decision problems.

MC-AIXI-CTW is a computationally feasible, general reinforcement learning agent that provides an explicit, practical approximation to the theoretical AIXI agent via Monte-Carlo Tree Search (MCTS) and an agent-specific extension of the Context Tree Weighting (CTW) algorithm. The MC-AIXI-CTW construction enables Bayes-optimal planning over a restricted but rich model class—bounded-depth prediction suffix trees—paired with efficient, anytime approximations to sequential decision-making in unknown, partially observable, and possibly non-Markovian environments. The framework is built on two algorithmic components: a history-based variant of Upper Confidence Trees (UCT) for planning (ρ-UCT), and a factored, action-conditional CTW mixture for online Bayesian sequence prediction. MC-AIXI-CTW was introduced and analyzed by Veness, Ng, Hutter, and colleagues, with comprehensive empirical benchmarking on diverse domains (0909.0801, Veness et al., 2010).

1. Theoretical Basis: From AIXI to MC-AIXI-CTW

The AIXI agent formalizes optimal sequential decision-making by maximizing expected reward under a Solomonoff universal prior over all computable environment models. The AIXI policy for horizon $m$ is defined as

$$a_t^* = \arg\max_{a_{t}} \sum_{x_{t}} \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \left[\sum_{i=1}^{m} r_{t+i}\right] \cdot \xi_U(x_{1:t+m} \mid a_{1:t+m}),$$

where $\xi_U$ is the Solomonoff mixture

$$\xi_U(x_{1:n} \mid a_{1:n}) = \sum_{\nu} 2^{-K(\nu)} \nu(x_{1:n} \mid a_{1:n}),$$

with $K(\nu)$ denoting the Kolmogorov complexity of environment $\nu$.

Direct computation is infeasible: it requires marginalizing over all computable semimeasures and running a full-width expectimax planner. MC-AIXI-CTW introduces two tractable approximations:

  • Prediction: The universal mixture $\xi_U$ is approximated by a Bayesian mixture over all bounded-depth prediction suffix trees (PSTs), via action-conditional CTW.
  • Planning: The full expectimax tree is replaced by ρ-UCT, a variant of MCTS employing UCB exploration and simulating only sampled trajectories relevant to reward maximization (Veness et al., 2010, 0909.0801).

2. Algorithmic Components

2.1 Monte Carlo Tree Search: ρ-UCT

The ρ-UCT algorithm maintains a partial search tree over histories, alternating between decision and chance nodes:

  • Selection: At decision nodes, actions $a$ are selected:

$$a^* = \arg\max_{a \in A} \left( \frac{\hat{V}(ha)}{m(\beta-\alpha)} + C \sqrt{\frac{\ln T(h)}{T(ha)}} \right)$$

where $\hat{V}(ha)$ is the empirical mean return, $T(ha)$ is the visit count, $m(\beta-\alpha)$ is the normalized reward range, and $C$ controls exploration.

  • Expansion: Unvisited child nodes are created upon first visit.
  • Simulation: Rollouts employ a default policy (typically random) to generate percept and reward samples from the generative model $\rho$ up to depth $m$.
  • Backpropagation: Returns are recursively averaged along the simulation path.

This process is anytime and computationally efficient, and its action selection approaches the optimum with respect to the model as the planning budget (number of simulations) grows (Veness et al., 2010, 0909.0801).
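
The selection rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function name `select_action`, the dictionary-based bookkeeping, the default `C = sqrt(2)`, and the convention of trying unvisited actions first are all assumptions made for the example.

```python
import math

def select_action(actions, visits, mean_return, parent_visits,
                  m, alpha, beta, C=math.sqrt(2)):
    """Pick an action at a decision node with a UCB rule of the form shown above.

    visits[a]      -- T(ha): how often action a was tried from history h
    mean_return[a] -- V_hat(ha): empirical mean return after taking a
    parent_visits  -- T(h): how often the decision node itself was visited
    m              -- search horizon; alpha, beta -- min and max instantaneous reward
    """
    # Assumption: any action that has never been tried is selected first.
    for a in actions:
        if visits.get(a, 0) == 0:
            return a

    span = m * (beta - alpha)  # normalizes returns into [0, 1]

    def score(a):
        exploit = mean_return[a] / span
        explore = C * math.sqrt(math.log(parent_visits) / visits[a])
        return exploit + explore

    return max(actions, key=score)
```

With equal visit counts the rule reduces to picking the highest mean return; a rarely visited action with a slightly lower mean can still win through the exploration bonus.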

2.2 Bayesian Sequence Prediction: Action-Conditional CTW

Action-conditional CTW enables online Bayesian averaging over the class of all binary PSTs of depth at most $D$. Key mechanisms:

  • Prediction Suffix Trees (PSTs): Each PST specifies a mapping from recent context bits of the action-percept sequence to Bernoulli distributions for the next bit.
  • Krichevsky–Trofimov Estimator: Each leaf context maintains counts $(a, b)$ of the zeros and ones observed in that context and produces:

$$\Pr\nolimits_{kt}(1 \mid a, b) = \frac{b + 1/2}{a + b + 1}, \qquad \Pr\nolimits_{kt}(0 \mid a, b) = \frac{a + 1/2}{a + b + 1}$$

  • CTW Recursive Mixture: For any context tree node $n$, with $\Pr_{kt}^{n}$ the KT probability of the bits observed in context $n$ and $n0$, $n1$ its children,

$$P_w^{n} = \begin{cases} \Pr_{kt}^{n} & \text{if } n \text{ is at depth } D, \\ \tfrac{1}{2} \Pr_{kt}^{n} + \tfrac{1}{2} P_w^{n0} P_w^{n1} & \text{otherwise,} \end{cases}$$

producing at the root, $P_w^{\epsilon}$, which is the Bayesian mixture over all PSTs up to depth $D$, each weighted by a structural prior penalty (Veness et al., 2010, 0909.0801).

  • Factoring for Non-Binary Alphabets: For larger alphabets, actions and percepts are encoded as bit-strings, and each bit is modelled by a separate context tree (properly depth-shifted to maintain conditional dependencies).
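
As a rough illustration of the KT estimator and the recursive mixture, the following Python sketch implements plain binary CTW (no action conditioning and no factoring). The class and function names are hypothetical, probabilities are kept in log space for numerical stability, and padding the initial context with zeros is an assumption of this sketch rather than something specified above.

```python
import math

class CTWNode:
    def __init__(self):
        self.counts = [0, 0]   # [a, b]: zeros and ones seen in this context
        self.log_kt = 0.0      # log KT probability of the bits seen here
        self.log_pw = 0.0      # log weighted (mixture) probability
        self.children = {}     # context bit -> CTWNode, created lazily

def kt_prob(counts, bit):
    """KT predictive probability of `bit` given counts [a, b]."""
    return (counts[bit] + 0.5) / (counts[0] + counts[1] + 1.0)

def log_add(x, y):
    """Return log(exp(x) + exp(y)) without overflow."""
    hi, lo = max(x, y), min(x, y)
    return hi + math.log1p(math.exp(lo - hi))

class CTW:
    def __init__(self, depth):
        self.depth = depth
        self.root = CTWNode()
        self.history = []

    def update(self, bit):
        # Context: the last `depth` bits, most recent first.
        # Assumption: missing early context is padded with zeros.
        ctx = (self.history[::-1] + [0] * self.depth)[:self.depth]
        path = [self.root]
        for c in ctx:
            path.append(path[-1].children.setdefault(c, CTWNode()))
        # Update from the deepest node back to the root.
        for d in range(self.depth, -1, -1):
            node = path[d]
            node.log_kt += math.log(kt_prob(node.counts, bit))
            node.counts[bit] += 1
            if d == self.depth:
                node.log_pw = node.log_kt
            else:
                # Children that do not exist yet have seen no data (probability 1).
                kids = sum(ch.log_pw for ch in node.children.values())
                node.log_pw = math.log(0.5) + log_add(node.log_kt, kids)
        self.history.append(bit)

    def log_prob(self):
        """Log mixture probability of the whole sequence seen so far."""
        return self.root.log_pw
```

Each update touches only the `depth + 1` nodes on the current context path, which is where the per-bit cost linear in the depth comes from.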

3. Integrated Agent Cycle and Key Equations

At each cycle, MC-AIXI-CTW:

  1. Encodes the current history $h$ as interleaved action-percept bits.
  2. Invokes ρ-UCT to simulate a fixed budget of trajectories from $h$ using action-conditional CTW as the sampled generative model $\rho$ for percepts.
  3. Picks the empirically best action returned by UCB planning.
  4. Executes this action, observes the new percept, and updates the CTW trees accordingly (action bits appended, percept bits update the KT estimators).
  5. Prepares for the next cycle, repeating the above.

The agent thus implements an anytime, approximately Bayes-optimal planning loop over a class of bounded-memory, computable models (Veness et al., 2010, 0909.0801).
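
The cycle above might be wired together roughly as follows. This is a structural sketch under stated assumptions, not the published agent: `HistoryModel` and `plan` are hypothetical stand-ins (the model only records bits, and the planner returns a random action where ρ-UCT would run), and actions, observations, and rewards are assumed to fit in one bit each.

```python
import random

class HistoryModel:
    """Stand-in for the action-conditional CTW model (hypothetical interface)."""
    def __init__(self):
        self.history = []

    def update_action(self, action_bits):
        # Action bits only extend the context; they are not predicted.
        self.history.extend(action_bits)

    def update_percept(self, percept_bits):
        # A real model would also update the KT counts along each context path.
        self.history.extend(percept_bits)

def encode(value, width):
    """Fixed-width, most-significant-bit-first encoding of a non-negative int."""
    return [(value >> i) & 1 for i in reversed(range(width))]

def plan(model, actions, simulations, rng):
    """Placeholder for rho-UCT: a real planner would run `simulations` rollouts
    through `model` and return the empirically best action."""
    return rng.choice(actions)

def run_agent(env_step, actions, cycles, simulations=100, seed=0):
    rng = random.Random(seed)
    model = HistoryModel()
    total_reward = 0
    for _ in range(cycles):
        action = plan(model, actions, simulations, rng)   # steps 1-3: encode, plan, pick
        observation, reward = env_step(action)            # step 4: act and observe
        model.update_action(encode(action, 1))
        model.update_percept(encode(observation, 1) + encode(reward, 1))
        total_reward += reward                            # step 5: next cycle
    return total_reward, model
```

For example, `run_agent(lambda a: (a, 1), [0, 1], cycles=5)` runs five cycles against a toy environment that echoes the action as its observation and always pays reward 1.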

4. Computational Complexity and Resource Requirements

  • CTW update: $O(D)$ per percept bit, where $D$ is the maximum context depth, with total per-cycle cost $O(D(l_A + l_X))$ where $l_A$/$l_X$ are lengths in bits of action and percept encodings, respectively.
  • Generative sampling: Sampling one percept costs $O(D\, l_X)$; rolling out $m$ steps is $O(m D (l_A + l_X))$.
  • Monte Carlo planning: Planning cost is linear in the number of simulations per real cycle, each simulation costing one rollout.
  • Space requirements: Action-conditional CTW trees grow lazily, with $O(t\, D\, l_X)$ nodes after $t$ cycles; the UCT tree grows with the number of simulations.
  • Concurrency: Simulations are trivially parallelizable as ρ-UCT is a classical MCTS scheme (Veness et al., 2010, 0909.0801).

5. Empirical Results and Comparative Performance

MC-AIXI-CTW has been evaluated on a range of partially observable, stochastic, non-tabular environments, including:

  • Cheese Maze (POMDP maze)
  • Tiger problem (classic POMDP)
  • 4×4 Grid navigation
  • TicTacToe (self-play)
  • Biased Rock-Paper-Scissors (adversarial)
  • Kuhn Poker (imperfect information)
  • Partially observable Pacman (Veness et al., 2010, 0909.0801)

Empirical findings include:

  • Near-optimal average reward in problems with known optimal policies, reached after sufficient interaction cycles.
  • Outperformance of Lempel–Ziv-based competitor (Active-LZ) and U-Tree (splitter-based), both in learning speed and achievable asymptotic reward.
  • Robustness to planning budget, with hundreds to low thousands of simulations sufficing for small to moderate domains.

Resource usage, learning curves, and convergence trajectories consistently support the efficacy of a joint MCTS-CTW approach in environments exhibiting significant partial observability and memory requirements (Veness et al., 2010, 0909.0801).

6. Limitations and Prospects for Extension

Known limitations include:

  • Restriction to the class of bounded-depth PSTs limits expressivity for richly structured, high-dimensional perceptual streams.
  • Planning horizon $m$ constrains information-gathering and the capacity to plan for distant-delayed rewards.
  • Outer exploration is often needed in practical settings, requiring strategies such as $\epsilon$-greedy or softmax action selection layered on top of the ρ-UCT core.
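
One way such an outer exploration layer could look is an ε-greedy wrapper around whatever action the planner returns. The function name and signature below are hypothetical, and any decay schedule for `epsilon` is left to the caller.

```python
import random

def epsilon_greedy(planned_action, actions, epsilon, rng=random):
    """With probability `epsilon` return a uniformly random action,
    otherwise return the planner's choice unchanged."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return planned_action
```

Setting `epsilon` to 0 recovers the pure planner, while a value that shrinks over time moves the agent from exploration toward exploitation.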

Identified directions for improvement comprise:

  • Generalization of context selection via predicates or feature-based context trees to increase model expressivity.
  • Model composition by convex mixture with other efficient predictors such as Lempel–Ziv or hand-crafted experts.
  • Incorporation of rollout policy learning to boost MCTS efficiency (e.g., via bandit learning or CTW on rollouts).
  • Extending CTW methodology to real-valued and high-dimensional percepts via quantization or mixture models.
  • Hardware-based scaling through parallel simulation or GPU acceleration (0909.0801).

7. Context and Impact

MC-AIXI-CTW represents the first practically computable, sample-efficient agent descending from the theoretical AIXI ideal. Its construction combines rigorous Bayesian learning via action-conditional CTW and powerful, incremental planning via MCTS (ρ-UCT), yielding provable consistency in finite-memory models and strong empirical performance on tasks inaccessible to classic tabular or parametric RL approaches. The architecture established a foundation for subsequent research in general reinforcement learning under agent ignorance, and remains a central algorithmic reference for approaches seeking to bridge the gap between universal optimality and feasible computation (Veness et al., 2010, 0909.0801).
