MC-AIXI-CTW: A Practical AIXI Approximation
- The paper introduces a practical approximation to AIXI by combining ρ-UCT Monte Carlo planning with action-conditional CTW for Bayesian sequence prediction.
- MC-AIXI-CTW is defined over bounded-depth prediction suffix trees, enabling near Bayes-optimal decision-making in partially observable and non-Markovian settings.
- Empirical benchmarks demonstrate its robust performance across diverse domains, including POMDP mazes, adversarial games, and sequential decision problems.
MC-AIXI-CTW is a computationally feasible, general reinforcement learning agent that provides an explicit, practical approximation to the theoretical AIXI agent via Monte-Carlo Tree Search (MCTS) and an agent-specific extension of the Context Tree Weighting (CTW) algorithm. The MC-AIXI-CTW construction enables Bayes-optimal planning over a restricted but rich model class—bounded-depth prediction suffix trees—paired with efficient, anytime approximations to sequential decision-making in unknown, partially observable, and possibly non-Markovian environments. The framework is built on two algorithmic components: a history-based variant of Upper Confidence Trees (UCT) for planning (ρ-UCT), and a factored, action-conditional CTW mixture for online Bayesian sequence prediction. MC-AIXI-CTW was introduced and analyzed by Veness, Ng, Hutter, and colleagues, with comprehensive empirical benchmarking on diverse domains (0909.0801, Veness et al., 2010).
1. Theoretical Basis: From AIXI to MC-AIXI-CTW
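The AIXI agent formalizes optimal sequential decision-making by maximizing expected reward under a Solomonoff universal prior over all computable environment models. Writing $x_t = o_t r_t$ for the percept (observation and reward) at cycle $t$, the AIXI policy for horizon $m$ is defined as:
$$a_t^{*} = \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_{t+m}} \sum_{o_{t+m} r_{t+m}} \left[ r_t + \cdots + r_{t+m} \right] \xi_U(x_{1:t+m} \mid a_{1:t+m})$$
where $\xi_U$ is the Solomonoff mixture:
$$\xi_U(x_{1:n} \mid a_{1:n}) = \sum_{\rho \in \mathcal{M}_U} 2^{-K(\rho)} \rho(x_{1:n} \mid a_{1:n})$$
with $K(\rho)$ denoting the Kolmogorov complexity of environment $\rho$ and $\mathcal{M}_U$ the class of candidate environments.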
Direct computation is infeasible: it requires marginalizing over all computable semimeasures, weighted by an incomputable Kolmogorov complexity, and a full-width expectimax search. MC-AIXI-CTW introduces two tractable approximations:
- Prediction: The universal mixture is approximated by a Bayesian mixture over all bounded-depth PSTs, via action-conditional CTW.
- Planning: The full expectimax tree is replaced by ρ-UCT, a variant of MCTS employing UCB exploration and simulating only sampled trajectories relevant to reward maximization (Veness et al., 2010, 0909.0801).
2. Algorithmic Components
2.1 Monte Carlo Tree Search: ρ-UCT
The ρ-UCT algorithm maintains a partial search tree over histories, alternating between decision and chance nodes:
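- Selection: At decision nodes, actions are selected by the UCB rule:
$$a_{UCB}(h) = \arg\max_{a \in \mathcal{A}} \left\{ \frac{1}{m(\beta - \alpha)} \hat{V}(ha) + C \sqrt{\frac{\log T(h)}{T(ha)}} \right\}$$
where $\hat{V}(ha)$ is the empirical mean return, $T(\cdot)$ is the visit count, $m(\beta - \alpha)$ is the reward range used to normalize returns (remaining horizon $m$, per-step rewards in $[\alpha, \beta]$), and $C$ controls exploration. Actions with $T(ha) = 0$ are tried first.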
- Expansion: Unvisited child nodes are created upon first visit.
- Simulation: Rollouts employ a default policy (typically random) to generate percept and reward samples from the generative model $\rho$ up to depth $m$.
- Backpropagation: Returns are recursively averaged along the simulation path.
This process is anytime and computationally efficient, providing near-optimal action selection when the planning budget (number of simulations) is sufficient (Veness et al., 2010, 0909.0801).
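The selection and backpropagation steps above can be sketched for a single decision node as follows. This is a minimal illustration, not the authors' implementation: the class and function names, the default exploration constant, and the random tie-breaking among untried actions are all assumptions made for the example.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class ActionStats:
    """Statistics for one action at a decision node (hypothetical name)."""
    visits: int = 0            # visit count of the action at this node
    mean_return: float = 0.0   # empirical mean return of the action


@dataclass
class DecisionNode:
    """One history node of the search tree (hypothetical name)."""
    visits: int = 0
    children: dict = field(default_factory=dict)  # action -> ActionStats


def select_action(node, actions, horizon, reward_range, c=math.sqrt(2), rng=random):
    """Try unvisited actions first; otherwise maximise the UCB score.

    The default c is an arbitrary choice for this sketch.
    """
    untried = [a for a in actions
               if a not in node.children or node.children[a].visits == 0]
    if untried:
        return rng.choice(untried)

    def score(a):
        stats = node.children[a]
        # Normalise the mean return by horizon * per-step reward range.
        exploit = stats.mean_return / (horizon * reward_range)
        explore = c * math.sqrt(math.log(node.visits) / stats.visits)
        return exploit + explore

    return max(actions, key=score)


def backup(node, action, ret):
    """Fold one sampled return into the running averages."""
    stats = node.children.setdefault(action, ActionStats())
    stats.visits += 1
    stats.mean_return += (ret - stats.mean_return) / stats.visits
    node.visits += 1
```

A full planner would call `select_action` while descending the tree, run a rollout from the first unvisited node, and call `backup` on every node along the path with the return observed from that node onward.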
2.2 Bayesian Sequence Prediction: Action-Conditional CTW
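Action-conditional CTW enables online Bayesian averaging over the class of all binary PSTs of depth at most $D$. Key mechanisms:
- Prediction Suffix Trees (PSTs): Each PST, with tree structure $M$ and leaf parameters $\Theta$, specifies a mapping from recent context bits of the action-percept sequence to Bernoulli distributions for the next bit.
- Krichevsky–Trofimov Estimator: Each leaf context maintains counts $(a, b)$ of the zeros and ones seen so far and produces:
$$\Pr_{kt}(Y_{a+b+1} = 1 \mid y_{1:a+b}) = \frac{b + 1/2}{a + b + 1}$$
- CTW Recursive Mixture: For any context tree node $n$,
$$P_w^{n} = \begin{cases} \Pr_{kt}(h_n) & \text{if } n \text{ is a leaf (depth } D\text{)} \\ \tfrac{1}{2} \Pr_{kt}(h_n) + \tfrac{1}{2} P_w^{n0} P_w^{n1} & \text{otherwise} \end{cases}$$
where $h_n$ is the subsequence of bits observed in context $n$, producing at the root, $P_w^{\epsilon} = \sum_{M \in \mathcal{C}_D} 2^{-\Gamma_D(M)} \Pr(x_{1:t} \mid M, a_{1:t})$, which is the Bayesian mixture over all PSTs up to depth $D$, each weighted by a structural prior penalty $2^{-\Gamma_D(M)}$ (Veness et al., 2010, 0909.0801).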
- Factoring for Non-Binary Alphabets: For larger alphabets, actions and percepts are encoded as bit-strings, and each bit is modelled by a separate context tree (properly depth-shifted to maintain conditional dependencies).
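To make the KT estimator and the recursive mixture concrete, the following is a minimal sketch of plain binary CTW, without the action-conditional and factored extensions. It works in log space to avoid underflow. The class names, the zero-padded initial context, and the lazy child creation are assumptions of this example rather than details taken from the paper.

```python
import math


def log_add(x, y):
    """Return log(exp(x) + exp(y)) without leaving log space."""
    hi = max(x, y)
    return hi + math.log(math.exp(x - hi) + math.exp(y - hi))


class Node:
    """One context-tree node (hypothetical name)."""

    def __init__(self):
        self.counts = [0, 0]   # zeros and ones seen in this context
        self.log_kt = 0.0      # log KT probability of those bits
        self.log_pw = 0.0      # log weighted probability of those bits
        self.children = {}     # context bit -> Node, created lazily


class CTW:
    """Binary context tree weighting over contexts of a fixed depth."""

    def __init__(self, depth):
        self.depth = depth
        self.root = Node()
        self.history = [0] * depth  # assumption: zero-padded initial context

    def update(self, bit):
        # Walk from the root along the most recent bits, newest first.
        context = self.history[-1:-self.depth - 1:-1]
        path = [self.root]
        for c in context:
            path.append(path[-1].children.setdefault(c, Node()))
        # Update KT and weighted probabilities from the leaf back to the root.
        for d in range(self.depth, -1, -1):
            node = path[d]
            total = node.counts[0] + node.counts[1]
            node.log_kt += math.log((node.counts[bit] + 0.5) / (total + 1))
            node.counts[bit] += 1
            if d == self.depth:
                node.log_pw = node.log_kt
            else:
                # Children never visited have probability 1 (log 0), so they drop out.
                kids = sum(ch.log_pw for ch in node.children.values())
                node.log_pw = math.log(0.5) + log_add(node.log_kt, kids)
        self.history.append(bit)

    def log_prob(self):
        """Log of the mixture probability of all bits seen so far."""
        return self.root.log_pw
```

The probability of the next bit can be obtained as the ratio of the root probability after a tentative update to the root probability before it. An agent that samples percepts during planning also needs a matching revert operation, which is omitted here.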
3. Integrated Agent Cycle and Key Equations
At each cycle, MC-AIXI-CTW:
- Encodes the current history $h$ as interleaved action-percept bits.
- Invokes ρ-UCT to simulate $N$ trajectories from $h$ using action-conditional CTW as the sampled generative model $\rho$ for percepts.
- Picks the empirically best action returned by UCB planning.
- Executes this action, observes the new percept, and updates the CTW trees accordingly (action bits appended, percept bits update the KT estimators).
- Prepares for the next cycle, repeating the above.
The agent thus implements an anytime, approximately Bayes-optimal planning loop over a class of bounded-memory, computable models (Veness et al., 2010, 0909.0801).
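The encoding step of the cycle can be illustrated with a small sketch. The fixed-width, most-significant-bit-first encoding and the observation-before-reward ordering are assumptions chosen for this example, and all function names are hypothetical.

```python
def encode(value, n_bits):
    """Fixed-width bit list (most significant bit first) for an integer symbol."""
    if not 0 <= value < (1 << n_bits):
        raise ValueError(f"{value} does not fit in {n_bits} bits")
    return [(value >> i) & 1 for i in reversed(range(n_bits))]


def decode(bits):
    """Inverse of encode: rebuild the integer from its bit list."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def append_cycle(history, action, observation, reward, bits_a, bits_o, bits_r):
    """Append one agent cycle to the bit history: action bits, then percept bits."""
    history.extend(encode(action, bits_a))
    history.extend(encode(observation, bits_o))
    history.extend(encode(reward, bits_r))
    return history
```

In an agent built this way, only the percept bits would update the KT estimators, while the action bits would simply extend the context used for later predictions.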
4. Computational Complexity and Resource Requirements
- CTW update: $O(D)$ per percept bit, with total per-cycle cost $O(D(l_A + l_X))$, where $D$ is the context depth and $l_A$/$l_X$ are lengths in bits of action and percept encodings, respectively.
- Generative sampling: Sampling one percept costs $O(D\, l_X)$; rolling out $m$ steps is $O(m D\, l_X)$.
- Monte Carlo planning: $N$ simulations per real cycle yields $O(N m D\, l_X)$ planning cost.
- Space requirements: Action-conditional CTW trees grow lazily, with $O(t D\, l_X)$ nodes after $t$ cycles; UCT tree size $O(N m)$.
- Concurrency: Simulations are trivially parallelizable as ρ-UCT is a classical MCTS scheme (Veness et al., 2010, 0909.0801).
5. Empirical Results and Comparative Performance
MC-AIXI-CTW has been evaluated on a range of partially observable, stochastic, non-tabular environments, including:
- Cheese Maze (POMDP maze)
- Tiger problem (classic POMDP)
- 4×4 Grid navigation
- TicTacToe (self-play)
- Biased Rock-Paper-Scissors (adversarial)
- Kuhn Poker (imperfect information)
- Partially observable Pacman (Veness et al., 2010, 0909.0801)
Empirical findings include:
- Near-optimal average reward in problems with known optimal policies, with the number of cycles needed to converge depending on the domain.
- Better learning speed and asymptotic reward than the Lempel–Ziv-based Active-LZ and the splitter-based U-Tree.
- Robustness to planning budget, with hundreds to low thousands of simulations sufficing for small to moderate domains.
Resource usage, learning curves, and convergence trajectories consistently support the efficacy of a joint MCTS-CTW approach in environments exhibiting significant partial observability and memory requirements (Veness et al., 2010, 0909.0801).
6. Limitations and Prospects for Extension
Known limitations include:
- Restriction to the class of bounded-depth PSTs limits expressivity for richly structured, high-dimensional perceptual streams.
- The finite planning horizon $m$ constrains information-gathering and the capacity to plan for distantly delayed rewards.
- Outer exploration is often needed in practical settings, requiring strategies such as ε-greedy or softmax action selection layered on top of the ρ-UCT core.
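As an illustration of the outer exploration mentioned in the last item, a minimal ε-greedy wrapper around the planner's choice might look as follows. The function names and the geometric decay schedule are assumptions of this sketch, not settings reported in the paper.

```python
import random


def epsilon_greedy(planned_action, actions, epsilon, rng=random):
    """With probability epsilon take a uniformly random action, else the planned one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return planned_action


def decayed_epsilon(epsilon0, decay, cycle):
    """Geometric decay schedule (an assumed choice): epsilon0 * decay ** cycle."""
    return epsilon0 * decay ** cycle
```

The planner's recommendation is passed in as `planned_action`, so the wrapper stays independent of how planning is done and can be switched off by setting `epsilon` to zero.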
Identified directions for improvement comprise:
- Generalization of context selection via predicates or feature-based context trees to increase model expressivity.
- Model composition by convex mixture with other efficient predictors such as Lempel–Ziv or hand-crafted experts.
- Incorporation of rollout policy learning to boost MCTS efficiency (e.g., via bandit learning or CTW on rollouts).
- Extending CTW methodology to real-valued and high-dimensional percepts via quantization or mixture models.
- Hardware-based scaling through parallel simulation or GPU acceleration (0909.0801).
7. Context and Impact
MC-AIXI-CTW represents the first practically computable, sample-efficient agent descending from the theoretical AIXI ideal. Its construction combines rigorous Bayesian learning via action-conditional CTW and powerful, incremental planning via MCTS (ρ-UCT), yielding provable consistency in finite-memory models and strong empirical performance on tasks inaccessible to classic tabular or parametric RL approaches. The architecture established a foundation for subsequent research in general reinforcement learning under agent ignorance, and remains a central algorithmic reference for approaches seeking to bridge the gap between universal optimality and feasible computation (Veness et al., 2010, 0909.0801).