XoT Framework: RL, MCTS & LLM Integration
- Everything of Thoughts (XoT) is a hybrid decision-making and reasoning framework that integrates RL, MCTS, and LLMs to optimize performance, efficiency, and flexibility in multi-step problem solving.
- XoT employs a three-phase architecture—RL-guided pretraining, MCTS-guided search, and LLM-based synthesis—to efficiently generate high-quality, unconstrained cognitive mappings with minimal LLM queries.
- The framework overcomes the traditional performance-efficiency-flexibility trade-off seen in CoT, ToT, and GoT paradigms, achieving state-of-the-art results on challenging combinatorial tasks.
Everything of Thoughts (XoT) is a hybrid decision-making and reasoning framework that integrates reinforcement learning (RL), Monte Carlo Tree Search (MCTS), and LLMs to address the trade-offs inherent in classical thought prompting schemes. XoT is explicitly designed to simultaneously optimize performance, efficiency, and flexibility in multi-step, multi-solution problem-solving, surpassing the limitations of established paradigms such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT). Through a synergistic architecture that decouples combinatorial planning from natural language reasoning, XoT enables LLMs to produce high-quality, unconstrained cognitive mappings for complex tasks with minimal LLM interaction and native support for multi-solution scenarios (Ding et al., 2023).
1. The Penrose Triangle of Thought Generation and Motivation
XoT is motivated by a trade-off observed across existing thought-prompting schemes, best conceptualized as a "Penrose triangle": any given method can robustly realize at most two of the following:
- Performance: Reliable, accurate solution generation
- Efficiency: Minimal LLM queries (cost and latency)
- Flexibility: Ability to traverse non-linear thought structures (trees, general graphs) and to explore multi-solution thought spaces
Classical approaches exhibit the following profile:
| Paradigm | Performance | Efficiency | Flexibility |
|---|---|---|---|
| IO | ✘ | ✓ | ✘ |
| CoT | ✘–✓ | ✓ | ✘ |
| Self-Consistent CoT | ✓ | ✘ | ✘ |
| ToT | ✓ | ✘ | ✓ (tree-structured) |
| GoT | ✓ | ✘ | ✓ (general graph) |
| XoT | ✓ | ✓ | ✓ (arbitrary graph, multi-soln) |
Previous linearly-structured methods (CoT) lack flexibility and multi-solution ability, while tree- and graph-based LLM-driven search requires excessive LLM queries. XoT circumvents this by learning external models (via RL and MCTS), using them to efficiently sample high-probability latent trajectories, and only invoking the LLM a minimal number of times for natural language thought synthesis and evaluation. This breaks the prior no-free-lunch constraint observed in the field (Ding et al., 2023).
2. XoT Architecture and Algorithmic Workflow
The XoT framework comprises three tightly integrated phases: RL-guided pretraining, MCTS-guided search, and LLM-based answer synthesis and revision.
2.1 RL Pretraining and MDP Formalization
- MDP Structure: The reasoning process is modeled as a Markov Decision Process $(\mathcal{S}, \mathcal{A}, T, R)$, where each state $s_t$ encodes the partial progress on the task (e.g., remaining operations, current puzzle configuration) and each action $a_t$ encodes a concrete primitive "thought" (e.g., applying an operator, performing a state transition).
- Policy/Value Network $f_\theta$: A small, domain-adapted policy/value MLP is pre-trained via RL and self-play MCTS (in "AlphaZero" style). Given a state $s$, $f_\theta$ outputs a prior $P_\theta(\cdot \mid s)$ over valid actions and a scalar value estimate $v_\theta(s)$. The network is deliberately lightweight relative to the LLM, keeping inference cheap.
- Reward Specification: Terminal states are rewarded by task-specific criteria (e.g., +1 for a solved puzzle, minus the distance to the goal otherwise).
2.2 MCTS-Guided Inference
At inference time, the pretrained $f_\theta$ guides an MCTS that efficiently explores the latent solution space:
```python
import math

def mcts(root, f_theta, num_simulations, w):
    # f_theta-guided MCTS in the AlphaZero style. Node is assumed to expose:
    # children (dict action -> Node), visit count N, mean value Q, prior,
    # expand(priors), and backpropagate(reward) up the selected path.
    for _ in range(num_simulations):
        node = root
        # Selection: descend via PUCT while the node is already expanded
        while node.children:
            node = max(
                node.children.values(),
                key=lambda c: c.Q + w * c.prior * math.sqrt(node.N) / (1 + c.N),
            )
        # Expansion & evaluation
        if node.state.is_terminal():
            reward = node.state.reward()         # true terminal reward
        else:
            priors, value = f_theta(node.state)  # policy prior and value estimate
            node.expand(priors)                  # one child per legal action
            reward = value                       # bootstrap from the value head
        # Backpropagation: update Q and N along the selected path
        node.backpropagate(reward)
```
After simulations, the highest-visit or sampled trajectories are converted into textual "thought scripts" for LLM evaluation.
2.3 LLM Integration and Revision
- Thought-to-Prompt Conversion: Each trajectory is serialized as a sequence of textual thoughts, one per action and its state transition; multiple trajectories are concatenated into a single prompt batch (for multi-solution support).
- LLM Revision Loop: The LLM is invoked to synthesize final answers, optionally identifying and marking incorrect steps for local MCTS-based resampling and revision (sketched below). Each revision iteration usually requires only one extra LLM call, significantly reducing LLM usage relative to pure ToT or GoT strategies (Ding et al., 2023).
3. Multi-Solution Capability and Unconstrained Cognitive Mapping
XoT differs substantially from classical paradigms by natively supporting the simultaneous generation and revision of multiple solutions. At the conclusion of MCTS, the visit-count distribution over actions at each step of the search allows multiple distinct, plausible trajectories to be sampled. All are then submitted to the LLM in a single interaction, yielding a batch of diverse solutions and supporting creative, open-ended reasoning by construction.
This multi-solution approach extends cognitive flexibility beyond what is possible with CoT (strictly linear, single-solution) or ToT (tree-structured, but still incurring high LLM cost). Complexity is managed by focusing on the most-visited branches, making the approach computationally tractable even as the potential solution count grows combinatorially (Ding et al., 2023).
4. Empirical Evaluation and Comparative Performance
XoT achieves state-of-the-art results on several combinatorial and multi-solution tasks. Consider the following summary of results under GPT-4, comparing XoT (with up to three collaborative MCTS-LLM revisions) against canonical baselines (Ding et al., 2023):
| Task | IO | ToT (b=3) | MCTS | XoT (1 rev) | XoT (3 rev) | Avg. LLM calls (XoT) |
|---|---|---|---|---|---|---|
| Game of 24 | 10.2% | 60.6% | 62.8% | 74.5% | 85.4% | 1.38-1.78 |
| 8-Puzzle | 1.7% | 13.5% | 51.3% | 93.3% | 95.8% | 1.48-1.61 |
| Pocket Cube | 1.1% | 19.6% | 46.4% | 77.6% | 83.6% | 1.54-2.00 |
- Key patterns: In all tasks, XoT outperforms pure MCTS, ToT, CoT, and GoT, with comparable or lower LLM call counts (<2 calls on average).
- Multi-solution metric: In multi-solution mode, XoT achieves MultiAcc ≈ 76–80% (proportion of problems with ≥1 correct answer among up to 3 solutions) with only ~2 LLM calls, vastly exceeding ToT/GoT performance at the same cost (Ding et al., 2023).
5. Integration with Diverse Reasoning and Relation to Other X-of-Thoughts Paradigms
While XoT (in the sense of Ding et al., 2023) centers on RL/MCTS-augmented multi-path reasoning, other integrated frameworks such as the "Plan, Verify and Switch" XoT (plan–reason–verify with method switching; Liu et al., 2023) employ a different strategy for harnessing the complementarity of existing reasoning styles (chain-of-thought, program-of-thought, equation-of-thought, etc.). That system constructs an outer loop of method selection, reasoning generation, and verification (active and passive), enabling dynamic switching between prompt patterns until a verified solution emerges. In contrast, XoT (Ding et al., 2023) achieves unconstrained multi-path, multi-solution exploration via MCTS and RL, followed by minimal LLM generation for answer synthesis.
Furthermore, recent advances in cross-lingual Tree-of-Thoughts extend the X-of-Thoughts paradigm to multilingual reasoning, but do not address the efficiency-flexibility-performance trilemma that XoT explicitly solves (Ranaldi et al., 2023).
6. Implementation Aspects, Limitations, and Future Directions
- RL pretraining overhead: Satisfactory XoT performance currently relies on constructing accurate domain simulators and training $f_\theta$, requiring additional engineering for each new domain.
- Model error and revision: If $f_\theta$ is misspecified, erroneous trajectories may result; in practice, the LLM revision loop catches and repairs most such errors (a correction rate above 60% is observed).
- Domain applicability: XoT is most effective on well-specified combinatorial tasks; generalization to open-ended NLP scenarios (e.g., summarization) remains an open research challenge, requiring surrogate state/action/reward models potentially realizable via LLM-based critics or meta-learning extensions.
- Research directions: Meta-learning of $f_\theta$, integration of non-MCTS planners (policy gradients, heuristic search), automated textual state/action extraction, and calibration of LLM revision strategies are ongoing topics (Ding et al., 2023).
7. Significance and Broader Context
XoT embodies a substantial methodological advance by injecting externally trained world models into LLM-driven reasoning, thereby defying the previously-accepted performance/efficiency/flexibility constraints of thought prompting. Its architecture enables creative, multi-solution cognitive mapping with a low number of LLM interactions. As a paradigm, Everything of Thoughts is both a practical solution for challenging combinatorial problem-solving and a conceptual template for future integrated planning-reasoning systems operating across diverse reasoning domains and modalities (Ding et al., 2023, Liu et al., 2023, Ranaldi et al., 2023).