
Implicit Search via Discrete Diffusion: A Study on Chess (2502.19805v1)

Published 27 Feb 2025 in cs.LG and cs.AI

Abstract: In the post-AlphaGo era, there has been a renewed interest in search techniques such as Monte Carlo Tree Search (MCTS), particularly in their application to LLMs. This renewed attention is driven by the recognition that current next-token prediction models often lack the ability for long-term planning. Is it possible to instill search-like abilities within the models to enhance their planning abilities without relying on explicit search? We propose DiffuSearch, a model that does *implicit search* by looking into the future world via discrete diffusion modeling. We instantiate DiffuSearch on a classical board game, Chess, where explicit search is known to be essential. Through extensive controlled experiments, we show DiffuSearch outperforms both the searchless and explicit search-enhanced policies. Specifically, DiffuSearch outperforms the one-step policy by 19.2% and the MCTS-enhanced policy by 14% on action accuracy. Furthermore, DiffuSearch demonstrates a notable 30% enhancement in puzzle-solving abilities compared to explicit search-based policies, along with a significant 540 Elo increase in game-playing strength assessment. These results indicate that implicit search via discrete diffusion is a viable alternative to explicit search over a one-step policy. All codes are publicly available at https://github.com/HKUNLP/DiffuSearch.

Summary

  • The paper introduces DiffuSearch, a discrete diffusion method for implicit search that achieves substantial performance gains over explicit search and searchless policies in chess, including a 19.2% action accuracy increase and a 540 Elo rating boost.
  • DiffuSearch contrasts with explicit search methods such as MCTS by using discrete diffusion to implicitly model the multi-step future states and actions exchanged between the policy and the environment, and uses this predicted future to inform the next action.
  • Using chess as a complex testbed, the study demonstrates that implicit future state generation via discrete diffusion offers a viable alternative to explicit search, potentially generalizable to other sequential decision-making tasks.

The paper introduces DiffuSearch, a novel approach to instill search-like abilities within models to enhance their planning without relying on explicit search. The method employs discrete diffusion modeling to implicitly explore future states, and it is evaluated on the game of Chess.

The key aspects of the approach and findings include:

  • The DiffuSearch model outperforms both searchless and explicit search-enhanced policies. On action accuracy, DiffuSearch outperforms the one-step policy by 19.2% and the Monte Carlo Tree Search (MCTS)-enhanced policy by 14%. Additionally, the model shows a 30% improvement in puzzle-solving abilities compared to explicit search-based policies, alongside a 540 Elo increase in game-playing strength assessment.
  • The paper contrasts explicit search via MCTS with implicit search via discrete diffusion. MCTS iteratively refines decisions by explicitly performing action selection, state evaluation, and value backup. Discrete diffusion implicitly gathers future information during future state generation to improve the next action.
  • The method represents the future using multi-step interaction information between the policy and the environment, specifically states and actions. Generating this future trajectory plays a role analogous to implicitly searching ahead.
  • The paper uses the chess-playing task as a testbed, citing its reliance on explicit search. This controlled environment allows for a deep dive into different methods of representing and learning the future. The techniques learned in this task may be generalizable to natural-language settings.
  • The alternating Markov games framework is adopted, defining a state space $S$, an action space $A$, a state transition function $f(s, a)$, and reward functions $r^0(s)$ and $r^1(s)$. Policies $p(a|s)$ and value functions $v^p(s)$ formalize the chess-playing problem.
  • Discrete diffusion models are employed, characterized by a forward Markov process $q(x_{1:T} | x_0)$ and a backward Markov process $p_\theta(x_{0:T})$. The variational lower bound $L_{vb}$ is optimized:

    $$L_{vb} = E_{q(x_0)}\Big[ D_{KL}[q(x_T | x_0) \,\|\, p(x_T)] + \sum_{t=2}^{T} E_{q(x_t | x_0)}\big[ D_{KL}[q(x_{t-1} | x_t, x_0) \,\|\, p_\theta(x_{t-1} | x_t)] \big] - E_{q(x_1 | x_0)}[\log p_\theta(x_0 | x_1)] \Big]$$

    Where:

    • $x_0$ is the initial discrete random variable.
    • $x_t$ represents the latent variables at timestep $t$.
    • $q$ and $p_\theta$ denote the forward and backward Markov processes, respectively.
    • $D_{KL}$ is the Kullback-Leibler divergence.
  • The policy distribution at state $s_i$, taking the future into account, is given by:

    $p_\theta(a_i, s_{i+1}, a_{i+1}, \ldots, s_{i+h-1}, a_{i+h-1} \mid s_i)$

    Where:

    • $h > 1$ is the future horizon.
  • Supervised training is conducted using Stockfish 16 to approximate the optimal policy $\pi^*$ with $\pi^{SF}$, obtaining actions $a^{SF}_j = \arg\max_{a_j} Q^{SF}(s_j, a_j)$. A dataset $D = \{(s_i, (a^{SF}_i, s_{i+1}, a^{SF}_{i+1}, \ldots, s_{i+h-1}, a^{SF}_{i+h-1}))\}$ is constructed, where the oracle future path alternates the highest-evaluation move for the side to move with the best opponent reply (see the data-construction sketch after this list).
  • A simplified training objective is used: the KL term $D_{KL}[q(x_{t-1} | x_t, x_0) \,\|\, p_\theta(x_{t-1} | x_t)]$ is simplified to $-\lambda_t\, \mathbb{1}[x_t \neq x_0]\, x_0^{\top} \log f(x_t; \theta)$ (see the training sketch after this list).
  • During inference, DiffuSearch would ideally take $\arg\max_{a_i} p_\theta(a_i | s_i)$, which requires marginalizing over all futures of horizon $h$ and is intractable because the search space grows exponentially. The authors therefore resort to $\arg\max_{a_i, z_i} p_\theta(a_i, z_i | s_i)$, obtained by sampling from the trained model. An easy-first decoding strategy is adopted during diffusion sampling, resetting the tokens with the lowest predictive log-likelihood back to the noise state (see the decoding sketch after this list).
  • Three metrics are used to evaluate the policies: 1) Action Accuracy; 2) Puzzle Accuracy; 3) Tournament Elo.
  • The S-ASA future representation is identified as an effective modeling paradigm, and ensuring the quality and validity of the predicted future world dynamics is crucial.
  • Proper discrete diffusion modeling enhances performance.
  • The paper analyzes prediction quality at different future steps, the impact of scaling self-attention layers, and the effect of increasing diffusion timesteps.
  • Explicit search using MCTS is compared with implicit search in DiffuSearch, illustrating the trade-offs between the two.

    The paper shows that DiffuSearch offers a viable alternative to explicit search by learning to implicitly model future states and actions, thereby improving decision-making in complex tasks.
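The following sketches make some of the steps above concrete. First, a minimal sketch of how an $h$-step oracle future path could be rolled out with Stockfish via the python-chess bindings; the engine path, depth limit, and the FEN/UCI encoding of states and actions are illustrative assumptions, not the authors' exact data pipeline.

```python
import chess
import chess.engine

def oracle_future_path(fen: str, horizon: int, engine, depth: int = 18):
    """Roll out `horizon` plies of engine best moves from a position.

    Returns (s_i, target) where target = [a_i, s_{i+1}, a_{i+1}, ...,
    s_{i+h-1}, a_{i+h-1}], with states as FEN strings and actions as UCI
    strings, mirroring the dataset entries D = {(s_i, (a_i^SF, ...))}.
    """
    board = chess.Board(fen)
    s_i = board.fen()
    target = []
    for step in range(horizon):
        if board.is_game_over():
            break
        if step > 0:
            target.append(board.fen())         # intermediate state s_{i+step}
        result = engine.play(board, chess.engine.Limit(depth=depth))
        target.append(result.move.uci())        # oracle action a_{i+step}^SF
        board.push(result.move)
    return s_i, target

if __name__ == "__main__":
    # A local Stockfish binary on PATH is assumed here.
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    s_i, future = oracle_future_path(chess.Board().fen(), horizon=4, engine=engine)
    print(s_i)
    print(future)
    engine.quit()
```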
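Next, a minimal PyTorch-style sketch of the simplified training objective for an absorbing-state discrete diffusion model over the future tokens. The `model(s_i, x_t, t)` signature, the linear noising schedule, and the reweighting choice $\lambda_t = 1/t$ are assumptions made for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, s_i, x0, mask_id, T=20):
    """One step of the simplified discrete-diffusion training loss (a sketch).

    s_i : (B, Ls) long tensor, tokenized current board state (the condition).
    x0  : (B, Lf) long tensor, tokenized oracle future (a_i, s_{i+1}, ..., a_{i+h-1}).
    The KL terms of L_vb reduce to a reweighted cross-entropy on the
    positions corrupted to the absorbing [MASK] token.
    """
    B, Lf = x0.shape
    t = torch.randint(1, T + 1, (B,), device=x0.device)       # sampled diffusion timestep
    corrupt_prob = t.float() / T                               # linear noise schedule (assumed)
    noise_mask = torch.rand(B, Lf, device=x0.device) < corrupt_prob[:, None]
    x_t = torch.where(noise_mask, torch.full_like(x0, mask_id), x0)

    logits = model(s_i, x_t, t)                                # (B, Lf, V); signature assumed
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")   # (B, Lf)

    # -lambda_t * 1[x_t != x_0] * x_0^T log f(x_t; theta), with lambda_t = 1/t assumed.
    lam = (1.0 / t.float())[:, None]
    loss = (lam * noise_mask.float() * ce).sum() / noise_mask.float().sum().clamp(min=1)
    return loss
```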
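Finally, a sketch of easy-first decoding at inference time: all future positions start as noise, and at each step the least confident predictions are reset to the mask token while the rest are kept. Reading the next action off the first slot assumes a single-token action encoding, again an illustrative simplification.

```python
import torch

@torch.no_grad()
def easy_first_decode(model, s_i, future_len, mask_id, T=20):
    """Easy-first iterative decoding (a sketch, not the released implementation).

    Start from an all-[MASK] future; at each of T steps predict every
    position, keep the most confident tokens, and reset the least confident
    ones back to the noise state, so 'easy' tokens are committed first.
    """
    B = s_i.shape[0]
    x_t = torch.full((B, future_len), mask_id, dtype=torch.long, device=s_i.device)
    for step in range(T, 0, -1):
        timestep = torch.full((B,), step, device=s_i.device)
        logits = model(s_i, x_t, timestep)              # (B, future_len, V); signature assumed
        log_probs = logits.log_softmax(dim=-1)
        conf, pred = log_probs.max(dim=-1)              # per-position confidence and argmax token
        n_remask = future_len * (step - 1) // T         # how many positions stay noised
        x_t = pred
        if n_remask > 0:
            worst = conf.argsort(dim=-1)[:, :n_remask]  # least confident positions
            x_t.scatter_(1, worst, mask_id)
    return x_t[:, 0]   # predicted next action a_i, assuming a single-token action slot
```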
