- The paper introduces DCoLT to optimize latent thinking trajectories in diffusion language models using outcome-based reinforcement learning.
- It enables non-linear, bidirectional reasoning that outperforms traditional sequential chain-of-thought methods on tasks like math and code generation.
- Empirical results from DCoLT-SEDD and DCoLT-LLaDA models show significant performance gains over baselines, validating the lateral thinking approach.
This paper introduces the Diffusion Chain of Lateral Thought (DCoLT), a novel reasoning framework designed for Diffusion LLMs (DLMs). DCoLT aims to enhance complex reasoning by treating the intermediate steps of the reverse diffusion process as a latent "thinking" trajectory. This entire trajectory is optimized using outcome-based Reinforcement Learning (RL), where rewards are given based on the correctness of the final answer, without explicit supervision for intermediate steps. This approach encourages the model to explore diverse, non-linear, and creative thought processes.
Core Concepts and Motivation:
- Lateral Thinking vs. Vertical Thinking: Traditional Chain-of-Thought (CoT) methods in autoregressive models follow a linear, sequential (vertical) thinking process. DCoLT, in contrast, enables lateral thinking in DLMs. This means reasoning can be bidirectional, non-linear (tokens generated not strictly left-to-right), and format-free (intermediate steps don't need to be grammatically perfect or complete).
- DLMs for Lateral Thinking: DLMs, which generate tokens in parallel and refine them over multiple denoising steps, are naturally suited to this lateral thinking process. Each token can attend to all others (non-causal attention), allowing for global refinement of the whole sequence.
Methodology - DCoLT Implementation:
The paper implements and evaluates DCoLT on two representative types of DLMs:
- Continuous-Time DLM (SEDD):
- SEDD (Score Entropy Discrete Diffusion) is used as the representative continuous-time DLM.
- The model learns to predict a "concrete score," which helps define a probabilistic policy for sampling tokens at each diffusion step.
- The action probability for RL, $\pi_{\theta,n}(x_n \mid x_{n-1})$, is derived from this concrete score and the transition rate matrix $Q_t$ of the diffusion process.
- The entire sequence of these probabilistic actions is reinforced based on the final outcome.
The policy for generating $x_n$ given $x_{n-1}$ is defined as
$$\pi_{\theta,n}(x_n \mid x_{n-1}) = \prod_{i=1}^{|x_n|} p_{\theta,t_n}\left(x_n^i \mid x_{n-1}\right),$$
where $p_{\theta,t_n}(x_n^i \mid x_{n-1})$ depends on the learned score $s_\theta$ and the transition rate matrix $Q_{t_{n-1}}$.
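As a rough illustration of how this factorized policy can be evaluated for RL, the sketch below computes $\log \pi_{\theta,n}(x_n \mid x_{n-1})$ from a concrete-score network and a rate matrix. The function name, its arguments, and the simple Euler-style update are assumptions made for exposition; SEDD's actual samplers (e.g., tau-leaping or analytic sampling) handle the reverse transition more carefully.

```python
import torch

def sedd_step_log_prob(score_model, Q_t, x_prev, x_next, t_prev, dt):
    """Schematic log pi_{theta,n}(x_n | x_{n-1}) for one reverse-diffusion step.

    Reverse transition weights are taken proportional to the forward rates Q_t
    reweighted by the learned concrete score s_theta, with a small Euler-style
    step of size dt. Names and the exact update rule are illustrative.
    """
    s = score_model(x_prev, t_prev)          # concrete score, shape (B, L, V)
    rates = Q_t[x_prev]                      # rates out of current tokens, (B, L, V)

    weights = rates * s * dt                                   # unnormalized move weights
    weights = weights.scatter(-1, x_prev.unsqueeze(-1), 1.0)   # self-transition mass

    probs = weights / weights.sum(dim=-1, keepdim=True)        # p_{theta,t_n}(. | x_{n-1})
    tok_logp = torch.log(probs.gather(-1, x_next.unsqueeze(-1)).squeeze(-1))

    # Factorized policy: product over positions == sum of per-token log-probs
    return tok_logp.sum(dim=-1)              # (B,)
```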
- Discrete-Time Masked DLM (LLaDA):
- LLaDA (Large Language Diffusion with mAsking) is a discrete-time masked diffusion model.
- DCoLT for LLaDA involves a two-part action at each denoising step n:
- Unmasking Policy: An Unmasking Policy Module (UPM) is introduced. The UPM predicts a ranking score $h_{\theta,n}^i$ for each masked token; based on these scores, the set $U_n$ of $K$ tokens to unmask is sampled with the Plackett-Luce model (see the sketch after this list). The UPM itself is a single transformer block that takes the hidden states and uses adaptive layer normalization to incorporate the step index $n$ and the mask indicators.
- Token Prediction Policy: Once $U_n$ is determined, LLaDA predicts the values of the tokens at those positions.
The overall policy for transitioning from $x_{n-1}$ to $x_n$ is
$$\pi_{\theta,n}(x_n \mid x_{n-1}) = \pi^{\text{unmask}}_{\theta,n}(U_n \mid x_{n-1}) \cdot \pi^{\text{token}}_{\theta,n}(x_n \mid x_{n-1}, U_n),$$
where $\pi^{\text{unmask}}_{\theta,n}$ is the Plackett-Luce probability of selecting $U_n$ and $\pi^{\text{token}}_{\theta,n}$ is the product of the prediction probabilities of the tokens at the positions in $U_n$.
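To make the unmasking policy concrete, here is a minimal PyTorch sketch of a UPM-like scorer and of Plackett-Luce sampling of the unmask set $U_n$. All module names, shapes, and hyperparameters are illustrative assumptions rather than the paper's implementation, and Gumbel-top-$K$ is used as a standard equivalent way to draw from the Plackett-Luce distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnmaskingPolicyModule(nn.Module):
    """UPM-like scorer (illustrative): one transformer block over the DLM's
    hidden states, with adaptive layer norm conditioned on the denoising step
    index and the mask indicators. Sizes are placeholders, not LLaDA's."""
    def __init__(self, d_model=4096, n_heads=32, max_steps=256):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.step_emb = nn.Embedding(max_steps, d_model)
        self.mask_emb = nn.Embedding(2, d_model)
        self.ada_ln = nn.Linear(d_model, 2 * d_model)   # produces scale & shift
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, hidden, step_idx, mask):
        h = self.block(hidden)                                     # (B, L, D)
        cond = self.step_emb(step_idx)[:, None, :] + self.mask_emb(mask.long())
        scale, shift = self.ada_ln(cond).chunk(2, dim=-1)
        h = F.layer_norm(h, h.shape[-1:]) * (1 + scale) + shift    # adaptive LN
        return self.score_head(h).squeeze(-1)                      # ranking scores, (B, L)

def sample_unmask_set(h, mask, K):
    """Draw K masked positions without replacement from the Plackett-Luce
    model defined by scores h, and return log pi^unmask of the sampled order."""
    neg_inf = torch.finfo(h.dtype).min
    scores = h.masked_fill(~mask, neg_inf)        # only masked positions compete
    gumbel = -torch.log(-torch.log(torch.rand_like(scores)))
    chosen = (scores + gumbel).topk(K, dim=-1).indices   # Gumbel-top-K == PL sample

    log_p = torch.zeros(h.shape[0], device=h.device)
    avail = scores.clone()
    for k in range(K):                            # sequential PL probability
        idx = chosen[:, k].unsqueeze(-1)
        log_p = log_p + avail.gather(-1, idx).squeeze(-1) - torch.logsumexp(avail, dim=-1)
        avail = avail.scatter(-1, idx, neg_inf)   # remove the chosen position
    return chosen, log_p
```

The token-prediction half of the step is then simply the sum of LLaDA's log-probabilities at the chosen positions, so $\log \pi_{\theta,n} = \log \pi^{\text{unmask}}_{\theta,n} + \log \pi^{\text{token}}_{\theta,n}$ for the RL objective.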
Reinforcement Learning Framework:
- Algorithm: Algorithm 1 gives the general training procedure for DCoLT: sample trajectories of thought $x_{0:N}$, compute rewards $r^g$ for the final outputs $x_N^g$, derive group-relative advantages $A^g$, and update the parameters $\theta$ by accumulating the per-step policy losses $\mathcal{L}_{\theta,n}$. GRPO is used as the RL optimization algorithm (a minimal sketch follows this list).
- Reward: A rule-based reward is used (1 for a correct final answer, 0 otherwise).
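A bare-bones sketch of this outcome-rewarded update, assuming the per-step log-probabilities of $G$ sampled trajectories have already been collected, might look as follows; clipping, KL regularization against a reference model, and other standard GRPO details are omitted.

```python
import torch

def grpo_step(traj_log_probs, answers_correct, optimizer, eps=1e-6):
    """One outcome-based RL update in the spirit of Algorithm 1.

    traj_log_probs:  (G, N) tensor, log pi_{theta,n} summed over tokens for
                     each of G sampled trajectories at each of N denoising steps.
    answers_correct: (G,) bool tensor from the rule-based answer check.
    """
    rewards = answers_correct.float()                         # 1 if correct, else 0
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)  # group-relative advantage A^g

    # Reinforce every step of the latent thinking trajectory with the same
    # outcome-based advantage; no per-step supervision is used.
    loss = -(adv.unsqueeze(-1) * traj_log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```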
Experiments and Results:
DCoLT was evaluated on math reasoning (GSM8K, MATH, Sudoku) and code generation (MBPP, HumanEval) tasks.
- DCoLT-SEDD (400M model):
- Significantly outperformed SFT-based CoT (GPT-2) and DoT (SEDD) baselines.
- Achieved 96.2% on Sudoku 4×4 (vs. 79.4% for SEDD+DoT) and 57.0% on GSM8K-Aug (vs. 53.5% for SEDD+DoT).
- Analysis (Appendix A) showed DCoLT-SEDD learned an "easy-to-hard" progressive generation strategy for Sudoku and a non-linear, sample-dependent generation order for GSM8K-Aug, unlike the more rigid patterns of baselines.
- DCoLT-LLaDA (8B model):
- Achieved state-of-the-art results among existing DLMs.
- GSM8K: 88.1% (+9.8% over LLaDA baseline)
- MATH: 44.6% (+5.7% over LLaDA baseline)
- MBPP: 51.6% (+11.4% over LLaDA baseline)
- HumanEval: 59.1% (+19.5% over LLaDA baseline)
- These results were achieved using only public data and 16 H800 GPUs; DCoLT-LLaDA outperformed other DLMs trained with SFT or RL and remained competitive with autoregressive models trained on significantly more proprietary data.
- Ablation studies confirmed the importance of the UPM and the benefit of the adaptive layer normalization within it.
- Longer generation sequence lengths at inference time (without retraining) improved performance on MATH, with further gains if fine-tuned on longer sequences.
- Analysis (Appendix B) visualized the thinking process: key information (numbers, symbols) tends to emerge early, with the grammatically connecting text filled in around it later. The UPM's ranking scores effectively guided unmasking toward more confident tokens.
Key Contributions:
- Introduction of DCoLT: A novel reasoning paradigm that leverages the reverse diffusion process for lateral thinking, optimized via outcome-based RL.
- Demonstration of Lateral Thinking: Shows that DLMs can be trained to reason non-linearly and bidirectionally, often generating key concepts before filling in details.
- UPM for Masked DLMs: A learnable Unmasking Policy Module for discrete-time masked DLMs that optimizes the order of token generation/unmasking.
- Strong Empirical Results: Significant performance improvements on complex reasoning tasks (math, code) for both continuous and discrete-time DLMs, showcasing the effectiveness and data efficiency of DCoLT.
Limitations:
- Current models are trained with limited data and compute, suggesting room for improvement with more resources.
- DCoLT has only been validated on tasks with objectively verifiable reward functions. Extending it to tasks requiring subjective rewards (via a learned reward model) is future work.
In summary, DCoLT presents a promising direction for enhancing the reasoning capabilities of diffusion LLMs by explicitly reinforcing their multi-step, non-linear generation process as a "chain of lateral thought." The method allows models to explore more diverse and flexible reasoning paths, leading to improved performance on challenging tasks.