- The paper introduces DCoLT to optimize latent thinking trajectories in diffusion language models using outcome-based reinforcement learning.
- It enables non-linear, bidirectional reasoning that outperforms traditional sequential chain-of-thought methods on tasks like math and code generation.
- Empirical results from DCoLT-SEDD and DCoLT-LLaDA models show significant performance gains over baselines, validating the lateral thinking approach.
This paper introduces the Diffusion Chain of Lateral Thought (DCoLT), a novel reasoning framework designed for Diffusion LLMs (DLMs). DCoLT aims to enhance complex reasoning by treating the intermediate steps of the reverse diffusion process as a latent "thinking" trajectory. This entire trajectory is optimized using outcome-based Reinforcement Learning (RL), where rewards are given based on the correctness of the final answer, without explicit supervision for intermediate steps. This approach encourages the model to explore diverse, non-linear, and creative thought processes.
Core Concepts and Motivation:
- Lateral Thinking vs. Vertical Thinking: Traditional Chain-of-Thought (CoT) methods in autoregressive models follow a linear, sequential (vertical) thinking process. DCoLT, in contrast, enables lateral thinking in DLMs. This means reasoning can be bidirectional, non-linear (tokens generated not strictly left-to-right), and format-free (intermediate steps don't need to be grammatically perfect or complete).
- DLMs for Lateral Thinking: DLMs, which generate tokens in parallel and refine them over multiple denoising steps, are naturally suited to this lateral thinking process. Each token can attend to all others (non-causal attention), allowing for global refinement of the whole sequence.
Methodology - DCoLT Implementation:
The paper implements and evaluates DCoLT on two representative types of DLMs:
- Continuous-Time DLM (SEDD):
- SEDD (Score Entropy Discrete Diffusion) is used as the representative continuous-time DLM.
- The model learns to predict a "concrete score," which helps define a probabilistic policy for sampling tokens at each diffusion step.
- The action probability for RL, $\pi_{\theta,n}(x_n \mid x_{n-1})$, is derived from this concrete score and the transition rate matrix $Q_t$ of the diffusion process.
- The entire sequence of these probabilistic actions is reinforced based on the final outcome.
The policy for generating $x_n$ given $x_{n-1}$ is defined as
$$\pi_{\theta,n}(x_n \mid x_{n-1}) = \prod_{i=1}^{|x_n|} p_{\theta,t_n}\left(x_n^i \mid x_{n-1}\right),$$
where $p_{\theta,t_n}(x_n^i \mid x_{n-1})$ depends on the learned score $s_\theta$ and the transition rate matrix $Q_{t_{n-1}}$.
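As a rough illustration of how this factorized policy can be evaluated for RL, the sketch below computes $\log \pi_{\theta,n}(x_n \mid x_{n-1})$ from a concrete-score network and a rate matrix. The function name, its arguments, and the simple Euler-style update are assumptions made for exposition; SEDD's actual samplers (e.g., tau-leaping or analytic sampling) handle the reverse transition more carefully.

```python
import torch

def sedd_step_log_prob(score_model, Q_t, x_prev, x_next, t_prev, dt):
    """Schematic log pi_{theta,n}(x_n | x_{n-1}) for one reverse-diffusion step.

    Reverse transition weights are taken proportional to the forward rates Q_t
    reweighted by the learned concrete score s_theta, with a small Euler-style
    step of size dt. Names and the exact update rule are illustrative.
    """
    s = score_model(x_prev, t_prev)          # concrete score, shape (B, L, V)
    rates = Q_t[x_prev]                      # rates out of current tokens, (B, L, V)

    weights = rates * s * dt                                   # unnormalized move weights
    weights = weights.scatter(-1, x_prev.unsqueeze(-1), 1.0)   # self-transition mass

    probs = weights / weights.sum(dim=-1, keepdim=True)        # p_{theta,t_n}(. | x_{n-1})
    tok_logp = torch.log(probs.gather(-1, x_next.unsqueeze(-1)).squeeze(-1))

    # Factorized policy: product over positions == sum of per-token log-probs
    return tok_logp.sum(dim=-1)              # (B,)
```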
- Discrete-Time Masked DLM (LLaDA):
- LLaDA (Large Language Diffusion with mAsking) is a discrete-time masked diffusion model.
- DCoLT for LLaDA involves a two-part action at each denoising step n:
- Unmasking Policy: An Unmasking Policy Module (UPM) is introduced. The UPM predicts a ranking score $h_{\theta,n}^i$ for each masked token; based on these scores, the set $U_n$ of $K$ tokens to unmask is sampled with the Plackett-Luce model (see the sketch after this list). The UPM itself is a single transformer block that takes the hidden states and uses adaptive layer normalization to incorporate the step index $n$ and the mask indicators.
- Token Prediction Policy: Once $U_n$ is determined, LLaDA predicts the values of the tokens at those positions.
The overall policy for transitioning from $x_{n-1}$ to $x_n$ is
$$\pi_{\theta,n}(x_n \mid x_{n-1}) = \pi^{\text{unmask}}_{\theta,n}(U_n \mid x_{n-1}) \cdot \pi^{\text{token}}_{\theta,n}(x_n \mid x_{n-1}, U_n),$$
where $\pi^{\text{unmask}}_{\theta,n}$ is the Plackett-Luce probability of selecting $U_n$ and $\pi^{\text{token}}_{\theta,n}$ is the product of the prediction probabilities of the tokens at the positions in $U_n$.
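To make the unmasking policy concrete, here is a minimal PyTorch sketch of a UPM-like scorer and of Plackett-Luce sampling of the unmask set $U_n$. All module names, shapes, and hyperparameters are illustrative assumptions rather than the paper's implementation, and Gumbel-top-$K$ is used as a standard equivalent way to draw from the Plackett-Luce distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnmaskingPolicyModule(nn.Module):
    """UPM-like scorer (illustrative): one transformer block over the DLM's
    hidden states, with adaptive layer norm conditioned on the denoising step
    index and the mask indicators. Sizes are placeholders, not LLaDA's."""
    def __init__(self, d_model=4096, n_heads=32, max_steps=256):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.step_emb = nn.Embedding(max_steps, d_model)
        self.mask_emb = nn.Embedding(2, d_model)
        self.ada_ln = nn.Linear(d_model, 2 * d_model)   # produces scale & shift
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, hidden, step_idx, mask):
        h = self.block(hidden)                                     # (B, L, D)
        cond = self.step_emb(step_idx)[:, None, :] + self.mask_emb(mask.long())
        scale, shift = self.ada_ln(cond).chunk(2, dim=-1)
        h = F.layer_norm(h, h.shape[-1:]) * (1 + scale) + shift    # adaptive LN
        return self.score_head(h).squeeze(-1)                      # ranking scores, (B, L)

def sample_unmask_set(h, mask, K):
    """Draw K masked positions without replacement from the Plackett-Luce
    model defined by scores h, and return log pi^unmask of the sampled order."""
    neg_inf = torch.finfo(h.dtype).min
    scores = h.masked_fill(~mask, neg_inf)        # only masked positions compete
    gumbel = -torch.log(-torch.log(torch.rand_like(scores)))
    chosen = (scores + gumbel).topk(K, dim=-1).indices   # Gumbel-top-K == PL sample

    log_p = torch.zeros(h.shape[0], device=h.device)
    avail = scores.clone()
    for k in range(K):                            # sequential PL probability
        idx = chosen[:, k].unsqueeze(-1)
        log_p = log_p + avail.gather(-1, idx).squeeze(-1) - torch.logsumexp(avail, dim=-1)
        avail = avail.scatter(-1, idx, neg_inf)   # remove the chosen position
    return chosen, log_p
```

The token-prediction half of the step is then simply the sum of LLaDA's log-probabilities at the chosen positions, so $\log \pi_{\theta,n} = \log \pi^{\text{unmask}}_{\theta,n} + \log \pi^{\text{token}}_{\theta,n}$ for the RL objective.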
Reinforcement Learning Framework:
- Algorithm: Algorithm 1 gives the general training procedure for DCoLT: sample trajectories of thought $x_{0:N}$, compute rewards $r^g$ for the final outputs $x_N^g$, derive group-relative advantages $A^g$, and update the parameters $\theta$ by accumulating the per-step policy losses $\mathcal{L}_{\theta,n}$. GRPO is used as the RL optimization algorithm (a minimal sketch follows this list).
- Reward: A rule-based reward is used (1 for a correct final answer, 0 otherwise).
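A bare-bones sketch of this outcome-rewarded update, assuming the per-step log-probabilities of $G$ sampled trajectories have already been collected, might look as follows; clipping, KL regularization against a reference model, and other standard GRPO details are omitted.

```python
import torch

def grpo_step(traj_log_probs, answers_correct, optimizer, eps=1e-6):
    """One outcome-based RL update in the spirit of Algorithm 1.

    traj_log_probs:  (G, N) tensor, log pi_{theta,n} summed over tokens for
                     each of G sampled trajectories at each of N denoising steps.
    answers_correct: (G,) bool tensor from the rule-based answer check.
    """
    rewards = answers_correct.float()                         # 1 if correct, else 0
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)  # group-relative advantage A^g

    # Reinforce every step of the latent thinking trajectory with the same
    # outcome-based advantage; no per-step supervision is used.
    loss = -(adv.unsqueeze(-1) * traj_log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```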
Experiments and Results:
DCoLT was evaluated on math reasoning (GSM8K, MATH, Sudoku) and code generation (MBPP, HumanEval) tasks.
- DCoLT-SEDD (400M model):
- Significantly outperformed SFT-based CoT (GPT-2) and DoT (SEDD) baselines.
- Achieved 96.2% on Sudoku 4×4 (vs. 79.4% for SEDD+DoT) and 57.0% on GSM8K-Aug (vs. 53.5% for SEDD+DoT).
- Analysis (Appendix A) showed DCoLT-SEDD learned an "easy-to-hard" progressive generation strategy for Sudoku and a non-linear, sample-dependent generation order for GSM8K-Aug, unlike the more rigid patterns of baselines.
- DCoLT-LLaDA (8B model):
- Achieved state-of-the-art results among existing DLMs.
- GSM8K: 88.1% (+9.8% over LLaDA baseline)
- MATH: 44.6% (+5.7% over LLaDA baseline)
- MBPP: 51.6% (+11.4% over LLaDA baseline)
- HumanEval: 59.1% (+19.5% over LLaDA baseline)
- These results were achieved using only public data and 16 H800 GPUs; DCoLT-LLaDA outperformed other DLMs trained with SFT or RL and remained competitive with autoregressive models trained on significantly more proprietary data.
- Ablation studies confirmed the importance of the UPM and the benefit of the adaptive layer normalization within it.
- Longer generation sequence lengths at inference time (without retraining) improved performance on MATH, with further gains if fine-tuned on longer sequences.
- Analysis (Appendix B) visualized the thinking process: key information (numbers, symbols) tends to emerge early, with the grammatically connecting text filled in around it later. The UPM's ranking scores effectively guided unmasking toward more confident tokens.
Key Contributions:
- Introduction of DCoLT: A novel reasoning paradigm that leverages the reverse diffusion process for lateral thinking, optimized via outcome-based RL.
- Demonstration of Lateral Thinking: Shows that DLMs can be trained to reason non-linearly and bidirectionally, often generating key concepts before filling in details.
- UPM for Masked DLMs: A learnable Unmasking Policy Module for discrete-time masked DLMs that optimizes the order of token generation/unmasking.
- Strong Empirical Results: Significant performance improvements on complex reasoning tasks (math, code) for both continuous and discrete-time DLMs, showcasing the effectiveness and data efficiency of DCoLT.
Limitations:
- Current models are trained with limited data and compute, suggesting room for improvement with more resources.
- DCoLT has only been validated on tasks with objectively verifiable reward functions. Extending it to tasks requiring subjective rewards (via a learned reward model) is future work.
In summary, DCoLT presents a promising direction for enhancing the reasoning capabilities of diffusion LLMs by explicitly reinforcing their multi-step, non-linear generation process as a "chain of lateral thought." The method allows models to explore more diverse and flexible reasoning paths, leading to improved performance on challenging tasks.