- The paper introduces Diffusion-of-Thought (DoT), a novel paradigm that embeds chain-of-thought reasoning within an iterative denoising process.
- The methodology combines classifier-free guidance with scheduled sampling, achieving 100% accuracy and up to 27× higher throughput on multi-digit arithmetic tasks.
- The paper demonstrates that DoT enables bidirectional self-correction and flexible computation-reasoning trade-offs, improving performance on challenging math benchmarks.
Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion LLMs
The work "Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion LLMs" (2402.07754) investigates the integration of chain-of-thought (CoT) reasoning into diffusion-based LLMs, proposing a paradigm referred to as Diffusion-of-Thought (DoT). The approach systematically contrasts the classical sequential, autoregressive (AR) methods, such as those present in GPT-2 and derivatives, with novel parallel and iterative reasoning mechanisms enabled by score-based diffusion models. The paper provides a detailed empirical and methodological analysis of DoT's capabilities, particularly on mathematical reasoning tasks, and rigorously evaluates its performance, efficiency, and underlying properties relevant to reasoning in neural sequence models.
Methodological Foundation
DoT is designed to generate intermediate reasoning steps (i.e., "thoughts" or rationales) within the forward and reverse processes of a diffusion model. Unlike conventional AR models that generate tokens strictly left to right, DoT iteratively denoises a continuous latent representation in which the reasoning steps are embedded. The noising and denoising schedules are engineered so that only the rationale component of the input is stochastically perturbed while the problem context is held fixed, enabling conditional reasoning.
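This partial-noising idea can be made concrete with a short sketch. The snippet below is illustrative rather than the authors' implementation: it assumes a continuous-embedding diffusion LM, and the names (`partially_noise`, `rationale_mask`, `alpha_bar_t`) are hypothetical. It simply keeps the question embeddings clean while corrupting only the rationale positions.

```python
import torch

def partially_noise(x0, rationale_mask, alpha_bar_t):
    """Forward (noising) step applied only to rationale positions.

    x0             : (batch, seq_len, dim) clean token embeddings
    rationale_mask : (batch, seq_len) tensor, 1.0 at rationale/answer positions,
                     0.0 at the fixed problem-context positions
    alpha_bar_t    : cumulative noise-schedule value at step t (float in (0, 1])
    """
    noise = torch.randn_like(x0)
    # Standard Gaussian diffusion forward kernel q(x_t | x_0).
    x_t = (alpha_bar_t ** 0.5) * x0 + ((1.0 - alpha_bar_t) ** 0.5) * noise
    # Keep question/context embeddings clean; corrupt only rationale positions.
    mask = rationale_mask.unsqueeze(-1)
    return mask * x_t + (1.0 - mask) * x0
```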
Key implementation features include:
- Classifier-free guidance for conditioning on queries: Rather than relying on gradient-based token guidance, the approach employs classifier-free guidance for stable control over the denoising process, improving reliability in token-critical tasks such as arithmetic reasoning (see the first sketch after this list).
- Self-correction via scheduled sampling: A scheduled sampling regime is applied during training, exposing the model to its own intermediate predictions during denoising, which enhances robustness to accumulated reasoning errors, a notable limitation of standard AR CoT models (see the second sketch after this list).
- Multi-pass and glancing sampling variants: The multi-pass extension iterates generation of rationales in causal order, resembling AR CoT but with parallel denoising within each rationale. Glancing sampling further enables the model to reference partially generated future thoughts, approximating a non-causal, global refinement procedure.
- Inference acceleration via DPM-Solver and flexibility in reasoning-efficiency trade-offs: Denoising step count (T) becomes a controllable parameter, allowing practical adjustment of compute budget versus output quality.
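The first of these features can be illustrated with a minimal, hedged sketch of a classifier-free-guided denoising step. The denoiser is assumed to predict the clean embeddings x_0 from the noisy input; `model`, `null_cond`, and the guidance weight `w` are assumptions for illustration.

```python
import torch

@torch.no_grad()
def cfg_predict_x0(model, x_t, t, cond, null_cond, w=2.0):
    """One classifier-free-guided prediction of the clean rationale embeddings.

    model(x_t, t, cond) is assumed to return a prediction of x_0.
    w = 0 recovers the unconditional prediction, w = 1 the purely conditional one;
    larger w pushes harder toward the query-conditioned direction.
    """
    x0_cond = model(x_t, t, cond)          # conditioned on the question
    x0_uncond = model(x_t, t, null_cond)   # conditioned on a learned "null" query
    return x0_uncond + w * (x0_cond - x0_uncond)
```

The number of reverse steps at which such a prediction is invoked is exactly the controllable T from the last bullet: fewer steps (e.g., via a fast solver such as DPM-Solver) reduce latency at some cost in output quality.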
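The scheduled-sampling bullet can likewise be sketched. The function below is an assumption-laden illustration, not the paper's code: with probability `p_self` it builds the training input from the model's own prediction of x_0 rather than the ground truth, so the denoiser is trained on inputs that contain its own mistakes.

```python
import torch

def scheduled_sampling_input(model, x0, t, cond, alpha_bar_t, p_self=0.5):
    """Build the noised training input x_t, sometimes from the model's own guess.

    With probability p_self the clean source is replaced by the model's current
    prediction of x_0, exposing the denoiser to its own errors during training.
    """
    source = x0
    if torch.rand(()).item() < p_self:
        with torch.no_grad():
            noise = torch.randn_like(x0)
            x_t_tmp = (alpha_bar_t ** 0.5) * x0 + ((1.0 - alpha_bar_t) ** 0.5) * noise
            source = model(x_t_tmp, t, cond)   # model's current guess of the clean data
    # Noise whichever source we ended up with to form the actual training input.
    noise = torch.randn_like(source)
    return (alpha_bar_t ** 0.5) * source + ((1.0 - alpha_bar_t) ** 0.5) * noise
```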
Experimental Results
Mathematical Reasoning Tasks
Empirical evaluation centers on multi-digit arithmetic (4x4 and 5x5 multiplication; BIG-bench) and grade school math (GSM8K). The principal results are:
- DoT achieves 100% accuracy and vastly superior throughput in multi-digit multiplication: When trained from scratch, DoT attains perfect accuracy at up to 27× higher throughput than AR CoT GPT-2; for instance, it reaches 62.5 it/s, compared to 0.8 it/s for GPT-2 CoT-finetune-large.
- On GSM8K, DoT matches or surpasses AR and implicit CoT baselines with substantial efficiency benefits: Fine-tuned from Plaid 1B (the largest publicly available diffusion LM), DoT achieves 32.6% accuracy, rising to 36.3% with self-consistency. The multi-pass DoT variant further improves performance to 37.7% (40.3% with self-consistency), approaching fine-tuned GPT-2 CoT performance and significantly outperforming implicit CoT models of similar size.
Properties and Ablations
- Computation-reasoning tradeoff: By adjusting diffusion steps, DoT flexibly trades off inference cost and accuracy. This enables task-adaptive deployment, where simpler queries require fewer denoising passes for accurate output.
- Self-correction capabilities: Case studies reveal that DoT can revise both early and later reasoning steps during denoising—a distinction from strictly left-to-right AR generation—demonstrating both backward and forward error correction.
- Diversity and self-consistency: DoT inherently generates diverse reasoning trajectories via noise sampling, and when coupled with self-consistency voting (sketched below), accuracy improves measurably as the sample count increases.
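A hedged sketch of the self-consistency voting referenced above; `extract_final_answer` is a hypothetical helper (here, a simple regex over the rationale text), not part of the paper's code.

```python
import re
from collections import Counter

def extract_final_answer(rationale: str) -> str:
    """Illustrative extractor: take the last number that appears in the rationale."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", rationale)
    return numbers[-1] if numbers else ""

def self_consistency_answer(rationales: list[str]) -> str:
    """Majority vote over final answers from independently sampled DoT rationales;
    samples differ only in their diffusion noise draws."""
    votes = Counter(extract_final_answer(r) for r in rationales)
    return votes.most_common(1)[0][0]

# e.g. self_consistency_answer(["... so 6*7 = 42", "... 6*7 = 43", "... = 42"]) -> "42"
```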
Theoretical and Practical Implications
This work establishes that score-based diffusion LLMs, when equipped with explicit CoT procedures, are effective vehicles for complex algorithmic and mathematical reasoning. Several implications and directions merit further attention:
- Global reasoning and non-causal inference: Unlike AR LMs that strictly condition sequentially, DoT’s iterative denoising allows reasoning steps to influence each other bi-directionally, facilitating global revisions. This property is particularly useful in domains where interdependencies between intermediate reasoning states are nontrivial.
- Efficiency and scalability: The parallelism inherent in diffusion approaches (especially when T is small) yields large efficiency gains, making DoT attractive for scenarios demanding high-throughput or low-latency reasoning. However, the efficacy of DoT on more linguistically demanding reasoning tasks remains constrained by the current scale and generality of available diffusion LMs.
- Sampling-based diversity: Diversity in generated rationales comes at minimal computational overhead, and the natural alignment with self-consistency or majority-vote strategies suggests utility in low-resource or robust reasoning applications.
- Limitations and scaling potential: While DoT lags behind large AR LMs (e.g., GPT-3/4) on open-ended reasoning benchmarks due to model size and training-data limitations, the training recipe and architectural modifications are orthogonal to model scaling. As diffusion LMs mature and grow in scale, these results suggest that DoT-like methods could become increasingly competitive for general-purpose reasoning.
Future Developments and Open Questions
Several promising directions flow from this work:
- Scaling and instruction tuning: As pre-trained diffusion LMs approach the scale and diversity of AR counterparts, combining DoT with multi-task or instruction-tuned objectives could close the observed performance gap on challenging benchmarks.
- Integration with advanced reasoning and search algorithms: The compatibility of DoT with techniques such as Tree-of-Thoughts or reflexion-based decoding could further enhance complex problem-solving.
- Early exit and adaptive reasoning: Given the model's capacity to maintain predictions across steps, integrating early-exit criteria or adaptive step selection could further optimize efficiency, especially for workloads with heterogeneous difficulty (a minimal sketch follows).
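A minimal sketch of the early-exit idea, under the assumption that the sampler exposes a per-step prediction of the clean sequence; the stopping rule and the simplistic step update below are illustrative, not the paper's algorithm.

```python
import torch

@torch.no_grad()
def denoise_with_early_exit(model, x_T, cond, timesteps, tol=1e-3):
    """Run the reverse process, but stop once the predicted clean embeddings
    stabilize between consecutive steps (mean absolute change below tol)."""
    x_t = x_T
    prev_x0 = None
    for t in timesteps:                      # e.g. a reversed schedule T-1, ..., 0
        x0_hat = model(x_t, t, cond)         # predicted clean embeddings at this step
        if prev_x0 is not None and (x0_hat - prev_x0).abs().mean() < tol:
            return x0_hat                    # prediction has stabilized: exit early
        prev_x0 = x0_hat
        x_t = x0_hat                         # placeholder update; a real sampler re-noises toward t-1
    return prev_x0
```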
In summary, this paper demonstrates that diffusion models, when endowed with DoT-style chain-of-thought reasoning, form a compelling alternative to autoregressive models for tasks prioritizing throughput, self-correction, and global reasoning. The architectural and training choices outlined provide a pathway for the continued development of reasoning-capable diffusion LMs in both research and practical deployments.