- The paper introduces a self-evolving, multi-modal framework that jointly trains LLMs on natural language, code, and truth table reasoning.
- The method uses a majority voting mechanism across modalities, yielding up to a +11.7pp average accuracy improvement on logical reasoning benchmarks.
- The novel truth table chain-of-thought effectively mitigates common errors in LLM reasoning, such as missing branches and invalid converses.
This paper introduces Mixture-of-Thought (MoT), a framework designed to enhance the logical reasoning capabilities of LLMs by enabling them to learn and reason across multiple complementary modalities: natural language, code, and a novel symbolic modality based on truth tables. The core idea is that human reasoning often involves switching between different representational formats, and current LLMs, typically trained and inferring in a single modality (usually natural language), lack this flexibility.
The authors identify key limitations in existing approaches:
- Most LLM-based methods operate with a single reasoning modality during training.
- While some methods use modality selection or augmentation at inference, the training process remains modality-blind, limiting synergy.
- Natural language reasoning in LLMs often suffers from errors like missing branches in disjunctions and invalid converse inferences.
To address these, MoT proposes a two-phase design:
- Self-Evolving MoT Training: This phase aims to jointly improve the model's reasoning ability in each modality.
- It starts by prompting an LLM to generate reasoning rationales (thoughts) for problems in each of the three modalities:
- Natural Language CoT: Standard chain-of-thought reasoning in natural language.
- Code CoT: The problem is translated into Python code (not executed, but treated as a structured logical representation), and reasoning proceeds based on this code.
- Truth Table CoT: A new symbolic approach where the LLM:
- Grounds first-order logic into propositional predicates.
- Constructs a partial truth table by pruning assignments that violate premises.
- Infers the answer by checking if the conclusion holds across remaining valid assignments. This is designed to mitigate missing branches and invalid converse errors common in NL reasoning.
- These generated rationales are filtered for correctness (correct final answer) and format consistency (e.g., presence of specific tags like `<nl_cot>`, `<code>`, and `<truth_table>`, and valid code structures).
- The LLM is then fine-tuned on this high-quality, multi-modal dataset of self-generated rationales.
- This process is iterative: the improved model from one round generates better rationales for the next round, leading to self-evolution. The training uses an on-policy approach, fine-tuning the model from the previous round, and few-shot prompting is only used in the first round.
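The truth-table procedure outlined above (ground into propositions, prune assignments that violate premises, check the conclusion on the survivors) can be sketched in executable form. This is a minimal illustration with hypothetical helper names; in the paper the LLM carries out these steps as a textual chain of thought, not as run code.

```python
# Minimal sketch of the Truth Table CoT procedure. Premises and the
# conclusion are modeled as predicates over a truth assignment.
from itertools import product

def truth_table_infer(variables, premises, conclusion):
    """Enumerate all assignments, prune those violating any premise,
    then check whether the conclusion holds on every survivor."""
    survivors = []
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(p(assignment) for p in premises):  # prune violating rows
            survivors.append(assignment)
    if not survivors:
        return "Uncertain"  # premises are unsatisfiable
    results = {conclusion(a) for a in survivors}
    if results == {True}:
        return "True"
    if results == {False}:
        return "False"
    return "Uncertain"  # conclusion varies across valid rows

# Example: premises {A -> B, A}; conclusion B (modus ponens).
answer = truth_table_infer(
    ["A", "B"],
    premises=[lambda s: (not s["A"]) or s["B"], lambda s: s["A"]],
    conclusion=lambda s: s["B"],
)
```

Because every assignment is enumerated, no disjunctive branch can be silently skipped, and the converse of an implication is never treated as valid, which is exactly the failure mode this modality targets.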
- MoT Inference: During inference, the trained MoT model generates responses for a given problem in all three modalities (elicited by specific tags).
- A simple majority voting mechanism is then applied to the answers produced by each modality to determine the final prediction. Ties are broken randomly.
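The inference-time aggregation step is simple enough to state directly. A sketch, assuming one answer string per modality (function names are illustrative, not the paper's code):

```python
# Majority vote over per-modality answers, with random tie-breaking.
import random
from collections import Counter

def majority_vote(answers, rng=random):
    """answers: list of predictions, e.g. from the NL, code, and
    truth-table modalities. Returns the most common answer; ties
    are broken uniformly at random."""
    counts = Counter(answers)
    top = max(counts.values())
    winners = [a for a, c in counts.items() if c == top]
    return rng.choice(winners)

# Two of the three modalities agree, so "True" wins outright.
pred = majority_vote(["True", "True", "False"])
```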
Key Contributions and Findings:
- Introduction of Truth Table CoT: A novel symbolic reasoning modality for LLMs that systematically enumerates logical cases, addressing specific failure modes of natural language reasoning.
- Self-Evolving MoT Training Algorithm: A method to jointly train an LLM across multiple modalities using filtered, self-generated data, improving reasoning in each modality and fostering synergy.
- Demonstrated Complementarity: Experiments show that the three modalities (NL, Code, Truth Table) often solve different sets of problems, and their union covers a larger set of problems than any single modality or pair of modalities.
- Significant Performance Gains:
- MoT consistently and significantly outperformed strong LLM baselines (Gemma-2-2B-IT, Gemma-2-9B-IT, Qwen-2.5-7B-Instruct) using single-modality chain-of-thought on logical reasoning benchmarks like FOLIO and ProofWriter, achieving up to +11.7pp average accuracy gain.
- The MoT-trained 9B parameter model matched the performance of GPT-4 + Logic-LM on FOLIO.
- Effectiveness of MoT Training: MoT training leads to better performance in each individual modality compared to training on a single modality alone. A single MoT-trained model can switch between modalities effectively.
- Benefits for Harder Problems: MoT shows greater improvements on more difficult logical reasoning problems, especially those with greater reasoning depths.
- Error Analysis:
- Natural language reasoning often fails due to "missing branches" (not considering all cases in disjunctions) and "invalid converse" errors.
- The truth-table modality helps overcome these specific bottlenecks by explicitly enumerating possibilities.
- Efficient Test-Time Scaling: MoT sampling (sampling an equal number of responses from each modality for a given budget) achieves higher pass@k scores compared to single-thought sampling, indicating better use of inference compute due to increased response diversity.
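One way to read "sampling an equal number of responses from each modality for a given budget" is a near-even split of the k samples across the three modalities; the sketch below assumes that reading (the paper may allocate differently when k is not divisible by three):

```python
# Split a fixed inference budget k as evenly as possible across the
# three reasoning modalities (allocation scheme is an assumption).
def allocate_budget(k, modalities=("nl_cot", "code", "truth_table")):
    base, extra = divmod(k, len(modalities))
    # The first `extra` modalities receive one additional sample.
    return {m: base + (1 if i < extra else 0)
            for i, m in enumerate(modalities)}

alloc = allocate_budget(8)
# e.g. {"nl_cot": 3, "code": 3, "truth_table": 2}
```

Spreading the budget this way increases response diversity relative to drawing all k samples from one modality, which is the mechanism the paper credits for the higher pass@k.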
Implementation Details:
- Modality-Specific Tags: Tags like `<nl_cot>`, `<code>`, and `<truth_table>` are used to guide the LLM during both training data generation and inference.
- Quality Filtering: Generated rationales are filtered based on the correctness of the final answer and format consistency (e.g., presence of `def` and `class` in code traces).
- Training Setup: Used models like Gemma-2-2B-IT, Gemma-2-9B-IT, and Qwen-2.5-7B-Instruct. Training involved 2-3 rounds of self-evolution, fine-tuning for 2 epochs per round with a learning rate of 2e-5.
- Datasets: FOLIO and ProofWriter (depth-5 subset).
- Inference: vLLM engine used for efficiency. Temperature 0.7 for evaluation.
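The quality filter described above reduces to two cheap checks per trace: answer correctness and presence of the modality's format markers. A sketch, with illustrative tag and keyword checks (the paper's exact criteria may differ):

```python
# Keep a self-generated rationale only if its final answer matches the
# gold label and its modality-specific format markers are present.
def passes_filter(trace, modality, predicted, gold):
    if predicted != gold:                      # answer-correctness filter
        return False
    tag = {"nl": "<nl_cot>", "code": "<code>",
           "table": "<truth_table>"}[modality]
    if tag not in trace:                       # format-consistency filter
        return False
    if modality == "code" and not ("def" in trace and "class" in trace):
        return False                           # valid code structure
    return True
```

Only traces that pass both checks enter the fine-tuning set for the next self-evolution round.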
The paper argues that by explicitly training LLMs to operate across complementary reasoning modalities, they can achieve more robust, versatile, and human-like logical reasoning. The self-evolving training allows the model to bootstrap its capabilities without requiring extensive manually annotated multi-modal reasoning traces.