Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in LLMs
This paper introduces Divergent Chain of Thought (DCoT), a novel approach to enhancing the reasoning capabilities of LLMs. DCoT builds on Chain of Thought (CoT) prompting, which improves performance by generating intermediate reasoning steps, and extends it by generating and comparing multiple reasoning chains within a single inference step, potentially increasing the accuracy of the model's final answers.
Methodology
The key innovation of the DCoT framework is that it trains LLMs to produce several divergent reasoning paths before arriving at a final answer. The design is inspired by cognitive theories of Divergent and Convergent Thinking, which describe problem-solving in two phases: generating many candidate ideas (the divergent phase) and synthesizing them into a single solution (the convergent phase).
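To make the two phases concrete, here is a minimal sketch of how a single DCoT-style generation might be structured and parsed. The "CoT n:" and "Answer:" markers and the example response are illustrative assumptions, not the paper's verbatim template.

```python
import re

# Illustrative DCoT-style output with k = 2 chains (markers and content
# are assumed for illustration, not taken from the paper).
response = (
    "CoT 1: 4 pens at $3 each cost 4 * 3 = $12, so change from $20 is $8.\n"
    "CoT 2: Adding $3 four times gives $12; $20 - $12 = $8.\n"
    "Answer: $8"
)

def parse_dcot(text: str) -> tuple[list[str], str]:
    """Split one DCoT generation into its divergent chains (divergent
    phase) and the single final answer (convergent phase)."""
    chains = re.findall(r"CoT \d+: (.+)", text)
    answer = re.search(r"Answer: (.+)", text).group(1)
    return chains, answer

chains, answer = parse_dcot(response)
print(f"{len(chains)} chains generated; final answer: {answer}")
# -> 2 chains generated; final answer: $8
```

The point of the format is that all chains and the final answer come out of one autoregressive pass, rather than from repeated sampling.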
For implementation, DCoT requires fine-tuning models on datasets that contain multiple reasoning paths per question, so the model learns to generate several candidate solutions and then select among them. This addresses a limitation of prior approaches, which could obtain multiple inference chains only through separate, independent sampling calls rather than within a single generation.
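As a sketch of what such fine-tuning data could look like, the following assembles one training example from several annotated reasoning paths; the prompt template and field names are assumptions for illustration, not the paper's exact format.

```python
def build_dcot_example(question: str, chains: list[str], answer: str) -> dict:
    """Assemble one DCoT fine-tuning example: the prompt requests k chains,
    and the target concatenates the k divergent chains followed by the
    final answer, all to be emitted in a single pass."""
    k = len(chains)
    prompt = (
        f"Question: {question}\n"
        f"Give {k} different chains of thought, then state the final answer."
    )
    target = "\n".join(
        [f"CoT {i + 1}: {c}" for i, c in enumerate(chains)]
        + [f"Answer: {answer}"]
    )
    return {"prompt": prompt, "completion": target}

example = build_dcot_example(
    "A train travels 60 km in 1.5 hours. What is its average speed?",
    ["Speed = distance / time = 60 / 1.5 = 40 km/h.",
     "In 1.5 hours it covers 60 km, so per hour it covers 60 / 1.5 = 40 km."],
    "40 km/h",
)
```

Fine-tuning on examples of this shape is what teaches the model to emit several chains before committing to an answer, instead of relying on separate sampling calls at inference time.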
Results
Experiments spanned models ranging from 1.3B to 70B parameters and demonstrated consistent improvements over CoT baselines. Notably, the empirical results show that even smaller, more accessible LLMs benefit from this fine-tuning approach, and the performance gains held across a variety of tasks, indicating broad applicability.
Quantitatively, the work showed improvements in task performance across datasets covering mathematics, logic, and multi-hop reasoning. Furthermore, DCoT allowed some models to improve their accuracy without any external feedback, indicating a self-correcting capability in which later chains revise errors made in earlier ones, a novel advancement in the field.
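One simple way to quantify that self-correcting behavior is to count cases where the first chain's intermediate answer is wrong but the final answer is right; the record fields below are hypothetical names chosen for illustration.

```python
def self_correction_rate(records: list[dict]) -> float:
    """Fraction of examples in which a later chain corrected an earlier
    one: the first chain's answer misses the gold label, yet the final
    (convergent) answer matches it. Field names are illustrative."""
    corrected = sum(
        1 for r in records
        if r["first_chain_answer"] != r["gold"] and r["final_answer"] == r["gold"]
    )
    return corrected / len(records)
```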
Implications and Future Directions
The implications of this research are multifaceted. Practically, the introduction of DCoT empowers smaller models to achieve enhanced performance, making high-quality reasoning tasks more accessible without requiring extensive computational resources. This democratizes access to powerful AI and broadens the range of applications for which these LLMs can be effectively utilized.
Theoretically, the success of this method suggests that further exploration into divergent thinking strategies might unlock additional reasoning capabilities in LLMs. The framework presents a new paradigm where multi-step reasoning does not rely solely on external oversight or feedback loops.
Future research may explore the integration of DCoT within larger, more context-rich models or alternative reasoning paradigms such as code prompting or graph-based reasoning. Additionally, investigating how the number of reasoning chains generated affects performance could offer deeper insights into optimizing model training and inference strategies.
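One way to probe the effect of chain count is a simple accuracy ablation over k. The sketch below reuses parse_dcot from the earlier example and assumes a hypothetical generate callable standing in for whatever inference API is available.

```python
from typing import Callable

def ablate_num_chains(
    generate: Callable[[str], str],    # hypothetical model-inference call
    eval_set: list[tuple[str, str]],   # (question, gold answer) pairs
    ks: list[int] = [1, 2, 3, 4],
) -> dict[int, float]:
    """Measure accuracy as a function of k, the number of requested chains."""
    accuracy = {}
    for k in ks:
        correct = 0
        for question, gold in eval_set:
            prompt = (
                f"Question: {question}\n"
                f"Give {k} different chains of thought, then state the final answer."
            )
            _, answer = parse_dcot(generate(prompt))  # parser from earlier sketch
            correct += answer.strip() == gold.strip()
        accuracy[k] = correct / len(eval_set)
    return accuracy
```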
This research underscores the value of fine-tuning with complex reasoning data and sets the stage for subsequent advancements in enhancing AI reasoning through refined model training techniques.