Self-Harmonized Chain of Thought
The paper "Self-Harmonized Chain of Thought" by Ziqi Jin and Wei Lu introduces ECHO, a novel method aiming to improve the performance of LLMs in complex reasoning tasks by unifying diverse rationale patterns into a coherent and effective solution structure. This work addresses some of the limitations inherent in existing Chain-of-Thought (CoT) prompting methods and proposes a mechanism that enhances the robustness and accuracy of reasoning processes across different domains.
Introduction
CoT prompting has significantly advanced the reasoning capabilities of LLMs by encouraging them to decompose complex problems into intermediate steps. CoT methods fall into two paradigms: Zero-shot-CoT and Few-shot-CoT. Zero-shot-CoT elicits reasoning chains without task-specific examples, using a universal trigger such as "Let's think step by step." Few-shot-CoT instead guides the reasoning process with human-crafted worked examples. Because hand-crafting examples is costly and does not scale, Auto-CoT was developed to automate few-shot prompting by clustering similar questions and generating rationales with Zero-shot-CoT.
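To make the two paradigms concrete, here is a minimal sketch of how the two prompt styles might be assembled. The trigger phrase comes from the Zero-shot-CoT literature; the function names and the Q/A demonstration format are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the two CoT prompting paradigms.
# Helper names and the Q/A layout are illustrative assumptions.

ZERO_SHOT_TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question: str) -> str:
    # Zero-shot-CoT: one universal trigger, no worked examples.
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"

def few_shot_cot_prompt(demos: list[tuple[str, str]], question: str) -> str:
    # Few-shot-CoT: prepend (question, rationale) demonstrations
    # before the target question to guide the model's reasoning.
    blocks = [f"Q: {q}\nA: {rationale}" for q, rationale in demos]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```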
However, Auto-CoT's reliance on similarity can cause "misleading by similarity," where reasoning errors in similar demonstrations propagate and mislead the model. Auto-CoT mitigates this by diversifying its demonstrations, but diversity introduces new problems, such as irrelevant or overly complex rationale patterns. ECHO aims to overcome both issues by unifying diverse rationale patterns into a coherent framework through a self-harmonization process.
Method
ECHO's method consists of three main steps:
- Question Clustering: Questions from the dataset are embedded with Sentence-BERT and grouped into k clusters using the k-means algorithm.
- Demonstration Sampling: From each cluster, a representative question is selected and an initial rationale is generated with Zero-shot-CoT; candidates are filtered by simple heuristics, such as limits on question length and on the number of reasoning steps (a sketch of these two steps follows the list).
- Demonstration Unification: An iterative process where each rationale is regenerated using others in the set as in-context examples, refining into a consistent pattern over iterations.
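Below is a minimal sketch of the clustering and sampling steps, assuming Sentence-BERT embeddings via the sentence-transformers library and scikit-learn's k-means. The model name and the centroid-proximity selection heuristic are assumptions; the Zero-shot-CoT generation and filtering steps are left abstract.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_and_sample(questions: list[str], k: int) -> list[str]:
    """Cluster questions and pick one representative per cluster:
    the question closest to its cluster centroid."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    embeddings = encoder.encode(questions, normalize_embeddings=True)

    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

    representatives = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        representatives.append(questions[members[np.argmin(dists)]])
    return representatives
```

Each representative question would then be answered with the Zero-shot-CoT trigger and kept only if it passes the filtering heuristics.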
The unification process updates each rationale over multiple iterations so that the set converges toward a shared solution pattern, improving the robustness and generalizability of the demonstrations; a minimal sketch of this loop follows.
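The sketch below assumes an abstract `generate_rationale(question, context)` call that queries the LLM with the remaining demonstrations as in-context examples; the names and iteration structure illustrate the described process rather than reproduce the paper's exact implementation.

```python
def unify_demonstrations(demos, generate_rationale, num_iterations=3):
    """Iteratively regenerate each rationale using the other
    demonstrations as in-context examples, so the set converges
    toward a consistent solution pattern."""
    demos = list(demos)  # list of (question, rationale) pairs
    for _ in range(num_iterations):
        for i, (question, _) in enumerate(demos):
            # All other demonstrations serve as in-context examples.
            context = demos[:i] + demos[i + 1:]
            demos[i] = (question, generate_rationale(question, context))
    return demos
```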
Experimental Results
ECHO was evaluated across ten datasets in three reasoning domains: arithmetic, commonsense, and symbolic reasoning. The results indicate that ECHO surpasses Auto-CoT by approximately 2.8% in overall performance. Notable findings include:
- ECHO shows competitive results in domains like symbolic reasoning, achieving high accuracy and consistency across multiple iterations.
- However, performance varies with the number of unification iterations and with the inherent complexity of the tasks.
Comparison with Manual Prompts
A comparison with manually crafted prompts showed that automatically generated rationales initially lagged behind, but after ECHO's unification they matched, and sometimes exceeded, manual prompt performance. This suggests the model can learn cohesive patterns from diverse and even imperfect demonstrations.
Effect of Hyperparameters
ECHO's performance is sensitive to the dataset and the chosen hyperparameters. More unification iterations brought improvements at first but eventually showed signs of overfitting, indicating that the iteration count must be balanced for sustained performance.
Implications and Future Direction
The research suggests that ECHO's self-harmonization mechanism mitigates both the misleading effects of similarity and the pitfalls of overly diverse demonstrations. The framework offers a scalable, robust approach to improving CoT reasoning in LLMs without extensive manual intervention. Future work could explore adaptive mechanisms to further refine the unification process and extend the methodology to broader, more diverse datasets.
Conclusion
ECHO represents a significant step towards automating and harmonizing the CoT process in LLMs, reducing the need for labor-intensive manual prompt design while enhancing model performance across varied reasoning tasks. The promising results and insights from this paper pave the way for further advancements in automatic reasoning frameworks in artificial intelligence.