
Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models

Published 23 Apr 2023 in cs.CL and cs.AI | (2304.11657v3)

Abstract: LLMs can achieve highly effective performance on various reasoning tasks by incorporating step-by-step chain-of-thought (CoT) prompting as demonstrations. However, the reasoning chains of demonstrations generated by LLMs are prone to errors, which can subsequently lead to incorrect reasoning during inference. Furthermore, inappropriate exemplars (overly simplistic or complex) can affect overall performance across varying levels of difficulty. We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains. By utilizing iterative bootstrapping, our approach enables LLMs to autonomously rectify errors, resulting in more precise and comprehensive reasoning chains. Simultaneously, our approach selects challenging yet answerable questions accompanied by reasoning chains as exemplars with a moderate level of difficulty, which enhances the LLMs' generalizability across varying levels of difficulty. Experimental results indicate that Iter-CoT exhibits superiority, achieving competitive performance across three distinct reasoning tasks on ten datasets.


Summary

  • The paper introduces Iter-CoT, an iterative bootstrapping pipeline with self-correction that enhances chain-of-thought demonstrations in LLMs.
  • It uses iterative error feedback to systematically refine and summarize rationales, improving accuracy across multiple complex reasoning benchmarks.
  • Empirical evaluations on arithmetic, commonsense, and symbolic tasks show robust performance gains and transferability even in label-free settings.

Iterative Bootstrapping for Chain-of-Thought Prompting in LLMs: The Iter-CoT Framework

Introduction and Motivation

Chain-of-Thought (CoT) prompting is a dominant paradigm for eliciting multi-step reasoning in LLMs, leveraging exemplar rationales to facilitate in-context learning. However, current approaches to constructing CoT demonstrations are hampered by three critical issues: exemplar selection misaligned with task difficulty, error propagation via flawed reasoning chains in demonstrations, and the absence of explicit self-correction or iterative refinement during demonstration generation. "Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in LLMs" (2304.11657) presents Iter-CoT, an iterative bootstrapping pipeline for constructing high-quality, self-corrected CoT demonstrations optimized for model performance.

Limitations of Existing CoT Paradigms

The two primary regimes for CoT prompting—manual authoring and automatic generation—exhibit complementary but unresolved limitations. Manual CoT annotation incurs significant cost, introduces human bias, and limits scalability. Automated CoT (e.g., Zero-Shot-CoT) leverages LLM outputs but is sensitive to exemplar correctness and quality; automatic methods often produce demonstrations containing erroneous rationales, which empirically degrade downstream inference accuracy.

Iter-CoT systematically probes these failure modes. It shows that inappropriate exemplar complexity (e.g., overly simple rationales for high-hop queries, or vice versa) sharply reduces the transferability of reasoning (Figure 1), and that increasing rates of erroneous exemplars have a monotonic, negative effect on end-task accuracy (Figure 2). Moreover, prior work fails to utilize feedback or history from failed attempts, missing the opportunity for CoT demonstrations to exploit LLMs' inherent self-correction abilities (Figure 3).

Figure 1: Effect of demonstration complexity on GSM8K for queries requiring varying numbers of reasoning hops.

Figure 2: Degradation of LLM accuracy on GSM8K, CSQA, and Letter tasks with increasing rates of erroneous exemplars.

Iter-CoT: Iterative Bootstrapping for Demonstration Construction

Iter-CoT orchestrates demonstration pool generation as a multi-phase, model-in-the-loop pipeline (see Figure 4):

  1. Initialization: Apply Zero-Shot-CoT to obtain initial rationales and answers; identify incorrectly answered items.
  2. Bootstrapping (Self-Correction): Iteratively prompt the model with error feedback ("Your answer is not right; can you think more carefully and give me the final answer?") on failed cases. Repeat until the generated answer is correct.
  3. Summarization: Prompt the model to provide a final, summarized solution incorporating the complete reasoning history accumulated during bootstrapping.
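The three phases above can be sketched as a simple loop. The following is a minimal, runnable illustration with the LLM call stubbed out; `call_llm`, `extract_answer`, and `build_demonstration` are illustrative names rather than the paper's implementation, and only the feedback prompt string is taken from the paper.

```python
FEEDBACK = ("Your answer is not right; can you think more carefully "
            "and give me the final answer?")

def call_llm(prompt):
    # Stub standing in for a real LLM API call: it answers wrongly at
    # first and "self-corrects" once the feedback appears in the prompt.
    return ("Step by step... so the final answer is 42."
            if FEEDBACK in prompt
            else "Step by step... so the final answer is 41.")

def extract_answer(response):
    # Naive extraction: last token with trailing punctuation stripped.
    return response.split()[-1].rstrip(".")

def build_demonstration(question, gold_answer, max_iters=4):
    """Return (history, summary) once the model answers correctly,
    or None if it never does (the question is deemed unanswerable)."""
    # Phase 1 -- initialization: Zero-Shot-CoT attempt.
    history = [f"Q: {question}\nA: Let's think step by step."]
    response = call_llm("\n".join(history))
    history.append(response)

    # Phase 2 -- bootstrapping: feed back the error signal and retry.
    for _ in range(max_iters):
        if extract_answer(response) == gold_answer:
            break
        history.append(FEEDBACK)
        response = call_llm("\n".join(history))
        history.append(response)
    if extract_answer(response) != gold_answer:
        return None  # never corrected within the budget: discard

    # Phase 3 -- summarization: condense the full correction history
    # into one clean exemplar rationale.
    summary = call_llm("\n".join(history)
                       + "\nPlease summarize the complete solution.")
    return history, summary

demo = build_demonstration("What is 6 * 7?", "42")
```

In a real pipeline the questions whose answers never match (or that the evaluator rejects, in the label-free setting) are dropped, so the demonstration pool ends up containing only "challenging yet answerable" exemplars.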

This sequential refinement exploits the LLM's intrinsic ability to self-correct given appropriate feedback, generating rationales that are not only correct but also more comprehensive and context-aware. Such bootstrapped exemplars are systematically more robust for in-context learning, even when initial attempts are flawed (Figure 5).

Figure 3: Iterative re-answering boosts the rate of correct final answers via self-correction feedback on GSM8K.

Figure 4: Iter-CoT workflow comprising demonstration pool initialization, iterative bootstrapping, and rationalization.

Figure 5: Challenging yet answerable exemplars, refined through iterative revision, enhance LLM generalization.

Empirical Evaluation

Iter-CoT is benchmarked on ten datasets spanning arithmetic (GSM8K, AQuA, AddSub, SingleEq, SVAMP, ASDiv), commonsense (CSQA, StrategyQA, Date Understanding), and symbolic reasoning (Letter Concatenation). The method supports both oracle (label-available) and label-free settings (using a stronger LLM, e.g., GPT-4, as an evaluator for answer correctness). Iter-CoT is evaluated on multiple foundation models: GPT-3.5-turbo, GPT-4, Llama-2-70B, and Llama-2-70B-Chat.
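In the label-free setting, the gold-answer check is replaced by a stronger model acting as judge. A minimal sketch under that assumption follows; `label_free_verify` and the `judge` callable are hypothetical names for illustration, not an API from the paper.

```python
def label_free_verify(question, candidate_answer, judge):
    # Ask a stronger LLM (the `judge` callable, e.g. backed by GPT-4)
    # to grade the answer in place of a gold label.
    prompt = (f"Question: {question}\n"
              f"Proposed answer: {candidate_answer}\n"
              "Is this answer correct? Reply yes or no.")
    return judge(prompt).strip().lower().startswith("yes")

# Stub judge for illustration: it "knows" that 42 is correct.
judge = lambda p: "Yes." if "42" in p else "No."
print(label_free_verify("What is 6 * 7?", "42", judge))  # -> True
print(label_free_verify("What is 6 * 7?", "41", judge))  # -> False
```

The paper's robustness result suggests this judge can be imperfect: demonstration quality degrades gracefully with evaluator noise.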

Key findings include:

  • Iter-CoT outperforms both manual and automated CoT baselines across all categories.
  • In a fully automatic, label-free regime, Iter-CoT remains competitive with oracle-label variants, indicating robustness to evaluator noise.
  • Application of Self-Consistency decoding further amplifies gains across arithmetic and multi-step tasks, with notable deltas (e.g., GSM8K: +8.2%).
  • Ablations confirm that both bootstrapping and summarization are critical: omitting either phase degrades demonstration efficacy and downstream accuracy.
  • Iter-CoT-generated reasoning chains are consistently longer and structurally richer, empirically supporting the claim of enhanced comprehensiveness (see Appendix, Figure 6).
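The Self-Consistency decoding mentioned above samples multiple reasoning chains and majority-votes over their final answers, and its gains compound with Iter-CoT's better exemplars. A minimal sketch of the voting step (the sampled answers are stubbed in place of real chains):

```python
from collections import Counter

def self_consistency(final_answers):
    # Majority vote over the final answers parsed from sampled chains.
    return Counter(final_answers).most_common(1)[0][0]

# Each element stands in for the answer parsed from one sampled
# chain-of-thought; real usage samples the LLM with temperature > 0.
votes = ["18", "18", "17", "18", "20"]
print(self_consistency(votes))  # -> 18
```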


Figure 7: Llama-2-70B-Chat model configuration for Iter-CoT evaluation.

Figure 6: Comparative reasoning chain lengths confirm the increased comprehensiveness of Iter-CoT demonstrations.

Figure 8: Iter-CoT's accuracy as a function of bootstrapping iterations (more bootstrapping yields harder but more informative exemplars).

Figure 9: Performance stratified by reasoning hops (complexity) on GSM8K; Iter-CoT delivers robust performance, especially as hop count increases.

Figure 10: Effect of seed exemplar count on Iter-CoT performance; the method is not strictly dependent on a large number of shots.

Practical and Theoretical Implications

The strong empirical results carry several implications:

  • Iterative, feedback-driven rationale generation should become a standard in automatic CoT pipeline design. The model's own capacity for self-correction and contextual rationalization surpasses traditional one-shot or single-pass approaches in demonstration pool quality.
  • Bootstrapped demonstrations alleviate the need for costly hand annotations, democratizing high-quality CoT pipeline construction for new tasks, especially in label-limited settings.
  • Selection of exemplars at intermediate difficulty supports better generalization, but including revised, error-corrected demonstrations is beneficial—not all faulty samples should be discarded. This is a departure from prior dogma that only perfect demonstrations should be retained during in-context learning.

The approach is model-agnostic and shows consistent gains on both proprietary (GPT-x) and open-source (Llama-2) LLMs, highlighting its transferability. However, it does introduce additional cost in constructing the demonstration pool, mainly during the iterative bootstrapping phase.

Future Directions

The Iter-CoT paradigm opens several avenues for further study:

  • Meta-learning for demonstration pool size and exemplar selection: Systematic exploration of demonstration pool composition (e.g., optimizing for coverage over reasoning types) could further improve generalization and reduce pool construction time.
  • Active learning integration: More efficient identification of "challenging yet answerable" exemplars could leverage uncertainty-based or dual-model selection approaches.
  • Evaluator model improvement: In label-free regimes, advances in auto-verification can increase robustness and minimize evaluator-induced bias in demonstration acceptance.

Conclusion

Iter-CoT establishes iterative bootstrapping, combined with self-correction and contextual summarization, as a substantive advance in constructing effective CoT prompts for LLMs. The framework achieves consistent, often state-of-the-art results across diverse task families and foundation models, and demonstrates that leveraging model-internal correction signals yields higher-quality, structurally richer demonstrations. Its generality and performance suggest that feedback-driven, multi-phase rationale construction should become a core principle in automatic CoT prompt engineering.

