
Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models

Published 23 Apr 2023 in cs.CL and cs.AI | (2304.11657v3)

Abstract: LLMs can achieve highly effective performance on various reasoning tasks by incorporating step-by-step chain-of-thought (CoT) prompting as demonstrations. However, the reasoning chains of demonstrations generated by LLMs are prone to errors, which can subsequently lead to incorrect reasoning during inference. Furthermore, inappropriate exemplars (overly simplistic or complex) can affect overall performance across varying levels of difficulty. We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains. By utilizing iterative bootstrapping, our approach enables LLMs to autonomously rectify errors, resulting in more precise and comprehensive reasoning chains. Simultaneously, our approach selects challenging yet answerable questions accompanied by reasoning chains as exemplars with a moderate level of difficulty, which enhances the LLMs' generalizability across varying levels of difficulty. Experimental results indicate that Iter-CoT exhibits superiority, achieving competitive performance across three distinct reasoning tasks on ten datasets.


Summary

  • The paper introduces Iter-CoT, an iterative bootstrapping pipeline with self-correction that enhances chain-of-thought demonstrations in LLMs.
  • It uses iterative error feedback to systematically refine and summarize rationales, improving accuracy across multiple complex reasoning benchmarks.
  • Empirical evaluations on arithmetic, commonsense, and symbolic tasks show robust performance gains and transferability even in label-free settings.

Iterative Bootstrapping for Chain-of-Thought Prompting in LLMs: The Iter-CoT Framework

Introduction and Motivation

Chain-of-Thought (CoT) prompting is a dominant paradigm for eliciting multi-step reasoning in LLMs, leveraging exemplar rationales to facilitate in-context learning. However, current approaches to constructing CoT demonstrations are hampered by three critical issues: exemplar selection misaligned with task difficulty, error propagation via flawed reasoning chains in demonstrations, and the absence of explicit self-correction or iterative refinement during demonstration generation. "Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in LLMs" (2304.11657) presents Iter-CoT, an iterative bootstrapping pipeline for constructing high-quality, self-corrected CoT demonstrations optimized for model performance.

Limitations of Existing CoT Paradigms

The two primary regimes for CoT prompting—manual authoring and automatic generation—exhibit complementary but unresolved limitations. Manual CoT annotation incurs significant cost, introduces human bias, and limits scalability. Automated CoT (e.g., Zero-Shot-CoT) leverages LLM outputs but is sensitive to exemplar correctness and quality; automatic methods often produce demonstrations containing erroneous rationales, which empirically degrade downstream inference accuracy.

Iter-CoT systematically probes these failure modes. It shows that inappropriate exemplar complexity (e.g., overly simple rationales for high-hop queries, or vice versa) sharply reduces the transferability of reasoning (Figure 1), and that increasing rates of erroneous exemplars have a monotonic, negative effect on end-task accuracy (Figure 2). Moreover, prior work fails to utilize feedback or history from failed attempts, missing the opportunity for CoT demonstrations to exploit LLMs' inherent self-correction abilities (Figure 3).

Figure 1: Effect of demonstration complexity on GSM8K for queries requiring varying numbers of reasoning hops.

Figure 2: Degradation of LLM accuracy on GSM8K, CSQA, and Letter tasks with increasing rates of erroneous exemplars.

Iter-CoT: Iterative Bootstrapping for Demonstration Construction

Iter-CoT orchestrates demonstration pool generation as a multi-phase, model-in-the-loop pipeline (see Figure 4):

  1. Initialization: Apply Zero-Shot-CoT to obtain initial rationales and answers; identify incorrectly answered items.
  2. Bootstrapping (Self-Correction): Iteratively prompt the model with error feedback ("Your answer is not right; can you think more carefully and give me the final answer?") on failed cases. Repeat until the generated answer is correct.
  3. Summarization: Prompt the model to provide a final, summarized solution incorporating the complete reasoning history accumulated during bootstrapping.
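The three phases above can be sketched as a simple loop. The following is a minimal, runnable illustration with the LLM call stubbed out; `call_llm`, `extract_answer`, and `build_demonstration` are illustrative names rather than the paper's implementation, and only the feedback prompt string is taken from the paper.

```python
FEEDBACK = ("Your answer is not right; can you think more carefully "
            "and give me the final answer?")

def call_llm(prompt):
    # Stub standing in for a real LLM API call: it answers wrongly at
    # first and "self-corrects" once the feedback appears in the prompt.
    return ("Step by step... so the final answer is 42."
            if FEEDBACK in prompt
            else "Step by step... so the final answer is 41.")

def extract_answer(response):
    # Naive extraction: last token with trailing punctuation stripped.
    return response.split()[-1].rstrip(".")

def build_demonstration(question, gold_answer, max_iters=4):
    """Return (history, summary) once the model answers correctly,
    or None if it never does (the question is deemed unanswerable)."""
    # Phase 1 -- initialization: Zero-Shot-CoT attempt.
    history = [f"Q: {question}\nA: Let's think step by step."]
    response = call_llm("\n".join(history))
    history.append(response)

    # Phase 2 -- bootstrapping: feed back the error signal and retry.
    for _ in range(max_iters):
        if extract_answer(response) == gold_answer:
            break
        history.append(FEEDBACK)
        response = call_llm("\n".join(history))
        history.append(response)
    if extract_answer(response) != gold_answer:
        return None  # never corrected within the budget: discard

    # Phase 3 -- summarization: condense the full correction history
    # into one clean exemplar rationale.
    summary = call_llm("\n".join(history)
                       + "\nPlease summarize the complete solution.")
    return history, summary

demo = build_demonstration("What is 6 * 7?", "42")
```

In a real pipeline the questions whose answers never match (or that the evaluator rejects, in the label-free setting) are dropped, so the demonstration pool ends up containing only "challenging yet answerable" exemplars.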

This sequential refinement exploits the LLM's intrinsic ability to self-correct given appropriate feedback, generating rationales that are not only correct but also more comprehensive and context-aware. Such bootstrapped exemplars are systematically more robust for in-context learning, even when initial attempts are flawed (Figure 5).

Figure 3: Iterative re-answering boosts the rate of correct final answers via self-correction feedback on GSM8K.

Figure 4: Iter-CoT workflow comprising demonstration pool initialization, iterative bootstrapping, and rationalization.

Figure 5: Challenging yet answerable exemplars, refined through iterative revision, enhance LLM generalization.

Empirical Evaluation

Iter-CoT is benchmarked on ten datasets spanning arithmetic (GSM8K, AQuA, AddSub, SingleEq, SVAMP, ASDiv), commonsense (CSQA, StrategyQA, Date Understanding), and symbolic reasoning (Letter Concatenation). The method supports both oracle (label-available) and label-free settings (using a stronger LLM, e.g., GPT-4, as an evaluator for answer correctness). Iter-CoT is evaluated on multiple foundation models: GPT-3.5-turbo, GPT-4, Llama-2-70B, and Llama-2-70B-Chat.
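In the label-free setting, the gold-answer check is replaced by a stronger model acting as judge. A minimal sketch under that assumption follows; `label_free_verify` and the `judge` callable are hypothetical names for illustration, not an API from the paper.

```python
def label_free_verify(question, candidate_answer, judge):
    # Ask a stronger LLM (the `judge` callable, e.g. backed by GPT-4)
    # to grade the answer in place of a gold label.
    prompt = (f"Question: {question}\n"
              f"Proposed answer: {candidate_answer}\n"
              "Is this answer correct? Reply yes or no.")
    return judge(prompt).strip().lower().startswith("yes")

# Stub judge for illustration: it "knows" that 42 is correct.
judge = lambda p: "Yes." if "42" in p else "No."
print(label_free_verify("What is 6 * 7?", "42", judge))  # -> True
print(label_free_verify("What is 6 * 7?", "41", judge))  # -> False
```

The paper's robustness result suggests this judge can be imperfect: demonstration quality degrades gracefully with evaluator noise.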

Key findings include:

  • Iter-CoT outperforms both manual and automated CoT baselines across all categories.
  • In a fully automatic, label-free regime, Iter-CoT remains competitive with oracle-label variants, indicating robustness to evaluator noise.
  • Application of Self-Consistency decoding further amplifies gains across arithmetic and multi-step tasks, with notable deltas (e.g., GSM8K: +8.2%).
  • Ablations confirm that both bootstrapping and summarization are critical: omitting either phase degrades demonstration efficacy and downstream accuracy.
  • Iter-CoT-generated reasoning chains are consistently longer and structurally richer, empirically supporting the claim of enhanced comprehensiveness (see Appendix, Figure 6).
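The Self-Consistency decoding mentioned above samples multiple reasoning chains and majority-votes over their final answers, and its gains compound with Iter-CoT's better exemplars. A minimal sketch of the voting step (the sampled answers are stubbed in place of real chains):

```python
from collections import Counter

def self_consistency(final_answers):
    # Majority vote over the final answers parsed from sampled chains.
    return Counter(final_answers).most_common(1)[0][0]

# Each element stands in for the answer parsed from one sampled
# chain-of-thought; real usage samples the LLM with temperature > 0.
votes = ["18", "18", "17", "18", "20"]
print(self_consistency(votes))  # -> 18
```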


Figure 7: Llama-2-70B-Chat model configuration for Iter-CoT evaluation.

Figure 6: Comparative reasoning chain lengths confirm the increased comprehensiveness of Iter-CoT demonstrations.

Figure 8: Iter-CoT's accuracy as a function of bootstrapping iterations (more bootstrapping yields harder but more informative exemplars).

Figure 9: Performance stratified by reasoning hops (complexity) on GSM8K; Iter-CoT delivers robust performance, especially as hop count increases.

Figure 10: Effect of seed exemplar count on Iter-CoT performance; the method is not strictly dependent on a large number of shots.

Practical and Theoretical Implications

The strong empirical results carry several implications:

  • Iterative, feedback-driven rationale generation should become a standard in automatic CoT pipeline design. The model's own capacity for self-correction and contextual rationalization surpasses traditional one-shot or single-pass approaches in demonstration pool quality.
  • Bootstrapped demonstrations alleviate the need for costly hand annotations, democratizing high-quality CoT pipeline construction for new tasks, especially in label-limited settings.
  • Selection of exemplars at intermediate difficulty supports better generalization, but including revised, error-corrected demonstrations is beneficial—not all faulty samples should be discarded. This is a departure from prior dogma that only perfect demonstrations should be retained during in-context learning.

The approach is model-agnostic and shows consistent gains on both proprietary (GPT-x) and open-source (Llama-2) LLMs, highlighting its transferability. However, it does introduce additional cost in constructing the demonstration pool, mainly during the iterative bootstrapping phase.

Future Directions

The Iter-CoT paradigm opens several avenues for further study:

  • Meta-learning for demonstration pool size and exemplar selection: Systematic exploration of demonstration pool composition (e.g., optimizing for coverage over reasoning types) could further improve generalization and reduce pool construction time.
  • Active learning integration: More efficient identification of "challenging yet answerable" exemplars could leverage uncertainty-based or dual-model selection approaches.
  • Evaluator model improvement: In label-free regimes, advances in auto-verification can increase robustness and minimize evaluator-induced bias in demonstration acceptance.

Conclusion

Iter-CoT establishes iterative bootstrapping, combined with self-correction and contextual summarization, as a substantive advance in constructing effective CoT prompts for LLMs. The framework achieves consistent, often state-of-the-art results across diverse task families and foundation models, and demonstrates that leveraging model-internal correction signals yields higher-quality, structurally richer demonstrations. Its generality and performance suggest that feedback-driven, multi-phase rationale construction should become a core principle in automatic CoT prompt engineering.

