Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models (2304.11657v3)

Published 23 Apr 2023 in cs.CL and cs.AI

Abstract: LLMs can achieve highly effective performance on various reasoning tasks by incorporating step-by-step chain-of-thought (CoT) prompting as demonstrations. However, the reasoning chains of demonstrations generated by LLMs are prone to errors, which can subsequently lead to incorrect reasoning during inference. Furthermore, inappropriate exemplars (overly simplistic or complex) can affect overall performance across varying levels of difficulty. We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains. By utilizing iterative bootstrapping, our approach enables LLMs to autonomously rectify errors, resulting in more precise and comprehensive reasoning chains. Simultaneously, our approach selects challenging yet answerable questions accompanied by reasoning chains as exemplars with a moderate level of difficulty, which enhances the LLMs' generalizability across varying levels of difficulty. Experimental results indicate that Iter-CoT exhibits superiority, achieving competitive performance across three distinct reasoning tasks on ten datasets.

Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in LLMs

The paper "Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in LLMs" by Jiashuo Sun et al., introduces Iterative Chain-of-Thought (Iter-CoT) prompting, a novel approach aimed at improving reasoning in LLMs. This research addresses notable challenges in existing CoT prompting techniques, particularly errors in generated reasoning chains and the selection of exemplars of varying difficulty, which can impede LLM performance on reasoning tasks.

Overview of the Iter-CoT Approach

Iter-CoT is designed with two primary enhancements: iterative error correction (bootstrapping) and selection of appropriate exemplars. The bootstrapping aspect enables LLMs to autonomously rectify mistakes in reasoning chains, enhancing accuracy and consistency. This process leverages hints and contextual information from previous errors to guide the model towards better conclusions.
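The bootstrapping loop can be pictured as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate_cot` and the hint wording are hypothetical stand-ins for an LLM call and the paper's prompting strategy, and the retry budget is illustrative.

```python
from typing import Callable, Optional

def bootstrap_reasoning(
    question: str,
    gold_answer: str,
    generate_cot: Callable[[str], tuple[str, str]],  # hypothetical LLM wrapper: prompt -> (chain, answer)
    max_rounds: int = 3,                              # illustrative retry budget, not from the paper
) -> Optional[tuple[str, str, int]]:
    """Iteratively re-prompt the model with a hint until its answer matches the label.

    Returns (reasoning_chain, answer, rounds_used) on success, or None if the
    question is never answered correctly within max_rounds.
    """
    prompt = f"Q: {question}\nLet's think step by step."
    for round_idx in range(1, max_rounds + 1):
        chain, answer = generate_cot(prompt)
        if answer.strip() == gold_answer.strip():
            return chain, answer, round_idx
        # Feed the earlier (incorrect) chain back with a hint so the model can revise it.
        prompt = (
            f"Q: {question}\n"
            f"Your previous reasoning was:\n{chain}\n"
            f"The answer {answer} is not correct. Please re-examine the steps and try again."
        )
    return None
```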

The second enhancement focuses on exemplar selection. Iter-CoT deliberately selects questions that are challenging yet answerable, which improves the generalization of LLMs across questions of varying difficulty. The underlying assumption is that exemplars of moderate difficulty are the most useful demonstrations: overly simplistic or overly complex ones contribute little to in-context learning.
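One way to operationalize "challenging yet answerable" is to keep only the questions the model got wrong initially but corrected within a few bootstrapping rounds. The thresholds and the preference for harder exemplars below are illustrative assumptions rather than values taken from the paper.

```python
def select_exemplars(bootstrap_results, k=8, min_rounds=2, max_rounds=3):
    """Pick moderately difficult exemplars from bootstrapping outcomes.

    bootstrap_results: list of dicts like
        {"question": ..., "chain": ..., "answer": ..., "rounds": int or None}
    where rounds is None if the model never reached the correct answer.
    Questions solved on the first try are treated as too easy; questions never
    solved are treated as too hard (or unanswerable) and are excluded.
    """
    candidates = [
        r for r in bootstrap_results
        if r["rounds"] is not None and min_rounds <= r["rounds"] <= max_rounds
    ]
    # Illustrative choice: prefer exemplars that needed more revision
    # (harder, but still answerable), then keep the top k.
    candidates.sort(key=lambda r: r["rounds"], reverse=True)
    return candidates[:k]
```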

Key Results and Contributions

The experimental results support the efficacy of Iter-CoT. Applied to ten datasets spanning arithmetic, commonsense, and symbolic reasoning tasks, Iter-CoT outperformed existing methods. In particular, Iter-CoT achieved state-of-the-art results both with and without label availability; the label-free Iter-CoT (w/o label) variant remained competitive, highlighting the method's robustness.

One significant observation was Iter-CoT's superior performance compared to approaches requiring manual annotations. Across multiple reasoning datasets, including GSM8K and Letter Concatenation, Iter-CoT delivered notable accuracy improvements, particularly when combined with Self-Consistency (SC). SC augments Iter-CoT by sampling multiple reasoning paths for each question and selecting the final answer by majority vote, which favors the most plausible answer.
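Self-Consistency itself is straightforward to sketch: sample several reasoning paths at a non-zero temperature and return the most frequent final answer. The `sample_answer` callable below is a hypothetical stand-in for a sampled LLM call; the sample count is an assumption.

```python
from collections import Counter
from typing import Callable

def self_consistency(prompt: str,
                     sample_answer: Callable[[str], str],
                     num_samples: int = 10) -> str:
    """Sample multiple answers for one prompt and return the majority-vote winner."""
    answers = [sample_answer(prompt) for _ in range(num_samples)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```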

Implications and Speculations

The practical implications of Iter-CoT are substantial, especially in domains where accurate reasoning is crucial, such as automated theorem proving, scientific research, and educational tools. By enhancing LLM resilience to errors and refining exemplar selection, Iter-CoT fosters more reliable reasoning capabilities, which could be pivotal in formalizing AI's role in knowledge processing and discovery.

Theoretically, Iter-CoT underscores the importance of iterative learning and contextual adaptation in AI. The approach aligns with the broader trend toward self-improvement mechanisms and suggests that future LLMs might leverage iterative refinement to handle increasingly complex and dynamic reasoning problems.

Conclusion

Overall, the paper contributes valuable insights into CoT prompting, advancing LLM reasoning through systematic error correction and exemplar selection. Iter-CoT represents a forward-looking strategy that may guide future research toward increasingly autonomous improvement methods. Continued exploration in this direction could yield further gains in the ability of LLMs to perform complex reasoning tasks reliably and efficiently.

Authors (7)
  1. Jiashuo Sun (11 papers)
  2. Yi Luo (153 papers)
  3. Yeyun Gong (78 papers)
  4. Chen Lin (75 papers)
  5. Yelong Shen (83 papers)
  6. Jian Guo (76 papers)
  7. Nan Duan (172 papers)
Citations (15)