Mitigating Misleading Chain-of-Thought Reasoning with Selective Filtering (2403.19167v1)
Abstract: LLMs have demonstrated remarkable capabilities by leveraging chain-of-thought (CoT) reasoning techniques to solve intricate questions through step-by-step reasoning chains. Despite this success, the efficacy of such reasoning is inherently contingent on the quality of the CoT. However, flawless CoT reasoning cannot be guaranteed due to the presence of indecomposable questions and the potential for erroneous reasoning chains, particularly in the case of small-scale LLMs. To tackle this challenge, we propose a novel approach called the selective filtering reasoner (SelF-Reasoner), which assesses the entailment relationship between the question and the candidate reasoning chain. We proceed with CoT reasoning when the filter is confident in the reasoning chain; otherwise, we predict the answer directly. SelF-Reasoner consistently improves over the fine-tuned T5 baseline on the ScienceQA, ECQA, and LastLetter tasks. Code is available at https://github.com/LibroWu/SelF-Reasoner.
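The abstract describes a simple decision rule: generate a candidate CoT, score it with a filter, and fall back to direct answering when the filter is not confident. Below is a minimal sketch of that control flow, assuming hypothetical helpers (`generate_cot`, `filter_score`, `answer_with_cot`, `answer_directly`) that stand in for the paper's fine-tuned T5 reasoner and entailment filter; it is an illustration of the idea, not the authors' implementation.

```python
# Sketch of the selective-filtering loop described in the abstract.
# All helper functions below are hypothetical placeholders, not the paper's API.

def self_reasoner(question: str, threshold: float = 0.5) -> str:
    """Answer a question, using a CoT chain only when the filter trusts it."""
    # 1. Propose a candidate step-by-step reasoning chain (e.g. a fine-tuned T5 reasoner).
    chain = generate_cot(question)

    # 2. Score whether the chain is entailed by / consistent with the question
    #    (e.g. a small classifier trained to judge question-chain entailment).
    confidence = filter_score(question, chain)

    # 3. Keep the chain only if the filter is confident; otherwise skip the
    #    possibly misleading chain and predict the answer directly.
    if confidence >= threshold:
        return answer_with_cot(question, chain)
    return answer_directly(question)
```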