Towards A Unified View of Answer Calibration for Multi-Step Reasoning (2311.09101v3)
Abstract: LLMs employing Chain-of-Thought (CoT) prompting have broadened the scope for improving multi-step reasoning capabilities. We generally divide multi-step reasoning into two phases: path generation, which produces one or more reasoning paths, and answer calibration, which post-processes the reasoning path(s) to obtain a final answer. However, the existing literature lacks a systematic analysis of different answer calibration approaches. In this paper, we summarize a taxonomy of recent answer calibration techniques, breaking them down into step-level and path-level strategies. We then conduct a thorough evaluation of these strategies from a unified view, systematically scrutinizing step-level and path-level answer calibration across multiple paths. Experimental results reveal that combining the strengths of both strategies tends to yield the best outcomes. Our study holds the potential to illuminate key insights for optimizing multi-step reasoning with answer calibration.
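The two calibration levels the abstract distinguishes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: path-level calibration is shown as self-consistency-style majority voting over final answers, and step-level calibration as filtering paths through a step verifier before voting. The `verify_step` callable is a hypothetical stand-in for whatever step-scoring model a system uses.

```python
from collections import Counter

def calibrate_path_level(answers):
    # Path-level calibration (self-consistency style): take the most
    # frequent final answer across the sampled reasoning paths.
    return Counter(answers).most_common(1)[0][0]

def calibrate_step_level(paths, verify_step):
    # Step-level calibration sketch: keep only paths whose every
    # intermediate step passes the (hypothetical) verifier, then vote
    # over the survivors; fall back to all paths if none survive.
    survivors = [p for p in paths if all(verify_step(s) for s in p["steps"])]
    pool = survivors if survivors else paths
    return calibrate_path_level([p["answer"] for p in pool])
```

For example, with sampled answers `["18", "18", "17"]`, path-level voting returns `"18"`; step-level filtering can overturn a majority if the majority's paths contain steps the verifier rejects.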