
Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models (2404.11500v1)

Published 17 Apr 2024 in cs.CL and cs.AI

Abstract: This paper studies the relationship between the surface form of a mathematical problem and its solvability by LLMs. We find that subtle alterations in surface form can significantly shift the answer distribution and the solve rate, exposing LLMs' lack of robustness and their sensitivity to surface form when reasoning through complex problems. To improve mathematical reasoning performance, we propose Self-Consistency-over-Paraphrases (SCoP), which diversifies reasoning paths by sampling them from multiple surface forms of the same problem rather than from a single one. We evaluate our approach on four mathematical reasoning benchmarks across three LLMs and show that SCoP outperforms vanilla self-consistency, particularly on problems initially deemed unsolvable. Finally, we provide additional experiments and discussion on problem difficulty and surface form, including cross-model agreement on difficulty, the transferability of paraphrases across models, and Variance of Variations (VOV), a metric for LLM evaluation.
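Mechanically, SCoP extends self-consistency by spreading the sampling budget over paraphrases of the problem instead of drawing every reasoning chain from one surface form, then majority-voting over the final answers. The sketch below is a minimal reading of the abstract, not the paper's exact recipe: the paraphrase and solve callables are hypothetical stand-ins for an LLM-based paraphraser and a sampled chain-of-thought solver, and the even split of samples across surface forms is an assumption.

    from collections import Counter
    from typing import Callable, List

    def scop(
        problem: str,
        paraphrase: Callable[[str, int], List[str]],  # hypothetical: returns n paraphrases of the problem
        solve: Callable[[str], str],                   # hypothetical: samples one reasoning chain, returns its final answer
        k: int = 4,                                    # number of surface forms (original + k-1 paraphrases); assumed
        m: int = 10,                                   # reasoning samples per surface form; assumed even split of budget
    ) -> str:
        """Self-Consistency-over-Paraphrases: majority vote over answers
        sampled from several surface forms of the same problem."""
        # Diversify the surface form first: the original problem plus k-1 paraphrases.
        surface_forms = [problem] + paraphrase(problem, k - 1)
        # Sample m reasoning chains per surface form; collect only the final answers.
        answers = [solve(form) for form in surface_forms for _ in range(m)]
        # Vanilla self-consistency would vote over k*m samples of one surface form;
        # SCoP votes over the same budget spread across paraphrases.
        return Counter(answers).most_common(1)[0][0]

Under this reading, vanilla self-consistency is the special case k = 1, which makes the comparison in the abstract a like-for-like vote over the same total number of sampled chains.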
