CriticBench: Benchmarking LLMs for Critique-Correct Reasoning (2402.14809v4)

Published 22 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The ability of LLMs to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs' abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning. Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing dynamic, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.
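The abstract describes the GQC (generation, critique, correction) evaluation protocol only at a high level. The sketch below is a minimal, hypothetical Python illustration of how such a three-stage evaluation could be scored; the `query_llm` helper, the prompt wording, and the exact-match grading are assumptions made for illustration and are not the authors' actual pipeline.

```python
# Minimal sketch of a GQC-style evaluation loop (hypothetical, not the paper's code).
# `query_llm` is a placeholder for any chat-completion call; prompts and grading are simplified.

from dataclasses import dataclass


@dataclass
class GQCResult:
    generation_correct: bool
    critique_correct: bool
    correction_correct: bool


def query_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    raise NotImplementedError


def is_correct(answer: str, reference: str) -> bool:
    """Naive exact-match check; real benchmarks use task-specific evaluators."""
    return answer.strip().lower() == reference.strip().lower()


def evaluate_gqc(question: str, reference: str, candidate_answer: str) -> GQCResult:
    # Generation: the model answers the question from scratch.
    generation = query_llm(
        f"Question: {question}\nAnswer step by step, then give the final answer."
    )

    # Critique: the model judges whether a given candidate answer is correct.
    critique = query_llm(
        f"Question: {question}\nCandidate answer: {candidate_answer}\n"
        "Is this answer correct? Reply 'correct' or 'incorrect' with a brief reason."
    )
    candidate_is_right = is_correct(candidate_answer, reference)
    model_judged_right = "incorrect" not in critique.lower()
    critique_correct = model_judged_right == candidate_is_right

    # Correction: the model revises the candidate answer given its own critique.
    correction = query_llm(
        f"Question: {question}\nCandidate answer: {candidate_answer}\n"
        f"Critique: {critique}\nProvide a corrected final answer."
    )

    return GQCResult(
        generation_correct=is_correct(generation, reference),
        critique_correct=critique_correct,
        correction_correct=is_correct(correction, reference),
    )
```

In the benchmark itself, candidate responses come from three LLM families and results are broken down per task domain across 15 datasets; the sketch above only shows the shape of a single generation-critique-correction pass.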

Authors (6)
  1. Zicheng Lin (7 papers)
  2. Zhibin Gou (15 papers)
  3. Tian Liang (50 papers)
  4. Ruilin Luo (9 papers)
  5. Haowei Liu (13 papers)
  6. Yujiu Yang (155 papers)
Citations (16)