SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation (2310.00074v3)
Abstract: To comprehensively gauge the capacity of current models for complex reasoning, it is crucial to assess their step-by-step reasoning in a scalable manner. Established reference-based evaluation metrics rely on human-annotated reasoning chains as references to assess model-derived chains. However, such "gold-standard" human-written reasoning chains may not be unique, and their acquisition is often labor-intensive. Existing reference-free reasoning evaluation metrics, while eliminating the need for human-crafted reasoning chains as references, often require fine-tuning with human-derived chains before evaluation, which complicates the process and raises questions about their adaptability to other datasets. To address these challenges, we harness GPT-4 to automatically evaluate reasoning chain quality, thereby removing the dependency on human-written reasoning chains for both model fine-tuning and evaluation. Leveraging the Socratic method, we develop SocREval (Socratic Method-Inspired Reasoning Evaluation), a novel approach to prompt design for reference-free reasoning evaluation. Empirical results on four human-annotated datasets reveal that SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics. Beyond its demonstrated efficacy, SocREval proves to be both cost-efficient and robust to variations in prompt writing and example selection, as substantiated by our in-depth analysis.
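To make the setup concrete, the sketch below shows how reference-free scoring of a reasoning chain with GPT-4 might look in practice: the evaluator is asked to work out its own reasoning first and then judge the candidate chain against it, with no human-written reference involved. The prompt wording, the 0-10 scale, and the `score_reasoning_chain` helper are illustrative assumptions for this example, not the paper's actual SocREval prompt.

```python
# Minimal sketch of reference-free reasoning-chain scoring with GPT-4.
# NOTE: the prompt below is illustrative only; it is NOT the paper's
# actual SocREval prompt, and the 0-10 scale is an assumption here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_reasoning_chain(question: str, answer: str, chain: str) -> str:
    """Ask GPT-4 to rate a model-produced reasoning chain without any
    human-written reference chain."""
    prompt = (
        "You are evaluating the quality of a step-by-step reasoning chain.\n"
        "First, briefly work out your own reasoning for the question, then\n"
        "compare the candidate chain against it and rate the candidate from\n"
        "0 (invalid) to 10 (sound and complete). End with 'Score: N'.\n\n"
        f"Question: {question}\n"
        f"Candidate answer: {answer}\n"
        f"Candidate reasoning chain:\n{chain}\n"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the scoring as deterministic as possible
    )
    return response.choices[0].message.content
```

Note that no reference chain appears anywhere in the call: the evaluator's self-generated reasoning plays that role, which is what lets the method skip both human annotation and fine-tuning.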