CriticEval: Evaluating Large Language Model as Critic (2402.13764v5)
Abstract: Critique ability, i.e., the capability of LLMs to identify and rectify flaws in responses, is crucial for their application to self-improvement and scalable oversight. While numerous studies have been proposed to evaluate the critique ability of LLMs, their comprehensiveness and reliability remain limited. To overcome this problem, we introduce CriticEval, a novel benchmark designed to comprehensively and reliably evaluate the critique ability of LLMs. To ensure comprehensiveness, CriticEval evaluates critique ability along four dimensions across nine diverse task scenarios, covering both scalar-valued and textual critiques of responses of varying quality. To ensure reliability, a large number of critiques are annotated to serve as references, enabling GPT-4 to evaluate textual critiques reliably. Extensive evaluations of open-source and closed-source LLMs first validate the reliability of evaluation in CriticEval. Experimental results then demonstrate the promising potential of open-source LLMs, the effectiveness of critique datasets, and several intriguing relationships between critique ability and key factors, including task type, response quality, and critique dimension.
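The abstract describes two complementary evaluation modes: scalar-valued critiques, which can be compared directly against human scores, and textual critiques, which are graded by GPT-4 with the help of human-annotated reference critiques. The sketch below is a hypothetical illustration of such a protocol, not the CriticEval implementation: the prompt wording, rating scale, correlation choice, and helper names are assumptions made for clarity.

```python
# Minimal sketch (not the official CriticEval code) of the two evaluation
# modes described in the abstract, under assumed, simplified protocols:
#   1) scalar-valued critiques: Spearman correlation with human scores;
#   2) textual critiques: GPT-4 grades a critique against a reference.

from scipy.stats import spearmanr
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def scalar_agreement(model_scores: list[float], human_scores: list[float]) -> float:
    """Spearman correlation between model-assigned and human-annotated scores."""
    rho, _ = spearmanr(model_scores, human_scores)
    return rho


def judge_textual_critique(task: str, response: str,
                           critique: str, reference_critique: str) -> str:
    """Ask GPT-4 to rate a critique of `response`, anchored on a reference critique."""
    prompt = (
        f"Task:\n{task}\n\n"
        f"Response under review:\n{response}\n\n"
        f"Candidate critique:\n{critique}\n\n"
        f"Reference critique (human-annotated):\n{reference_critique}\n\n"
        "Rate the candidate critique from 1 (poor) to 10 (excellent) for how well "
        "it identifies and explains the response's flaws relative to the reference. "
        "Reply with the rating only."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content
```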
Authors: Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen, Xian-Ling Mao