DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation (2405.15329v3)
Abstract: The rapid progress of LLM research has opened up new possibilities for evaluating generated text. LLMs can serve as scalable and economical evaluators, but how reliable these evaluators are has emerged as a crucial research question. Prior work on the meta-evaluation of LLMs as judges typically prompts an LLM only once to obtain a final evaluation decision and then computes the agreement between the LLM's outputs and human labels. This offers little interpretability into LLMs' evaluation capability. In light of this challenge, we propose Decompose and Aggregate, which breaks the evaluation process into stages modeled on pedagogical practices. Our experiments illustrate that it not only provides a more interpretable window into how well LLMs evaluate, but also yields improvements of up to 39.6% for different LLMs across a variety of meta-evaluation benchmarks.
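The staged pipeline described in the abstract can be pictured with a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: `chat` stands in for any LLM completion call, and the prompt wording, criterion count, and uniform aggregation weights are assumptions made for the example.

```python
# Minimal sketch of a decompose-then-aggregate pairwise judge.
# `chat(prompt)` is a placeholder for any LLM completion call (assumption);
# the prompts and weighting scheme are illustrative, not the paper's exact ones.
from typing import Callable, Dict, List


def decompose_criteria(chat: Callable[[str], str], task: str) -> List[str]:
    """Stage 1: ask the LLM to break the evaluation task into criteria."""
    reply = chat(
        f"List 3-5 evaluation criteria, one per line, for judging responses to:\n{task}"
    )
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


def score_per_criterion(chat: Callable[[str], str], task: str,
                        answer: str, criteria: List[str]) -> Dict[str, float]:
    """Stage 2: score a candidate answer on each criterion separately (1-10)."""
    scores: Dict[str, float] = {}
    for criterion in criteria:
        reply = chat(
            f"Task: {task}\nAnswer: {answer}\n"
            f"Rate the answer on '{criterion}' from 1 to 10. Reply with a number only."
        )
        try:
            scores[criterion] = float(reply.strip())
        except ValueError:
            scores[criterion] = 5.0  # fall back to a neutral score on parse failure
    return scores


def aggregate(scores_a: Dict[str, float], scores_b: Dict[str, float],
              weights: Dict[str, float]) -> str:
    """Stage 3: combine per-criterion scores into a final pairwise verdict."""
    total_a = sum(weights.get(c, 1.0) * s for c, s in scores_a.items())
    total_b = sum(weights.get(c, 1.0) * s for c, s in scores_b.items())
    if total_a == total_b:
        return "tie"
    return "A" if total_a > total_b else "B"
```

The sketch mirrors the core idea named in the title: each criterion is judged in its own step before the scores are combined, rather than asking the model for a single holistic verdict in one prompt.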
Authors: Minzhi Li, Zhengyuan Liu, Shumin Deng, Shafiq Joty, Nancy F. Chen, Min-Yen Kan