DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models (2401.02132v1)
Abstract: Evaluating the quality and variability of text generated by LLMs poses a significant, yet unresolved, research challenge. Traditional evaluation methods, such as ROUGE and BERTScore, which measure token similarity, often fail to capture holistic semantic equivalence. This results in a low correlation with human judgments and intuition, which is especially problematic in high-stakes applications like healthcare and finance, where reliability, safety, and robust decision-making are critical. This work proposes DCR, an automated framework for evaluating and improving the consistency of LLM-generated texts using a divide-conquer-reasoning approach. Unlike existing LLM-based evaluators that operate at the paragraph level, our method employs a divide-and-conquer evaluator (DCE) that breaks down the paragraph-to-paragraph comparison between two generated responses into individual sentence-to-paragraph comparisons, each evaluated against predefined criteria. To facilitate this approach, we introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score. Beyond consistency evaluation, we further present a reason-assisted improver (RAI) that leverages the analytical reasons and explanations identified by DCE to generate new responses aimed at reducing these inconsistencies. Through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +19.3% and +24.3% on the SummEval dataset) in evaluating the consistency of LLM generation across multiple benchmarks in semantic, factual, and summarization consistency tasks. Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
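To make the pipeline concrete, below is a minimal Python sketch of the three components (DCE, AMC, RAI), assuming a generic `call_llm(prompt) -> str` hook for whatever chat model serves as the backbone. The prompt wording, the yes/no verdict parsing, and the mapping of per-sentence verdicts to a [0, 1] score are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of the DCR pipeline: DCE (sentence-level checks),
# AMC (verdicts -> numeric score), RAI (reason-guided rewriting).
import re
from typing import Callable, List, Tuple

Verdict = Tuple[str, bool, str]  # (sentence, is_consistent, reason)


def split_sentences(paragraph: str) -> List[str]:
    """Naive sentence splitter; the segmentation used in the paper may differ."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def dce(candidate: str, reference: str, criteria: str,
        call_llm: Callable[[str], str]) -> List[Verdict]:
    """Divide-and-conquer evaluator: judge each candidate sentence against the
    full reference paragraph under the given criteria, collecting a verdict
    plus a short reason."""
    verdicts = []
    for sent in split_sentences(candidate):
        prompt = (
            f"Criteria:\n{criteria}\n\n"
            f"Reference paragraph:\n{reference}\n\n"
            f"Sentence:\n{sent}\n\n"
            "Is the sentence consistent with the reference according to the "
            "criteria? Answer 'yes' or 'no', then give a one-sentence reason."
        )
        reply = call_llm(prompt)
        verdicts.append((sent, reply.strip().lower().startswith("yes"), reply))
    return verdicts


def amc(verdicts: List[Verdict]) -> float:
    """Automatic metric converter: map each verdict to +1/-1 and rescale the
    mean to [0, 1] (1.0 = fully consistent); this normalization is an assumption."""
    if not verdicts:
        return 1.0
    signs = [1 if ok else -1 for _, ok, _ in verdicts]
    return (sum(signs) + len(signs)) / (2 * len(signs))


def rai(candidate: str, reference: str, verdicts: List[Verdict],
        call_llm: Callable[[str], str]) -> str:
    """Reason-assisted improver: rewrite the candidate using the reasons
    attached to the sentences flagged as inconsistent."""
    reasons = [reason for _, ok, reason in verdicts if not ok]
    if not reasons:
        return candidate  # nothing to fix
    prompt = (
        f"Reference paragraph:\n{reference}\n\n"
        f"Draft response:\n{candidate}\n\n"
        "The following inconsistencies were found:\n- " + "\n- ".join(reasons) +
        "\n\nRewrite the draft so that it is fully consistent with the reference."
    )
    return call_llm(prompt)
```

Under these assumptions, one evaluate-then-improve pass computes `amc(dce(candidate, reference, criteria, call_llm))` for the score and calls `rai` only when at least one sentence is flagged; the pass can be iterated until the score stops improving.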
- Yuan Zhang, Jason Baldridge, and Luheng He. PAWS: Paraphrase adversaries from word scrambling. arXiv preprint arXiv:1904.01130, 2019.
- SMART: Sentences as basic units for text evaluation. In The Eleventh International Conference on Learning Representations, 2022.
- METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005.
- Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
- Two failures of self-consistency in the multi-step reasoning of LLMs. arXiv preprint arXiv:2305.14279, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
- Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495, 2023.
- FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. arXiv preprint arXiv:2005.03754, 2020.
- Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012–1031, 2021.
- SummEval: Re-evaluating summarization evaluation. arXiv preprint arXiv:2007.12626, 2021.
- A fine-grained analysis of BERTScore. In Proceedings of the Sixth Conference on Machine Translation, pp. 507–517, 2021.
- Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28, 2015.
- First Quora dataset release: Question pairs. 2017.
- Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.
- GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023.
- Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- Evaluating open-domain question answering in the era of large language models. arXiv preprint arXiv:2305.06984, 2023.
- Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664, 2023.
- Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- HaluEval: A large-scale hallucination evaluation benchmark for large language models. arXiv e-prints, arXiv:2305, 2023.
- Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, 2004.
- RETA-LLM: A retrieval-augmented large language model toolkit. arXiv preprint arXiv:2306.05212, 2023a.
- GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023b.
- Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
- SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023.
- Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv preprint arXiv:2305.15852, 2023.
- BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
- Maja Popović. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–395, 2015.
- A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023.
- EED: Extended edit distance measure for machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 514–520, 2019.
- BEER: Better evaluation as ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 414–419, 2014.
- Evaluating the factual consistency of large language models through summarization. arXiv preprint arXiv:2211.08412, 2022.
- Asking and answering questions to evaluate the factual consistency of summaries. arXiv preprint arXiv:2004.04228, 2020.
- Is ChatGPT a good NLG evaluator? A preliminary study. arXiv preprint arXiv:2303.04048, 2023.
- CharacTER: Translation edit rate on character level. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 505–510, 2016.
- Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. arXiv preprint arXiv:2304.13712, 2023.
- BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
- SAC³: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023a.
- How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023b.
- BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2020.
- MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. arXiv preprint arXiv:1909.02622, 2019.
- Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197, 2022.
- Navigating the grey area: Expressions of overconfidence and uncertainty in language models. arXiv preprint arXiv:2302.13439, 2023.
- Wendi Cui
- Jiaxin Zhang
- Zhuohang Li
- Lopez Damien
- Kamalika Das
- Bradley Malin
- Sricharan Kumar