Schroedinger's Threshold: When the AUC doesn't predict Accuracy (2404.03344v2)
Abstract: The Area Under the Curve measure (AUC) seems apt for evaluating and comparing diverse models, possibly without calibration. An important application of the AUC is the evaluation and benchmarking of models that predict the faithfulness of generated text. However, we show that the AUC yields an academic and optimistic notion of accuracy that can misalign with the actual accuracy observed in application, causing significant changes in benchmark rankings. To paint a more realistic picture of downstream model performance (and to prepare a model for actual application), we explore different calibration modes, testing both calibration data and method.
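The gap the abstract describes can be made concrete with a small sketch (not from the paper; the data and the threshold of 0.75 are illustrative assumptions): a scorer can rank every positive above every negative (AUC = 1.0) while all of its raw scores sit above the naive 0.5 decision threshold, so thresholded accuracy is no better than chance until the threshold is calibrated.

```python
def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 * (p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Accuracy after binarizing scores at the given threshold."""
    return sum((s >= threshold) == bool(l) for s, l in zip(scores, labels)) / len(labels)

# Hypothetical faithfulness scores: ranking is perfect, but every score
# exceeds 0.5, so the default threshold predicts "faithful" for everything.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.60, 0.65, 0.70, 0.80, 0.85, 0.90]

print(auc(labels, scores))                       # 1.0  (perfect ranking)
print(accuracy(labels, scores, threshold=0.5))   # 0.5  (chance-level accuracy)
print(accuracy(labels, scores, threshold=0.75))  # 1.0  (after tuning the threshold)
```

Choosing that threshold on held-out data is the simplest form of the calibration the abstract alludes to; methods such as Platt scaling go further and remap the scores themselves into calibrated probabilities.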