
TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks (2310.00752v4)

Published 1 Oct 2023 in cs.CL and cs.AI

Abstract: We present TIGERScore, a \textbf{T}rained metric that follows \textbf{I}nstruction \textbf{G}uidance to perform \textbf{E}xplainable and \textbf{R}eference-free evaluation over a wide spectrum of text generation tasks. Unlike other automatic evaluation methods that only provide arcane scores, TIGERScore is guided by natural language instructions to produce an error analysis that pinpoints the mistakes in the generated text. Our metric is based on LLaMA-2 and trained on our meticulously curated instruction-tuning dataset MetricInstruct, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples of the form (instruction, input, system output $\rightarrow$ error analysis). We collected the `system outputs' from a large variety of models to cover different types of errors. To quantitatively assess our metric, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, showing that TIGERScore achieves the open-source SoTA correlation with human ratings across these datasets and nearly matches the GPT-4 evaluator. As a reference-free metric, its correlation can even surpass the best existing reference-based metrics. To qualitatively assess the rationales generated by our metric, we conduct a human evaluation of the generated explanations and find that they are 70.8\% accurate. Through these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task. All the resources are released on our project website: \url{https://tiger-ai-lab.github.io/TIGERScore/}.
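
The abstract frames evaluation as conditional generation: the evaluator consumes an (instruction, input, system output) triple and generates a natural-language error analysis instead of a bare score. Below is a minimal sketch of that usage pattern, assuming the released model can be loaded as a causal LM through Hugging Face transformers; the checkpoint name (`TIGER-Lab/TIGERScore-7B`) and the prompt template are illustrative assumptions, not the official TIGERScore API.

```python
# Sketch of reference-free, explainable evaluation in the spirit of TIGERScore:
# prompt a LLaMA-2-based evaluator with (instruction, input, system output)
# and decode a natural-language error analysis.
# The checkpoint name and prompt wording below are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TIGER-Lab/TIGERScore-7B"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def error_analysis(instruction: str, source: str, hypothesis: str) -> str:
    """Ask the evaluator to pinpoint and explain errors in `hypothesis`."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Input: {source}\n"
        f"System output: {hypothesis}\n"
        "Analyze the errors in the system output and explain each one."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Strip the echoed prompt; return only the generated analysis.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

print(error_analysis(
    "Summarize the article in one sentence.",
    "The city council approved the new transit budget on Monday...",
    "The council rejected the budget on Friday.",
))
```

Turning the generated analysis into a numeric score (for example, by summing per-error penalties parsed from the text) would be a further assumption on top of this sketch.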

Authors (6)
  1. Dongfu Jiang (14 papers)
  2. Yishan Li (9 papers)
  3. Ge Zhang (170 papers)
  4. Wenhao Huang (98 papers)
  5. Bill Yuchen Lin (72 papers)
  6. Wenhu Chen (134 papers)
Citations (40)