
Benchmarking Large Language Model Capabilities for Conditional Generation (2306.16793v1)

Published 29 Jun 2023 in cs.CL

Abstract: Pre-trained language models (PLMs) underlie most new developments in natural language processing. They have shifted the field from application-specific model pipelines to a single model that is adapted to a wide range of tasks. Autoregressive PLMs like GPT-3 or PaLM, alongside techniques like few-shot learning, have additionally shifted the output modality to generation instead of classification or regression. Despite their ubiquitous use, the generation quality of LLMs is rarely evaluated when these models are introduced. Additionally, it is unclear how existing generation tasks--while they can be used to compare systems at a high level--relate to the real-world use cases for which people have been adopting them. In this work, we discuss how to adapt existing application-specific generation benchmarks to PLMs and provide an in-depth, empirical study of the limitations and capabilities of PLMs in natural language generation tasks along dimensions such as scale, architecture, input and output language. Our results show that PLMs differ in their applicability to different data regimes and their generalization to multiple languages, and inform which PLMs to use for a given generation task setup. We share best practices to be taken into consideration when benchmarking generation capabilities during the development of upcoming PLMs.
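Benchmarking generation quality of the kind described above relies on automatic metrics such as ROUGE and chrF. As a minimal illustration (not part of the paper's released code), the sketch below implements a simplified character n-gram F-score in the spirit of chrF; the defaults (n-grams up to length 6, beta = 2) follow the metric's standard definition, but this stripped-down version omits options such as word n-gram mixing that full implementations provide.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average character n-gram precision and recall,
    combined into an F-beta score (beta=2 weights recall over precision)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # no n-grams of this length in one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A production evaluation would use a maintained implementation (e.g. the one shipped with sacreBLEU) rather than this sketch, since exact tokenization and whitespace handling affect reported scores.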

Authors (3)
  1. Joshua Maynez
  2. Priyanka Agrawal
  3. Sebastian Gehrmann
Citations (24)