Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework (2410.18653v3)
Abstract: Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful large language models (LLMs). However, evaluating the quality of these models and the employed decoding strategies remains challenging because of trade-offs among widely used metrics such as coherence, diversity, and perplexity. This paper addresses the problem of multicriteria evaluation for open-ended text generation, proposing novel methods for both relative and absolute rankings of decoding methods. Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric that balances existing automatic indicators, providing a more holistic evaluation of text generation quality. Our experiments demonstrate that the proposed approaches offer a robust way to compare decoding strategies and serve as valuable tools to guide model selection for open-ended text generation tasks. We suggest future directions for improving evaluation methodologies in text generation and make our code, datasets, and models publicly available.
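The abstract's partial-ordering idea can be made concrete with a small sketch: when decoding strategies are scored on several metrics at once, one strategy precedes another only if it is at least as good on every criterion and strictly better on at least one; pairs where the metrics disagree stay incomparable, which is what makes the resulting ranking partial rather than total. The snippet below is a minimal illustration under the assumption of componentwise (Pareto) dominance; the strategy names and metric values are invented for demonstration and are not results from the paper, whose actual ranking machinery is more involved.

```python
from typing import Dict, List, Tuple

# Hypothetical metric profiles (coherence, diversity, negated perplexity);
# perplexity is negated so "higher is better" holds for every component.
# All values below are illustrative, not measurements from the paper.
profiles: Dict[str, Tuple[float, float, float]] = {
    "greedy":             (0.82, 0.31, -14.2),
    "top-k":              (0.74, 0.58, -21.7),
    "nucleus":            (0.71, 0.66, -24.9),
    "contrastive search": (0.79, 0.61, -18.3),
}

def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """Componentwise dominance: a >= b everywhere and a > b somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def partial_order(p: Dict[str, Tuple[float, ...]]) -> List[Tuple[str, str]]:
    """All (winner, loser) dominance pairs; incomparable pairs are simply
    omitted, which is exactly what makes the ranking partial."""
    return [(a, b) for a in p for b in p if a != b and dominates(p[a], p[b])]

def pareto_front(p: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Strategies that no other strategy dominates."""
    return [a for a in p
            if not any(dominates(p[b], p[a]) for b in p if b != a)]

if __name__ == "__main__":
    print("dominance pairs:", partial_order(profiles))
    print("Pareto front:", pareto_front(profiles))
```

With these toy numbers, contrastive search dominates top-k sampling on all three criteria, while the remaining pairs are incomparable, so the Pareto front contains three strategies. Any single summary metric, by contrast, would force a total order on exactly such incomparable cases, which is the trade-off the paper's summary metric is designed to balance.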