GLGE: A New General Language Generation Evaluation Benchmark (2011.11928v3)

Published 24 Nov 2020 in cs.CL

Abstract: Multi-task benchmarks such as GLUE and SuperGLUE have driven great progress of pretraining and transfer learning in NLP. These benchmarks mostly focus on a range of Natural Language Understanding (NLU) tasks, without considering the Natural Language Generation (NLG) models. In this paper, we present the General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks. For each task, we continue to design three subtasks in terms of task difficulty (GLGE-Easy, GLGE-Medium, and GLGE-Hard). This introduces 24 subtasks to comprehensively compare model performance. To encourage research on pretraining and transfer learning on NLG models, we make GLGE publicly available and build a leaderboard with strong baselines including MASS, BART, and ProphetNet (The source code and dataset are publicly available at https://github.com/microsoft/glge).

GLGE: A New General Language Generation Evaluation Benchmark

In the field of NLP, evaluating the performance and generalization capabilities of language generation models remains a formidable challenge. The paper "GLGE: A New General Language Generation Evaluation Benchmark" introduces the General Language Generation Evaluation (GLGE) benchmark, specifically designed to assess Natural Language Generation (NLG) capabilities across a diverse set of tasks. While benchmarks such as GLUE and SuperGLUE have traditionally focused on Natural Language Understanding (NLU), GLGE addresses the need for a comprehensive evaluation framework for NLG models.

Benchmark Design and Principles

GLGE distinguishes itself by embracing task diversity and difficulty within its framework. It comprises eight NLG tasks, ranging from abstractive text summarization to personalized dialogue generation. To cater to varied evaluation needs, each task is offered at three levels of difficulty (GLGE-Easy, GLGE-Medium, and GLGE-Hard), yielding 24 subtasks in total. The construction of GLGE tasks is grounded in principles such as diversity in task formats, controlled task difficulty, ease of automatic evaluation, and utilization of popular datasets. By incorporating widely accepted datasets and introducing new real-world datasets, GLGE ensures both relevance and rigor.
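For concreteness, the snippet below is a small illustrative sketch of that 8 × 3 structure; the task identifiers are informal shorthand for the dataset groupings discussed in the next section and may not match the naming in the official release.

```python
# Illustrative only: enumerate GLGE's 8 tasks x 3 difficulty levels = 24 subtasks.
# Identifiers are informal shorthand, not the official GLGE naming.
tasks = ["cnn_dailymail", "gigaword", "xsum", "msnews",
         "squad_qg", "msqg", "coqa", "personachat"]
levels = ["easy", "medium", "hard"]

subtasks = [f"{task}-{level}" for task in tasks for level in levels]
print(len(subtasks))  # prints 24
```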

Task and Dataset Composition

The GLGE benchmark draws on a mix of widely used and newly introduced datasets for its eight evaluation tasks, which fall into four task types:

  • Abstractive Text Summarization: CNN/DailyMail, Gigaword, XSUM, and the newly introduced MSNews.
  • Answer-aware Question Generation: SQuAD 1.1 and the newly introduced MSQG dataset.
  • Conversational Question Answering: the CoQA dataset.
  • Personalizing Dialogue: the PersonaChat dataset.

These tasks are characterized by their distinct input-output configurations, offering a comprehensive platform to evaluate various facets of NLG models.
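To make those input-output configurations concrete, here is a small illustrative sketch; the example inputs and outputs are invented for exposition and do not come from the GLGE data.

```python
# Illustrative only: hypothetical input/output pairs showing the rough shape of
# each GLGE task type. The actual data and formats are those released at
# https://github.com/microsoft/glge.
glge_task_examples = {
    "abstractive_summarization": {
        "input": "A long news article describing a local election result ...",
        "output": "A one- or two-sentence summary of the article.",
    },
    "answer_aware_question_generation": {
        "input": "Passage about the Eiffel Tower + answer span: 1889",
        "output": "When was the Eiffel Tower completed?",
    },
    "conversational_question_answering": {
        "input": "Passage + dialogue history ending in a follow-up question",
        "output": "A free-form answer consistent with the conversation.",
    },
    "personalized_dialogue": {
        "input": "Persona description + dialogue history",
        "output": "The next dialogue response, conditioned on the persona.",
    },
}

for task, example in glge_task_examples.items():
    print(f"{task}:\n  input : {example['input']}\n  output: {example['output']}\n")
```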

Evaluation Metrics and Overall Score

To facilitate objective comparison, GLGE employs standardized metrics like ROUGE, BLEU, METEOR, and F1-score across its tasks. The benchmark also defines an overall scoring mechanism that combines individual task scores, providing a single, aggregate performance measure.
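As a rough sketch of such an aggregate, one common convention is to average the metric values within each task and then average the per-task scores; the helper below illustrates that idea with invented numbers, and the exact formula should be taken from the official GLGE release.

```python
# A minimal sketch of an overall score: average metrics within each task, then
# average the per-task scores. Consult the official GLGE code for the exact
# aggregation; the numbers below are invented for illustration.
from statistics import mean

def overall_score(task_metrics):
    """task_metrics maps task name -> {metric name: score}."""
    per_task = [mean(metrics.values()) for metrics in task_metrics.values()]
    return mean(per_task)

example = {
    "summarization": {"rouge-1": 40.0, "rouge-2": 18.0, "rouge-l": 37.0},
    "question_generation": {"rouge-l": 48.0, "bleu-4": 21.0, "meteor": 24.0},
    "conversational_qa": {"f1": 65.0},
}
print(f"Overall score: {overall_score(example):.2f}")
```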

Baseline Models and Results

The paper evaluates multiple baselines: non-pretrained models such as a vanilla LSTM Seq2Seq and a Transformer, alongside pretrained models such as MASS, BART, and ProphetNet. The results highlight a significant performance gap between pretrained and non-pretrained models. Larger pretrained models such as BART-large and ProphetNet-large achieve competitive scores, underscoring the effectiveness of model pretraining.
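For context, the snippet below is an illustrative sketch (not the paper's training pipeline) of producing a summary with a publicly available BART checkpoint via the Hugging Face Transformers library; the checkpoint name and decoding settings are assumptions, and the GLGE baselines are fine-tuned per task before evaluation.

```python
# Illustrative sketch: summarize one document with a public BART checkpoint
# (facebook/bart-large-cnn, fine-tuned on CNN/DailyMail) using Hugging Face
# Transformers. This is not the GLGE evaluation pipeline.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = (
    "The city council approved a new budget on Tuesday, allocating additional "
    "funds to public transit and road maintenance over the next fiscal year."
)
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, num_beams=4, max_length=60, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```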

Implications and Future Directions

The introduction of GLGE represents a meaningful step toward a unified evaluation standard for NLG tasks. Practically, it encourages improved model designs by providing diverse and challenging benchmarks. Theoretically, it facilitates understanding of model capabilities across varied NLG tasks. As research on pretraining and transfer learning for generation continues to advance, GLGE offers a pivotal resource for developing and assessing the next generation of NLG models.

Future work may focus on integrating learned metrics such as BERTScore and BLEURT for richer evaluation, alongside exploring how well the automatic metrics correlate with human judgment. The public leaderboard invites ongoing contributions from the research community, promoting transparency and progress in model evaluation standards.
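As one example of such a learned metric, the sketch below assumes the third-party bert-score package (pip install bert-score), which is not part of the GLGE toolkit, and scores a candidate generation against a reference.

```python
# Minimal sketch of BERTScore with the bert-score package; illustrates the kind
# of learned metric mentioned as future work, not anything shipped with GLGE.
from bert_score import score

candidates = ["The council approved extra funding for public transit."]
references = ["The city council voted to increase the public transit budget."]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```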

Authors (18)
  1. Dayiheng Liu (75 papers)
  2. Yu Yan (54 papers)
  3. Yeyun Gong (78 papers)
  4. Weizhen Qi (15 papers)
  5. Hang Zhang (164 papers)
  6. Jian Jiao (44 papers)
  7. Weizhu Chen (128 papers)
  8. Jie Fu (229 papers)
  9. Linjun Shou (53 papers)
  10. Ming Gong (246 papers)
  11. Pengcheng Wang (25 papers)
  12. Jiusheng Chen (8 papers)
  13. Daxin Jiang (138 papers)
  14. Jiancheng Lv (99 papers)
  15. Ruofei Zhang (24 papers)
  16. Winnie Wu (2 papers)
  17. Ming Zhou (182 papers)
  18. Nan Duan (172 papers)
Citations (64)