Measuring Massive Multitask Language Understanding (2009.03300v3)

Published 7 Sep 2020 in cs.CY, cs.AI, cs.CL, and cs.LG

Abstract: We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

Authors (7)
  1. Dan Hendrycks (63 papers)
  2. Collin Burns (11 papers)
  3. Steven Basart (16 papers)
  4. Andy Zou (23 papers)
  5. Mantas Mazeika (27 papers)
  6. Dawn Song (229 papers)
  7. Jacob Steinhardt (88 papers)
Citations (2,927)

Summary

  • The paper introduces a benchmark that measures NLP models' multitask accuracy across 57 diverse tasks including STEM, humanities, and law.
  • It reveals that large models like GPT-3 (175B parameters) average around 43.9% accuracy, far below the human expert benchmark of 89.8%.
  • The study identifies significant calibration issues, stressing the need for enhanced pretraining and fine-tuning to improve performance in sensitive domains.

Measuring Massive Multitask Language Understanding

The paper "Measuring Massive Multitask Language Understanding" introduces a new benchmark designed to evaluate the broad academic and professional knowledge of large pre-trained LLMs. Authored by Dan Hendrycks et al., it represents a comprehensive effort to profile the depth and range of knowledge in state-of-the-art NLP models.

Overview

The primary purpose of this benchmark is to assess how well current NLP models can perform across a diverse set of 57 tasks, spanning disciplines such as elementary mathematics, US history, computer science, law, and more. The test is designed to measure multitask accuracy, focusing on tasks that require substantial world knowledge and problem-solving abilities.

Most notably, the benchmark reveals that while recent models such as GPT-3 improve on random-chance performance by almost 20 percentage points on average, they still fall significantly short of expert-level accuracy on every task. The findings also show that these models exhibit uneven performance across subjects and often fail to recognize their own errors. Particularly concerning are the near-random accuracies in socially significant domains such as law and morality.
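To make the evaluation protocol concrete, the sketch below illustrates the kind of few-shot, multiple-choice scoring the benchmark relies on. It is a minimal illustration rather than the authors' released code: `query_model` is a hypothetical stand-in for whatever model is under test, and the example dictionaries (`question`, `options`, `answer`) are an assumed format.

```python
from typing import Callable

CHOICES = ["A", "B", "C", "D"]

def format_example(question: str, options: list[str], answer: str | None = None) -> str:
    """Render one multiple-choice question; omit the answer for the question being asked."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def evaluate_subject(subject: str,
                     dev_examples: list[dict],
                     test_examples: list[dict],
                     query_model: Callable[[str], str]) -> float:
    """Few-shot accuracy on one subject; dev_examples supply the in-context shots."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_example(ex["question"], ex["options"], ex["answer"])
                        for ex in dev_examples)  # the paper uses up to 5 shots per subject
    correct = 0
    for ex in test_examples:
        prompt = header + shots + "\n\n" + format_example(ex["question"], ex["options"])
        prediction = query_model(prompt).strip()[:1].upper()  # keep only the first letter
        correct += prediction == ex["answer"]
    return correct / len(test_examples)
```

Per-subject accuracies computed this way are then aggregated into the average figures discussed in the next section.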

Key Findings

Comparative Performance

  • Model Size and Accuracy: Smaller GPT-3 models (up to 13 billion parameters) hover around random-chance accuracy, whereas the 175-billion-parameter GPT-3 model achieves an average accuracy of 43.9%. This is still substantially below the estimated human expert-level accuracy of roughly 89.8%.
  • Specialized Domains: The accuracy of the models varies widely across tasks. For instance, GPT-3 performs relatively well in certain humanities subjects, such as US Foreign Policy (~69%), compared to calculation-heavy STEM subjects like College Chemistry, where it has an accuracy of only 26%.
  • UnifiedQA Performance: Despite having far fewer parameters, the UnifiedQA model, fine-tuned on a rich set of question-answering datasets, performs notably well, achieving 48.9% average accuracy. This result underscores that fine-tuning matters in addition to raw model size.

Calibration Issues

One notable shortcoming is the considerable miscalibration of these models: GPT-3 often exhibits a significant gap between its stated confidence and its actual accuracy, with deviations of up to 24 percentage points. This miscalibration appears across disciplines, suggesting that the confidence scores these models produce are generally unreliable.
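One way to quantify such a gap, sketched below, is to bin predictions by stated confidence and compare each bin's mean confidence with its empirical accuracy. This is an illustrative ECE-style computation, not necessarily the exact calibration metric used in the paper.

```python
import numpy as np

def calibration_table(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10):
    """Per confidence bin: (mean stated confidence, empirical accuracy, count)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            rows.append((confidences[mask].mean(), correct[mask].mean(), int(mask.sum())))
    return rows

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Count-weighted average |confidence - accuracy| across bins."""
    total = len(confidences)
    return sum(abs(conf - acc) * count / total
               for conf, acc, count in calibration_table(confidences, correct, n_bins))
```

For intuition, a 24-point gap would mean that answers asserted with roughly 90% confidence are correct only about 66% of the time.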

Practical and Theoretical Implications

The results from this benchmark suggest several practical and theoretical insights:

  1. Improving Pretrained Models: The uneven performance and near-random accuracy in certain subjects point to the necessity for more balanced pretraining datasets or enhanced architectures that can capture a wider array of knowledge both robustly and uniformly.
  2. Application in Sensitive Domains: The notably poor performance in legal and ethical subjects indicates that these models should not yet be trusted in applications that require nuanced understanding of human values or legal ramifications.
  3. Future Model Development: For these models to be practically useful in more sophisticated, real-world scenarios, better calibration and error estimation mechanisms are paramount. Moreover, enhancing models to effectively learn and apply procedural knowledge (e.g., in mathematics and STEM domains) is a critical next step.

Future Directions

Future research is likely to explore multimodal understanding, incorporating data beyond text to expand model capabilities toward a more human-like learning paradigm. The effectiveness of curriculum learning, in which models are incrementally exposed to more complex tasks, could also be a promising area of inquiry. The proposed methodological shift, evaluating models on the knowledge accrued from massive, diverse pretraining corpora rather than on task-specific training sets, likewise marks a significant departure that may set a new standard for NLP evaluation.

Conclusion

The introduction of this benchmark by Hendrycks et al. is a commendable stride in comprehensively evaluating the capabilities and limitations of modern NLP models. It serves not only to highlight current deficiencies but also provides a clear direction for future advancements. Models are shown to be making tangible progress yet remain far from attaining true expert-level accuracy or reliable self-assessment, pointing to an array of research opportunities and the continuing evolution of artificial intelligence and NLP systems.
