When Do You Need Billions of Words of Pretraining Data? (2011.04946v1)

Published 10 Nov 2020 in cs.CL

Abstract: NLP is currently dominated by general-purpose pretrained language models like RoBERTa, which achieve strong performance on NLU tasks through pretraining on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? We adopt four probing methods (classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks) and draw learning curves that track the growth of these different measures of linguistic ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words. We find that LMs require only about 10M or 100M words to learn representations that reliably encode most syntactic and semantic features we test. A much larger quantity of data is needed in order to acquire enough commonsense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.

Analyzing the Necessity of Extensive Pretraining in NLP Models

The paper "When Do You Need Billions of Words of Pretraining Data?" evaluates the relationship between the amount of pretraining data and the linguistic skills acquired by LLMs (LMs) in NLP. It primarily focuses on examining Transformer-based models like RoBERTa and aims to identify whether these models require vast amounts of pretraining data to learn key linguistic features.

Experimental Approach

The research uses the MiniBERTas, a set of RoBERTa variants pretrained on different data volumes: 1M, 10M, 100M, and 1B words. RoBERTa-base, pretrained on approximately 30B words, is included as a reference point. The methodology involves four probing methods:

  1. Classifier Probing: Trains lightweight classifiers on frozen model representations to measure how readily linguistic features can be extracted from them (a simplified sketch follows this list).
  2. Minimum Description Length (MDL) Probing: An information-theoretic measure of how compactly the labels of a probing task can be encoded given the model's representations (see the codelength formula below).
  3. Unsupervised Grammaticality Judgment: Uses the BLiMP benchmark of minimal pairs to test whether the model prefers the acceptable sentence across a range of syntactic and semantic phenomena (a scoring sketch appears after this list).
  4. Fine-tuning on NLU Tasks: Measures how pretraining data volume affects performance on practical Natural Language Understanding (NLU) tasks such as SuperGLUE.
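
As a concrete illustration of classifier probing, the sketch below freezes a pretrained encoder and fits a logistic-regression probe on its token representations for a toy part-of-speech task. This is a minimal sketch under stated assumptions, not the paper's setup: the checkpoint name reflects how the released MiniBERTas appear to be hosted on the Hugging Face hub (an assumption), the toy labels are invented, and the paper itself uses the edge-probing suite with span-level annotations.

```python
# Minimal classifier-probing sketch: freeze the encoder, fit a linear probe on
# token vectors. The checkpoint name and toy data are assumptions, not the
# paper's actual edge-probing setup.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL = "nyu-mll/roberta-base-100M-1"  # assumed name of a 100M-word MiniBERTa checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL).eval()

# Toy probing data: one POS-style tag per word; real probes use treebank annotations.
sentences = ["dogs bark loudly", "cats sleep quietly"]
tags = [["NOUN", "VERB", "ADV"], ["NOUN", "VERB", "ADV"]]

X, y = [], []
for sent, sent_tags in zip(sentences, tags):
    enc = tok(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]  # frozen representations
    for i, word_id in enumerate(enc.word_ids()):      # map word pieces back to words
        if word_id is not None:                       # skip <s> and </s>
            X.append(hidden[i].numpy())
            y.append(sent_tags[word_id])

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on toy data:", probe.score(X, y))
```

In the paper, the accuracy of such probes is tracked as a function of pretraining data volume, which is how the learning curves for linguistic features are drawn.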
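
The MDL probing above is typically reported as an online codelength in the style of Voita and Titov (2020); the formula below is a standard statement of that quantity, offered as a reference rather than a guaranteed transcription of the paper's exact procedure. The $N$ labeled examples are processed in blocks of sizes $t_1 < t_2 < \dots < t_n = N$, and a fresh probe $\theta_i$ is trained on the first $t_i$ examples before it encodes the next block:

$$
L_{\text{online}} = t_1 \log_2 K \;-\; \sum_{i=1}^{n-1} \log_2 p_{\theta_i}\!\left(y_{t_i+1:t_{i+1}} \mid x_{t_i+1:t_{i+1}}\right)
$$

where $K$ is the number of label classes. A shorter codelength means the representations make the labels easier to predict, so codelength can be compared across the MiniBERTas just like probe accuracy.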
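
For the unsupervised grammaticality judgments, a masked LM like RoBERTa does not define a left-to-right sentence probability, so a common stand-in is pseudo-log-likelihood scoring: mask each token in turn and sum its log-probability. The sketch below applies this idea to one agreement-style minimal pair; the checkpoint name is again an assumption, and the paper's exact scoring procedure may differ.

```python
# Minimal BLiMP-style sketch: compare pseudo-log-likelihoods of the acceptable
# and unacceptable member of a minimal pair. Checkpoint name is an assumption.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "nyu-mll/roberta-base-100M-1"  # assumed name of a 100M-word MiniBERTa checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum log P(token_i | sentence with token_i masked) over all real tokens."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

good, bad = "The cats annoy Tim.", "The cats annoys Tim."  # agreement minimal pair
print(pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))  # True if the model prefers the acceptable sentence
```

A model's BLiMP accuracy is then the fraction of minimal pairs on which the acceptable sentence receives the higher score.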

Key Findings

The experiments reveal two major insights:

  • Feature Acquisition: Models pretrained on only 10M to 100M words already encode most of the syntactic and semantic features tested. Probing performance on tasks such as dependency labeling, semantic role labeling, and coreference resolution largely saturates around the 100M-word scale.
  • Commonsense and NLU Tasks: Far more data, well beyond 1B words, is needed to acquire the commonsense knowledge and other skills that drive downstream NLU performance. Accordingly, fine-tuned performance on NLU benchmarks like SuperGLUE keeps improving as pretraining data grows, with some tasks still benefiting at 30B words and likely beyond.

Implications and Speculation

The contrast between linguistic feature acquisition and full NLU performance implies that syntactic and semantic encoding can be learned from comparatively small corpora, while the commonsense knowledge and contextual reasoning needed for NLU demand far larger ones. This distinction invites closer examination of what kinds of knowledge large LMs actually acquire and how they represent it, and it points to opportunities for making pretraining more efficient when the goal is practical application tasks.

Looking toward future AI developments, the paper suggests several plausible research directions:

  • Data Efficiency in Pretraining: Developing techniques for acquiring commonsense knowledge from less data or from alternative, non-text-based sources.
  • Fine-tuning Methodologies: Innovating fine-tuning paradigms that better leverage the linguistic features already encoded in pretrained representations.
  • Application of Large-Scale Models in Multilingual Contexts: Exploring cross-lingual and cultural nuances in commonsense tasks when models are pretrained on varied training datasets.

By questioning how much pretraining data is truly necessary for strong NLP models, the paper opens a discussion on maximizing learning efficiency through more targeted, task-specific approaches.

Authors (4)
  1. Yian Zhang (12 papers)
  2. Alex Warstadt (35 papers)
  3. Haau-Sing Li (6 papers)
  4. Samuel R. Bowman (103 papers)
Citations (130)