ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding (2010.12148v2)

Published 23 Oct 2020 in cs.CL and cs.LG

Abstract: Coarse-grained linguistic information, such as named entities or phrases, facilitates adequate representation learning in pre-training. Previous works mainly focus on extending the objective of BERT's Masked Language Modeling (MLM) from masking individual tokens to contiguous sequences of n tokens. We argue that such a contiguous masking method neglects to model the intra-dependencies and inter-relations of coarse-grained linguistic information. As an alternative, we propose ERNIE-Gram, an explicit n-gram masking method to enhance the integration of coarse-grained information into pre-training. In ERNIE-Gram, n-grams are masked and predicted directly using explicit n-gram identities rather than contiguous sequences of n tokens. Furthermore, ERNIE-Gram employs a generator model to sample plausible n-gram identities as optional n-gram masks and predicts them in both coarse-grained and fine-grained manners to enable comprehensive n-gram prediction and relation modeling. We pre-train ERNIE-Gram on English and Chinese text corpora and fine-tune on 19 downstream tasks. Experimental results show that ERNIE-Gram outperforms previous pre-training models such as XLNet and RoBERTa by a large margin, and achieves comparable results with state-of-the-art methods. The source code and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.

Overview of ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding

Introduction

The paper introduces ERNIE-Gram, a novel approach to language model pre-training that leverages explicitly masked n-grams to enhance natural language understanding (NLU). This approach addresses the limitations of traditional masked language models (MLMs), such as BERT, which predominantly focus on individual tokens. ERNIE-Gram seeks to integrate coarse-grained linguistic information, such as phrases and named entities, into the pre-training process.

Methodology

ERNIE-Gram employs a unique methodology involving several components:

  1. Explicitly N-Gram Masking: Unlike conventional MLMs that mask sequences of contiguous tokens, ERNIE-Gram masks n-grams using explicit n-gram identities. This reduces the prediction space, making it more focused and effective.
  2. Comprehensive N-Gram Prediction: The model predicts masked n-grams in both coarse-grained and fine-grained manners, allowing for a more robust understanding of n-gram semantics.
  3. Enhanced N-Gram Relation Modeling: A generator model samples plausible n-gram identities, enabling the main model to learn subtle semantic relationships between n-grams.

The model is pre-trained on English and Chinese corpora and fine-tuned across 19 downstream tasks.
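
The released implementation is in PaddlePaddle (see the repository linked in the abstract); the snippet below is a minimal Python sketch, not the authors' code, of how explicit n-gram masking differs from contiguous token masking: each selected n-gram is collapsed into a single [MASK] slot whose training target is the n-gram's identity in an n-gram lexicon, rather than n separate token targets. The lexicon contents, helper name, and masking probability here are illustrative assumptions.

```python
import random

# Hypothetical n-gram lexicon mapping surface n-grams to explicit n-gram ids.
NGRAM_LEXICON = {("new", "york"): 0, ("machine", "learning"): 1}

def explicit_ngram_mask(tokens, mask_prob=0.15, max_n=3):
    """Sketch: mask whole lexicon n-grams and record their explicit identities.

    The corrupted sequence keeps one [MASK] slot per masked n-gram, and the
    target is the n-gram id rather than the n constituent tokens.
    """
    corrupted, targets = [], []  # targets: (position_in_corrupted, ngram_id)
    i = 0
    while i < len(tokens):
        matched_n = None
        # Greedily match the longest lexicon n-gram starting at position i.
        for n in range(max_n, 1, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in NGRAM_LEXICON:
                matched_n = n
                break
        if matched_n is not None and random.random() < mask_prob:
            ngram = tuple(t.lower() for t in tokens[i:i + matched_n])
            targets.append((len(corrupted), NGRAM_LEXICON[ngram]))
            corrupted.append("[MASK]")  # a single slot for the whole n-gram
            i += matched_n
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, targets

# Example: explicit_ngram_mask("They moved to New York last year".split())
```

By contrast, a contiguous-masking scheme would replace the same span with n separate [MASK] tokens and predict each token independently, which is exactly the behavior the paper argues against.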

Results

Empirical evaluations reveal that ERNIE-Gram significantly outperforms previous models, such as XLNet and RoBERTa, on benchmark tasks in NLU. Key highlights include strong performance on the GLUE benchmark and SQuAD, where ERNIE-Gram achieved notable improvements over the baseline methods.

Analysis

Explicit n-gram masking allows ERNIE-Gram to maintain tighter intra-dependencies within coarse-grained text units, capturing semantic details that conventional models might overlook. N-gram relation modeling over plausibly sampled n-gram identities further strengthens this advantage by modeling pairwise semantic relationships between n-grams.
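
To make the coarse-plus-fine prediction referred to above concrete, here is a small PyTorch-style sketch, an illustrative assumption rather than the paper's released architecture: a coarse head scores explicit n-gram identities for a masked slot, fine-grained heads score the constituent tokens, and the two losses are combined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComprehensiveNGramHeads(nn.Module):
    """Illustrative dual heads for comprehensive n-gram prediction
    (a sketch, not the released ERNIE-Gram implementation)."""

    def __init__(self, hidden_size, token_vocab_size, ngram_vocab_size, max_n=3):
        super().__init__()
        # Coarse-grained head: predicts the explicit n-gram identity.
        self.coarse_head = nn.Linear(hidden_size, ngram_vocab_size)
        # Fine-grained heads: predict each constituent token of the n-gram.
        self.fine_heads = nn.ModuleList(
            nn.Linear(hidden_size, token_vocab_size) for _ in range(max_n)
        )

    def forward(self, slot_states, ngram_ids, token_ids):
        # slot_states: (batch, hidden)  hidden state of each masked n-gram slot
        # ngram_ids:   (batch,)         explicit n-gram identity targets
        # token_ids:   (batch, max_n)   constituent token targets, -100 = pad
        coarse_loss = F.cross_entropy(self.coarse_head(slot_states), ngram_ids)
        fine_loss = sum(
            F.cross_entropy(head(slot_states), token_ids[:, i], ignore_index=-100)
            for i, head in enumerate(self.fine_heads)
        ) / len(self.fine_heads)
        return coarse_loss + fine_loss
```

The generator described in the methodology would additionally supply plausible alternative n-gram identities at the same slot so that relations between original and sampled n-grams can be learned; that component is omitted here for brevity.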

Implications

From a theoretical standpoint, ERNIE-Gram expands the possibilities for integrating more detailed semantic information into language models without increasing model complexity during fine-tuning. Practically, this aligns well with tasks that require a nuanced understanding of complex linguistic structures, such as named entity recognition and question answering.

Future Directions

The success of ERNIE-Gram suggests several avenues for further research:

  • Exploration of larger and more comprehensive n-gram lexicons beyond tri-grams.
  • Application of ERNIE-Gram in multi-lingual contexts to assess its adaptability across diverse linguistic datasets.
  • Scaling to larger model sizes to study scaling behavior and the impact on more resource-intensive tasks.

In conclusion, ERNIE-Gram represents a significant advance in integrating n-gram semantics into language model pre-training, offering both theoretical insights and practical gains on current NLU challenges.

Authors (7)
  1. Dongling Xiao
  2. Yu-Kun Li
  3. Han Zhang
  4. Yu Sun
  5. Hao Tian
  6. Hua Wu
  7. Haifeng Wang
Citations (36)