- The paper presents a novel sketch-based pre-training method that uses extreme masking (73% ratio) to generate full-length text from minimal input.
- It builds on a BART-based transformer and is pre-trained on about 27 million sketch-text pairs from the C4-realnewslike corpus, supporting diverse and coherent text generation.
- The GeniusAug framework boosts NLP performance by augmenting training datasets, significantly improving accuracy in classification, NER, and MRC, especially in low-resource scenarios.
GENIUS: Advancing Sketch-Based Language Model Pre-training
The paper introduces GENIUS, a novel model designed for sketch-based text generation, where the goal is to generate full-length text from a sketch of key information. Its pre-training strategy of extreme and selective masking distinguishes it from conventional models like BERT, BART, or T5, whose pre-training corrupts only a modest fraction of the input and masks spans at random rather than selectively.
Model and Methodology
GENIUS builds on BART's encoder-decoder transformer architecture while introducing a new pre-training objective: reconstructing the full text from a sketch produced with a high masking ratio (~73%). This objective enables effective generation from minimal sketches containing only essential keywords or phrases. The model is pre-trained on the C4-realnewslike corpus, comprising about 27 million sketch-text pairs and offering significant textual diversity.
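To make the pre-training setup concrete, the toy function below builds a sketch by keeping roughly 27% of a passage's tokens (i.e. masking ~73%) and preferring content words. Note this is an illustrative stand-in: the paper's actual sketch extractor is a proper keyword-extraction step, and `make_sketch`, its stopword list, and its length-based saliency heuristic are assumptions made here for demonstration only.

```python
import re

# Crude stopword list used as a saliency signal; the paper's real
# extractor uses keyword extraction rather than this heuristic.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "was",
             "for", "on", "that", "with", "as", "from", "only"}

def make_sketch(text: str, keep_ratio: float = 0.27) -> str:
    """Keep roughly `keep_ratio` of the tokens (i.e. mask ~73%),
    preferring long content words, and emit them in original order."""
    tokens = re.findall(r"\w+", text.lower())
    budget = max(1, round(len(tokens) * keep_ratio))
    # Rank indices: content words before stopwords, longer words first.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: (tokens[i] in STOPWORDS, -len(tokens[i])))
    kept = sorted(ranked[:budget])  # restore surface order
    return " ".join(tokens[i] for i in kept)

text = ("The model is pre-trained to reconstruct the full passage "
        "from a sketch containing only the most informative keywords")
sketch = make_sketch(text)
# -> "trained reconstruct containing informative keywords"
```

During pre-training, such (sketch, text) pairs drive a standard sequence-to-sequence reconstruction loss: the sketch is the encoder input and the original passage is the decoder target.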
The paper also introduces GeniusAug, a data augmentation framework that enhances NLP models' performance by enriching training datasets with generated text. By extracting target-aware sketches from existing examples and feeding them to GENIUS, GeniusAug generates contextually rich, diverse text, striking a balance between overly conservative and overly aggressive augmentation methods.
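The key idea of "target-aware" extraction is that sketches should preserve the parts of an example tied to its label or task target. The hypothetical `target_aware_sketch` below illustrates this with literal term matching; the paper's actual relevance scoring is more sophisticated, so treat the matching rule and filler heuristic here as assumptions for illustration.

```python
def target_aware_sketch(text: str, target_terms, keep_ratio: float = 0.3) -> str:
    """Toy target-aware sketch: always keep tokens matching the task
    target (e.g. label-related terms), then fill the remaining budget
    with the longest other words as a crude saliency proxy."""
    tokens = text.lower().split()
    budget = max(1, round(len(tokens) * keep_ratio))
    targets = {t.lower() for t in target_terms}
    keep = {i for i, tok in enumerate(tokens) if tok.strip(".,") in targets}
    others = sorted((i for i in range(len(tokens)) if i not in keep),
                    key=lambda i: -len(tokens[i]))
    for i in others:
        if len(keep) >= budget:
            break
        keep.add(i)
    return " ".join(tokens[i] for i in sorted(keep))

review = "the acting was wonderful and the plot kept me hooked until the end"
aug_sketch = target_aware_sketch(review, {"wonderful", "hooked"})
```

Each such sketch would then be passed to the generator to synthesize a new, label-consistent training example, since the sentiment-bearing words survive the masking.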
The empirical evaluation of GENIUS spans multiple dimensions, covering both its proficiency in text generation and its utility for data augmentation across NLP tasks. Generated text was evaluated with metrics such as perplexity, sketch retention, and diversity, demonstrating fluent, coherent output. GeniusAug proved particularly effective at improving model performance in text classification, named entity recognition (NER), and machine reading comprehension (MRC) across diverse datasets.
Quantitatively, GeniusAug delivered substantial accuracy improvements, most notably in low-resource scenarios and out-of-distribution generalization tasks. Compared with existing augmentation methods such as EDA and LAMBADA, its sketch-based generation produced richer, more relevant training data and outperformed them.
Implications and Future Directions
GENIUS potentially redefines pre-training methodologies by emphasizing flexible input requirements, thereby expanding applications in domains requiring generative assistance, like story generation or more advanced human-computer interaction environments. Furthermore, the introduction of GeniusAug suggests a broader paradigm shift in data augmentation practices, particularly beneficial for low-resource domains.
Future research could build upon GENIUS by enhancing its attribute control features, refining sketch extraction methodologies, or scaling up the model using more extensive datasets and leveraging larger model architectures. Integrating GENIUS with domain-specific knowledge sources could further amplify its utility across specialized fields, such as legal or medical text generation.
The paper contributes significantly to the NLP landscape, showing how minimal key information can seed coherent, extended text, and thus opening avenues for greater creativity and precision in language-model applications.