
BARThez: a Skilled Pretrained French Sequence-to-Sequence Model (2010.12321v2)

Published 23 Oct 2020 in cs.CL

Abstract: Inductive transfer learning has taken the entire NLP field by storm, with models such as BERT and BART setting new state of the art on countless NLU tasks. However, most of the available models and research have been conducted for English. In this work, we introduce BARThez, the first large-scale pretrained seq2seq model for French. Being based on BART, BARThez is particularly well-suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from a novel summarization dataset, OrangeSum, that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French LLMs such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez' corpus, and show our resulting model, mBARThez, to significantly boost BARThez' generative performance. Code, data and models are publicly available.

Overview of "BARThez: A Skilled Pretrained French Sequence-to-Sequence Model"

The paper presents BARThez, a sequence-to-sequence model pretrained specifically for French, extending work on pretrained models that has so far been conducted predominantly for English. BARThez is based on the BART architecture, which pairs a bidirectional encoder with a left-to-right autoregressive decoder, making it well suited to generative natural language processing tasks. The researchers evaluate BARThez on discriminative tasks from the FLUE benchmark and on generative tasks from OrangeSum, a novel summarization dataset they created for this research.
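Because BARThez follows the standard BART interface and is publicly released, it can in principle be loaded with the usual Hugging Face transformers seq2seq classes. The sketch below is illustrative only: the hub ID moussaKam/barthez is an assumption about where the released weights are hosted, not something stated in this summary.

```python
# Minimal sketch: loading a BART-style French seq2seq checkpoint with Hugging Face
# transformers. The hub ID is an assumption; substitute the ID from the official
# repository if it differs.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "moussaKam/barthez"  # assumed hub ID for the released BARThez weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Encode a French sentence and run it through the encoder-decoder stack.
inputs = tokenizer("Le transfert inductif a transformé le TAL.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```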

Key Points and Methodology

  • Pretraining Architecture: BARThez adopts the BART architecture, a denoising autoencoder that pairs a Transformer-based bidirectional encoder with a left-to-right autoregressive decoder (a simplified sketch of this denoising objective follows the list).
  • Training Corpus: The model was pretrained on a substantial corpus adapted from FlauBERT's resources, amounting to 101 GB of French text. This corpus included data from CommonCrawl, NewsCrawl, Wikipedia, and other sources.
  • Performance Evaluation: BARThez is competitive with existing BERT-based French language models, notably CamemBERT and FlauBERT, performing well on sentiment analysis, paraphrase identification, and natural language inference. The researchers also highlight its proficiency on generative tasks, likely owing to BART's underlying seq2seq architecture.
  • Innovations with mBARThez: The authors also continued the pretraining of a multilingual BART on BARThez's corpus, producing mBARThez. The adapted model showed significant improvements on generative tasks over the baseline BARThez, underscoring the benefits of language-specific adaptation of multilingual models.
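
To make the denoising objective in the first bullet concrete, here is a simplified sketch of BART-style text infilling, in which contiguous spans of tokens are collapsed to a single mask token and the decoder is trained to reconstruct the original sequence. The Poisson(λ=3) span lengths follow the original BART recipe; the exact noising parameters used for BARThez are not given here, so treat this purely as an illustration.

```python
import numpy as np

MASK_TOKEN = "<mask>"  # placeholder; the real tokenizer defines its own mask symbol

def text_infilling(tokens, mask_ratio=0.3, poisson_lambda=3.0):
    """Replace random contiguous spans with a single mask token (BART-style infilling).

    Simplified sketch: works on a list of string tokens rather than subword IDs and
    ignores edge cases such as overlapping spans.
    """
    tokens = list(tokens)
    target = int(len(tokens) * mask_ratio)
    masked = 0
    while masked < target and tokens:
        span = max(1, int(np.random.poisson(poisson_lambda)))  # span length ~ Poisson(lambda)
        start = np.random.randint(0, len(tokens))              # random span start
        end = min(start + span, len(tokens))
        masked += end - start
        tokens[start:end] = [MASK_TOKEN]                       # whole span -> one mask token
    return tokens

# The encoder sees the corrupted sequence; the decoder learns to reconstruct the original.
original = "le modèle est préentraîné sur un large corpus de texte français".split()
print(text_infilling(original))
```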

Results and Implications

The results show that BARThez, despite being a base-sized model with fewer parameters than its competitors, rivals other French models in performance. The paper also indicates that leveraging multilingual models and further adapting them to a specific language can yield notable gains, as evidenced by mBARThez's improved results.

Implications for AI and NLP:

  • Generative Task Performance: The seq2seq architecture underlying BARThez offers strong potential for applications requiring high-quality text generation, such as summarization and translation (see the usage sketch after this list).
  • Language-Specific Adaptations: Models like mBARThez underline the value of continuing the pretraining of multilingual models on monolingual corpora to further improve downstream performance.
  • Public Accessibility: The decision to publicly release BARThez and the OrangeSum dataset supports accessibility and further development in French NLP research, encouraging applications in both commercial and academic settings.
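
As an illustration of the summarization use case above, the following sketch runs abstractive summarization through the transformers pipeline API. The checkpoint ID moussaKam/barthez-orangesum-abstract is an assumed name for a BARThez model fine-tuned on OrangeSum; substitute whichever fine-tuned checkpoint is actually released.

```python
# Hypothetical usage sketch: abstractive summarization with a BARThez checkpoint
# fine-tuned on OrangeSum. The hub ID below is an assumption.
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="moussaKam/barthez-orangesum-abstract",  # assumed fine-tuned checkpoint ID
)

article = (
    "Le groupe Orange a annoncé ce matin le lancement d'une nouvelle offre de fibre "
    "optique destinée aux zones rurales, avec un déploiement prévu sur trois ans."
)
print(summarizer(article, max_length=32, min_length=8, do_sample=False)[0]["summary_text"])
```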

Future Directions

The paper points to language-adaptive pretraining as a promising direction, given its impact on generative performance. Future research might explore how to allocate training resources for large-scale language models without sacrificing downstream performance, or extend these approaches to other languages underrepresented in NLP. The availability of datasets like OrangeSum could serve as a catalyst for such work.
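
For readers who want a concrete picture of language-adaptive pretraining in the spirit of mBARThez, the sketch below continues training a multilingual BART checkpoint on monolingual French text with a denoising objective. The starting checkpoint, the toy in-memory corpus, the single-token corruption, and the hyperparameters are all placeholders rather than the authors' actual setup.

```python
# Minimal sketch of language-adaptive continued pretraining: keep training a
# multilingual BART on monolingual French text to reconstruct clean sentences
# from corrupted inputs. Everything here is illustrative, not the paper's recipe.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/mbart-large-cc25"  # multilingual BART starting point
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "fr_XX"              # treat inputs and targets as French
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

french_corpus = [
    "Le transfert inductif a transformé le traitement automatique des langues.",
    "Les modèles préentraînés sont ensuite affinés sur des tâches spécifiques.",
]

model.train()
for sentence in french_corpus:
    corrupted = sentence.replace("les", "<mask>")  # stand-in for real span corruption
    batch = tokenizer(corrupted, return_tensors="pt")
    labels = tokenizer(sentence, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss      # reconstruct the clean sentence
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {loss.item():.3f}")
```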
