
CamemBERT: a Tasty French Language Model (1911.03894v3)

Published 10 Nov 2019 in cs.CL

Abstract: Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models --in all languages except English-- very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

An Expert Review of "CamemBERT: a Tasty French Language Model"

Authors: Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot.

Overview

The paper presents "CamemBERT," a monolingual Transformer-based language model designed specifically for French. Built on the RoBERTa architecture, CamemBERT is evaluated across diverse NLP tasks: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER), and natural language inference (NLI). The authors advocate the use of web-crawled corpora such as OSCAR over traditional sources like Wikipedia and, notably, demonstrate that high performance can be achieved with a relatively small dataset of about 4GB, challenging the conventional assumption that large-scale pretraining data is required.

Key Contributions

  1. Monolingual Focus: The paper is among the first to release a monolingual pretrained language model for French, built on open corpora such as OSCAR, whereas most previous models targeted English or bundled many languages together.
  2. Data Efficiency: By evaluating models pretrained on different datasets and sizes, the authors provide evidence that substantial performance can be obtained from smaller corpora (~4GB), thus reducing the required computational resources.
  3. Extensive Evaluation: CamemBERT is rigorously tested on multiple downstream tasks, establishing new state-of-the-art benchmarks specifically for the French language.
  4. Publicly Available Model: The pretrained CamemBERT model and its training data are made accessible under an MIT open-source license, promoting reproducibility and further research (a minimal loading sketch is given below).

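Because the pretrained checkpoints are distributed publicly, the model can be loaded directly in common NLP toolkits. The sketch below assumes the Hugging Face `transformers` library and the commonly published `camembert-base` checkpoint identifier; it simply queries the masked-language-modeling head and is not the authors' own evaluation code.

```python
# Minimal sketch: loading a released CamemBERT checkpoint with the Hugging
# Face `transformers` library (assumed distribution channel; "camembert-base"
# is the commonly published checkpoint name, not necessarily the paper's own).
import torch
from transformers import CamembertForMaskedLM, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")
model.eval()

# Fill-in-the-blank query against the masked language modeling head.
text = "Le camembert est un fromage <mask>."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and print the five most likely completions.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices.squeeze(0)
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```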
Detailed Analysis

Architecture and Pretraining:

CamemBERT adapts the RoBERTa architecture, which improves upon BERT by incorporating dynamic masking, eliminating the next-sentence prediction task, and pretraining on more data with larger batch sizes. The model is trained with the masked language modeling (MLM) objective, and whole-word masking shows slight advantages on semantically demanding tasks such as NLI.
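To make the masking strategy concrete, here is an illustrative whole-word masking sketch. It is not the authors' implementation; the subword pieces, word-start flags, and masking probability are hypothetical, but it shows how all subword pieces of a selected word are masked together so the model must reconstruct entire words from context.

```python
# Illustrative sketch of whole-word masking for the MLM objective (not the
# authors' exact implementation): subword pieces belonging to one word are
# masked together, so the model must predict the whole word from context.
import random

def whole_word_mask(tokens, is_word_start, mask_token="<mask>", mask_prob=0.15):
    """tokens: list of subword pieces; is_word_start: parallel list of bools."""
    # Group subword indices into words using the word-start flags.
    words, current = [], []
    for i, start in enumerate(is_word_start):
        if start and current:
            words.append(current)
            current = []
        current.append(i)
    if current:
        words.append(current)

    masked = list(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:          # mask every piece of the selected word
                masked[i] = mask_token
    return masked

# Example with hypothetical subword pieces for "Le camembert est délicieux".
pieces = ["▁Le", "▁came", "mbert", "▁est", "▁dé", "licieux"]
starts = [True, True, False, True, True, False]
print(whole_word_mask(pieces, starts, mask_prob=0.5))
```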

Downstream Task Performance:

  1. POS Tagging and Dependency Parsing: CamemBERT surpasses previous state-of-the-art results both when fine-tuned and when used as a source of contextual embeddings, suggesting that complex task-specific architectures add little once a strong pretrained model is available.
  2. Named Entity Recognition (NER): The fine-tuned model reaches an F1 score of 89.08, surpassing traditional and neural CRF-based baselines, with a further slight edge when CamemBERT is used as contextual embeddings (see the fine-tuning sketch after this list).
  3. Natural Language Inference (NLI): Outperforming multilingual models like mBERT and XLM-R, CamemBERT (BASE) achieves an accuracy of 82.5%, which is further improved to 85.7% with the LARGE variant, highlighting the model’s semantic understanding capabilities.
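The fine-tuning results above come from adding a lightweight classification head on top of the pretrained encoder. The following minimal sketch shows such a setup for token-level tasks like NER using the Hugging Face `transformers` API; the tag set, example sentence, and label alignment are hypothetical placeholders rather than the paper's exact training pipeline.

```python
# Minimal fine-tuning sketch for token-level tasks such as NER, assuming the
# Hugging Face `transformers` API; the label set and data below are
# hypothetical placeholders, not the paper's training setup.
import torch
from transformers import CamembertForTokenClassification, CamembertTokenizerFast

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # hypothetical tag set
tokenizer = CamembertTokenizerFast.from_pretrained("camembert-base")
model = CamembertForTokenClassification.from_pretrained(
    "camembert-base", num_labels=len(labels))

# One toy example; a real run would iterate over a labeled NER corpus.
words = ["Emmanuel", "Macron", "visite", "Lyon", "."]
word_labels = [1, 2, 0, 3, 0]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level labels to subword pieces.
aligned, prev = [], None
for word_id in enc.word_ids(batch_index=0):
    if word_id is None:
        aligned.append(-100)              # special tokens ignored by the loss
    elif word_id != prev:
        aligned.append(word_labels[word_id])
    else:
        aligned.append(-100)              # only the first subword piece is labeled
    prev = word_id
enc["labels"] = torch.tensor([aligned])

loss = model(**enc).loss                  # cross-entropy over the tag set
loss.backward()                           # an optimizer step would follow
```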

Impact of Corpus Origin and Size:

The paper provides a nuanced comparison between diverse datasets (Wikipedia, OSCAR, CCNet) and concludes that heterogeneous and noisier datasets like OSCAR can yield superior model performance. Notably, their analysis reveals that 4GB datasets are often sufficient to train effective LLMs, suggesting significant computational savings.
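A practical consequence of the 4GB finding is that a usable pretraining corpus can be subsampled from OSCAR rather than downloaded in full. The sketch below assumes the Hugging Face `datasets` library and the `oscar` dataset with its `unshuffled_deduplicated_fr` configuration on the Hub; the paper's own subsampling procedure may differ.

```python
# Hedged sketch: streaming roughly 4GB of French text from OSCAR with the
# Hugging Face `datasets` library. The "oscar" dataset and its
# "unshuffled_deduplicated_fr" configuration are assumed to be available on
# the Hub; this is not the authors' subsampling procedure.
from datasets import load_dataset

TARGET_BYTES = 4 * 1024**3          # ~4GB of raw text, as in the paper's ablation
stream = load_dataset("oscar", "unshuffled_deduplicated_fr",
                      split="train", streaming=True)

collected, size = [], 0
for doc in stream:
    text = doc["text"]
    collected.append(text)
    size += len(text.encode("utf-8"))
    if size >= TARGET_BYTES:
        break

print(f"Collected {len(collected)} documents, {size / 1024**3:.2f} GB")
```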

Implications and Future Directions:

Practically, CamemBERT enables more robust NLP applications in French, such as anonymization of legal texts and question answering on benchmarks like FQuAD. Theoretically, it underscores the viability of monolingual models trained on smaller, heterogeneous datasets. Future work could explore the trade-off between monolingual and multilingual models, and the dataset characteristics best suited to particular languages and tasks.

Conclusion

CamemBERT represents a pivotal step in NLP for non-English languages by providing a high-performing, resource-efficient model for French. It challenges the assumption that very large pretraining corpora are necessary, encouraging the development of strong language models for other languages. The open availability of CamemBERT further strengthens its impact on research reproducibility and application development. Future studies could build on these insights to optimize and broaden language models in low-resource settings.
