RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use (2211.08192v1)

Published 15 Nov 2022 in cs.CL and cs.LG

Abstract: Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base models for finetuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus, e.g. corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate whether our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens. We found that for certain language tasks this update results in a significant performance increase. These results highlight the benefit of continually updating a language model to account for evolving language use.
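The tokenizer update described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: RobBERT uses a subword tokenizer, whereas the toy below works on whitespace-separated words, and the function names `find_new_tokens` and `extend_vocab`, the `min_count` threshold, and the example sentences are all hypothetical. The one property it tries to show is that new tokens are appended after the existing vocabulary, so that existing token ids (and therefore existing embeddings) stay valid, which is what a plug-in replacement would need.

```python
from collections import Counter


def find_new_tokens(corpus, vocab, min_count=2):
    """Return frequent tokens in `corpus` that are missing from `vocab`.

    Word-level simplification (assumption): a real subword tokenizer
    would learn merges rather than count whole words.
    """
    counts = Counter(word.lower() for line in corpus for word in line.split())
    candidates = [(tok, n) for tok, n in counts.items()
                  if n >= min_count and tok not in vocab]
    # Most frequent first; break ties alphabetically so the result is deterministic.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [tok for tok, _ in candidates]


def extend_vocab(vocab, new_tokens):
    """Append new tokens after the existing ids, leaving old ids untouched."""
    extended = dict(vocab)
    for tok in new_tokens:
        if tok not in extended:
            extended[tok] = len(extended)
    return extended


# Hypothetical "2019" vocabulary and a tiny "2022" corpus.
old_vocab = {"de": 0, "het": 1, "is": 2, "een": 3}
corpus = [
    "de avondklok is een coronamaatregel",
    "het coronavirus is een virus",
    "de avondklok en het coronavirus",
    "coronavirus",
]

new_tokens = find_new_tokens(corpus, old_vocab, min_count=2)
extended = extend_vocab(old_vocab, new_tokens)
print(new_tokens)
print(extended)
```

In a real setting, the embedding matrix of the model would also have to grow by one row per added token before further pre-training; that step is omitted here.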

Authors: Pieter Delobelle, Thomas Winters, Bettina Berendt
