Playing with Words at the National Library of Sweden -- Making a Swedish BERT (2007.01658v1)

Published 3 Jul 2020 in cs.CL

Abstract: This paper introduces the Swedish BERT ("KB-BERT") developed by the KBLab for data-driven research at the National Library of Sweden (KB). Building on recent efforts to create transformer-based BERT models for languages other than English, we explain how we used KB's collections to create and train a new language-specific BERT model for Swedish. We also present the results of our model in comparison with existing models - chiefly that produced by the Swedish Public Employment Service, Arbetsförmedlingen, and Google's multilingual M-BERT - where we demonstrate that KB-BERT outperforms these in a range of NLP tasks from named entity recognition (NER) to part-of-speech tagging (POS). Our discussion highlights the difficulties that continue to exist given the lack of training data and testbeds for smaller languages like Swedish. We release our model for further exploration and research here: https://github.com/Kungbib/swedish-bert-models.

Citations (117)

Summary

  • The paper presents the creation of KB-BERT, a Swedish-specific BERT model that significantly improves natural language processing for the Swedish language.
  • It details the compilation and preprocessing of an extensive, diverse Swedish corpus, overcoming OCR and segmentation challenges to enhance data quality.
  • The study benchmarks KB-BERT against multilingual and existing Swedish models, demonstrating superior NER performance and identifying areas for further improvement.

An Overview of the Development and Evaluation of Swedish BERT (KB-BERT)

The paper "Playing with Words at the National Library of Sweden - Making a Swedish BERT" presents the development, training, and evaluation of a Swedish-specific BERT model, named KB-BERT, by the KBLab at the National Library of Sweden. In an effort to improve NLP capabilities for the Swedish language, the research explores the application of Bidirectional Encoder Representations from Transformers (BERT) in creating a high-performing LLM tailored specifically to handle Swedish language data.

Creation of a Swedish Corpus

One pivotal aspect of the paper is the compilation and preprocessing of a vast and diverse Swedish text corpus, drawn from sources including digitized newspapers, government reports, legal e-deposits, Swedish Wikipedia, and social media content. The corpus comprises approximately 3.5 billion words across different types of text files, with a noticeable emphasis on newspaper content due to its diversity; incorporating data from a range of domains is nonetheless deemed beneficial to the model's overall coverage and robustness.

The paper addresses significant preprocessing challenges, such as correcting frequent OCR errors and handling sentence and paragraph segmentation so that the text is of sufficiently high quality for training. Additionally, emojis are deliberately retained, particularly in social media text, since they are considered necessary for analyzing modern conversational Swedish.
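
To make this kind of cleaning step concrete, here is a minimal, hypothetical sketch in Python. The OCR substitution pairs and the segmentation heuristic are placeholders chosen for illustration, not the actual rules or tooling used by KBLab.

```python
import re

# Hypothetical OCR substitutions; the real corrections would target the
# specific errors found in KB's digitized newspaper material.
OCR_FIXES = {
    "0ch": "och",   # letter "o" misread as a zero (placeholder example)
    "ﬁ": "fi",      # ligature normalization
}

# Naive segmenter: split after sentence-final punctuation that is followed by
# whitespace and an uppercase letter (including Swedish Å/Ä/Ö).
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+(?=[A-ZÅÄÖ])")

def clean_and_segment(text: str) -> list[str]:
    """Apply simple OCR fixes, then split into sentences; emojis are left
    untouched so that social media text keeps its original form."""
    for wrong, right in OCR_FIXES.items():
        text = text.replace(wrong, right)
    return [s.strip() for s in SENTENCE_SPLIT.split(text) if s.strip()]

if __name__ == "__main__":
    sample = "Det här är en mening 😊. Här kommer en till! Är detta den tredje?"
    print(clean_and_segment(sample))   # the emoji survives in the first sentence
```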

Training and Evaluation of KB-BERT

The KB-BERT model was pre-trained with Google's BERT framework, a computationally intensive process that partly relied on Google's TensorFlow Research Cloud resources. Evaluation was carried out on named entity recognition (NER) and part-of-speech (POS) tagging tasks. The results demonstrate that KB-BERT surpasses existing multilingual models such as Google's M-BERT, as well as Swedish-specific models such as the one developed by Arbetsförmedlingen.

  • In NER tasks, KB-BERT shows superior performance with a higher average FB1 score compared to competing models. In POS tagging, though the performance difference is not as pronounced, the improvement remains evident.
  • For harder NER tasks, some inconsistencies are noted, with varied tagging performance during different training runs, indicating an area for further investigation.
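
As a practical complement to these results, the sketch below shows how the publicly released model could be tried out for Swedish NER with the Hugging Face `transformers` library. The model identifier follows the naming documented in the linked repository and should be treated as an assumption to verify there; this is an illustrative usage example, not the evaluation setup used in the paper.

```python
# Minimal sketch: running Swedish NER with a released KB-BERT checkpoint via
# Hugging Face transformers. The model name below is an assumption based on
# the repository's documentation; check
# https://github.com/Kungbib/swedish-bert-models for the current names.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="KB/bert-base-swedish-cased-ner",      # NER fine-tuned checkpoint (assumed name)
    tokenizer="KB/bert-base-swedish-cased-ner",
    aggregation_strategy="simple",               # merge word pieces into whole entity spans
)

print(ner("Kungliga biblioteket ligger på Humlegården i Stockholm."))
# Illustrative output: entity spans such as "Kungliga biblioteket" (ORG)
# and "Stockholm" (LOC) with confidence scores.
```

The `aggregation_strategy` option merges word-piece predictions into whole entity spans, which makes the output easier to compare against token-level gold annotations.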

Key Insights and Implications

The findings underscore several meaningful insights:

  1. Data Volume and Diversity: As expected, the performance gain from KB-BERT over multilingual models highlights the advantage of large-scale and language-specific training data. Furthermore, incorporating diverse linguistic registers, including colloquial and less formal text, results in a model better equipped to handle varying types of input.
  2. Resource Challenges for Swedish: The paper elucidates the ongoing challenge of limited testbeds for language model evaluation in Swedish. Addressing this gap is crucial for progress in multilingual and monolingual NLP beyond well-resourced languages like English.
  3. Path Forward: Acknowledging these limitations, the researchers plan to collaborate on initiatives aimed at enhancing Swedish NLP testbeds. This endeavor reinforces the need for comprehensive evaluation mechanisms to foster the development of more effective Swedish language models.

Overall, the development of KB-BERT represents a significant step forward for NLP in Swedish, setting a benchmark for future advancements and offering valuable lessons in the management of language resources and model training strategies. The release of KB-BERT for public use fosters broader exploration and potential breakthroughs in natural language understanding for Swedish and other low-resource languages.
