
Pre-training Polish Transformer-based Language Models at Scale (2006.04229v2)

Published 7 Jun 2020 in cs.CL

Abstract: Transformer-based LLMs are now widely used in NLP. This is especially true for the English language, for which many pre-trained models using transformer-based architectures have been published in recent years. This has driven forward the state of the art for a variety of standard NLP tasks such as classification, regression, and sequence labeling, as well as text-to-text tasks such as machine translation, question answering, and summarization. The situation has been different for low-resource languages such as Polish, however. Although some transformer-based LLMs for Polish are available, none of them have come close to the scale, in terms of corpus size and number of parameters, of the largest English models. In this study, we present two LLMs for Polish based on the popular BERT architecture. The larger model was trained on a dataset consisting of over 1 billion Polish sentences, or 135GB of raw text. We describe our methodology for collecting the data, preparing the corpus, and pre-training the model. We then evaluate our models on thirteen Polish linguistic tasks and demonstrate improvements over previous approaches in eleven of them.
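
To make the pre-training setup concrete, below is a minimal sketch of masked language model pre-training on a plain-text corpus using the HuggingFace transformers library. The file paths, tokenizer, and hyperparameters are illustrative placeholders only and do not reflect the exact configuration used by the authors.

```python
# Minimal masked-LM pre-training sketch with HuggingFace transformers.
# Paths, tokenizer, and hyperparameters are hypothetical, not the paper's setup.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumes a WordPiece tokenizer already trained on the Polish corpus
# (hypothetical path) and a plain-text corpus with one sentence per line.
tokenizer = BertTokenizerFast.from_pretrained("path/to/polish-tokenizer")
corpus = load_dataset("text", data_files={"train": "path/to/polish_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# A BERT-large-like configuration; the paper's exact model sizes may differ.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
)
model = BertForMaskedLM(config)

# Standard 15% token masking for the masked language modeling objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="polish-bert-pretraining",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=64,
    learning_rate=1e-4,
    max_steps=100_000,
    save_steps=10_000,
    logging_steps=500,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=collator,
).train()
```

After pre-training, the resulting checkpoint can be fine-tuned on downstream Polish tasks (classification, regression, sequence labeling) with the standard task-specific heads provided by the same library.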

Authors (3)
  1. Sławomir Dadas (11 papers)
  2. Michał Perełkiewicz (7 papers)
  3. Rafał Poświata (9 papers)
Citations (34)
