
Trained on 100 million words and still in shape: BERT meets British National Corpus (2303.09859v3)

Published 17 Mar 2023 in cs.CL

Abstract: While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source -- the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
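
The comparative studies mentioned in the abstract center on masked language modelling as the pre-training objective. The sketch below illustrates that objective in Python with the Hugging Face transformers API; the generic bert-base-cased checkpoint is used purely as a stand-in, since the paper's own BNC-trained LTG-BERT weights are not referenced in this excerpt, and the 15% masking rate follows the original BERT recipe rather than anything stated here.

```python
# Minimal sketch of the masked language modelling (MLM) objective.
# Assumption: "bert-base-cased" is a stand-in checkpoint, not the paper's LTG-BERT weights.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

text = "The British National Corpus contains about 100 million words."
inputs = tokenizer(text, return_tensors="pt")
labels = inputs["input_ids"].clone()

# Randomly mask 15% of the non-special tokens (original BERT recipe).
special = torch.tensor(
    tokenizer.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True),
    dtype=torch.bool,
)
mask = (torch.rand(labels.shape) < 0.15) & ~special.unsqueeze(0)
inputs["input_ids"][mask] = tokenizer.mask_token_id
labels[~mask] = -100  # only masked positions contribute to the loss

outputs = model(**inputs, labels=labels)
print(f"MLM loss: {outputs.loss.item():.3f}")
```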

Authors (4)
  1. David Samuel (23 papers)
  2. Andrey Kutuzov (41 papers)
  3. Lilja Øvrelid (42 papers)
  4. Erik Velldal (31 papers)
Citations (23)
