
Trained on 100 million words and still in shape: BERT meets British National Corpus (2303.09859v3)

Published 17 Mar 2023 in cs.CL

Abstract: While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source -- the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
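
The comparative studies mentioned in the abstract center on masked language modelling as the pre-training objective. The sketch below illustrates that objective in Python with the Hugging Face transformers API; the generic bert-base-cased checkpoint is used purely as a stand-in, since the paper's own BNC-trained LTG-BERT weights are not referenced in this excerpt, and the 15% masking rate follows the original BERT recipe rather than anything stated here.

```python
# Minimal sketch of the masked language modelling (MLM) objective.
# Assumption: "bert-base-cased" is a stand-in checkpoint, not the paper's LTG-BERT weights.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

text = "The British National Corpus contains about 100 million words."
inputs = tokenizer(text, return_tensors="pt")
labels = inputs["input_ids"].clone()

# Randomly mask 15% of the non-special tokens (original BERT recipe).
special = torch.tensor(
    tokenizer.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True),
    dtype=torch.bool,
)
mask = (torch.rand(labels.shape) < 0.15) & ~special.unsqueeze(0)
inputs["input_ids"][mask] = tokenizer.mask_token_id
labels[~mask] = -100  # only masked positions contribute to the loss

outputs = model(**inputs, labels=labels)
print(f"MLM loss: {outputs.loss.item():.3f}")
```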

Authors (4)
  1. David Samuel (23 papers)
  2. Andrey Kutuzov (41 papers)
  3. Lilja Øvrelid (42 papers)
  4. Erik Velldal (31 papers)
Citations (23)
