Metadata Might Make Language Models Better

Published 18 Nov 2022 in cs.CL and cs.DL | (2211.10086v1)

Abstract: This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al. (2022) and compare different strategies for inserting temporal, political and geographical information into a masked language model. After fine-tuning several DistilBERT models on the enhanced input data, we systematically evaluate them on three tasks: pseudo-perplexity, metadata mask-filling and supervised classification. We find that showing relevant metadata to a language model has a beneficial impact and may even produce more robust and fairer models.
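
As an illustration of what inserting metadata into the input and then masking it might look like, here is a minimal sketch in plain Python. The abstract does not specify the input format, so everything here is an assumption: the function names `add_metadata` and `mask_metadata`, the field order (year, political leaning, place), the `[SEP]` separator, and the restriction to single-word metadata values are all hypothetical choices made for the example, not the authors' method.

```python
MASK = "[MASK]"  # assumed mask token, as used by BERT-style tokenizers
SEP = "[SEP]"    # assumed separator between metadata and article text


def add_metadata(text, year, politics=None, place=None):
    """Prepend metadata fields to the text as plain tokens.

    Hypothetical format: "<year> <politics> <place> [SEP] <text>".
    Each metadata value is assumed to be a single word.
    """
    fields = [str(year)]
    if politics:
        fields.append(politics)
    if place:
        fields.append(place)
    return " ".join(fields) + f" {SEP} " + text


def mask_metadata(example, field_index=0):
    """Hide one metadata field behind the mask token.

    Returns the masked input and the hidden value, which would serve
    as the prediction target in a metadata mask-filling task.
    """
    prefix, body = example.split(f" {SEP} ", 1)
    fields = prefix.split(" ")
    target = fields[field_index]
    fields[field_index] = MASK
    return " ".join(fields) + f" {SEP} " + body, target


example = add_metadata("The railway opened yesterday.", 1855, "liberal", "london")
masked, target = mask_metadata(example, field_index=0)
print(example)
print(masked)
print(target)
```

In a setup like this, a masked language model fine-tuned on such inputs could be asked to recover the hidden year, which is the intuition behind time masking and metadata mask-filling as an evaluation task.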