Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling (2410.03735v2)

Published 30 Sep 2024 in cs.CL and cs.LG

Abstract: Specialist LMs focus on a specific task or domain on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is only available in limited amounts for most tasks. In this work, we build specialist models from large generalist training sets instead. We propose a novel method, ClusteRed Importance SamPling (CRISP). CRISP clusters the generalist dataset and samples from these clusters based on their frequencies in the smaller specialist dataset. It is scalable, suitable for both pretraining and continued pretraining, and works well in multi-task settings. CRISP performs favorably compared to other methods that adjust the training distribution of the generalist data with guidance from the limited domain-specific data. Our findings demonstrate improvements across different domains in terms of language modeling perplexity and accuracy on multiple-choice question tasks. We also present ablation studies that examine the impact of dataset sizes, clustering configurations, and model sizes.
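
The sampling idea in the abstract (draw from generalist clusters in proportion to how often each cluster appears in the specialist set) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of precomputed integer cluster ids, and the choice to drop clusters that have no generalist documents are all assumptions made for illustration. How the clusters are obtained is left outside the sketch.

```python
import random
from collections import Counter, defaultdict


def cluster_frequencies(specialist_cluster_ids):
    """Empirical cluster distribution of the specialist set (hypothetical helper)."""
    counts = Counter(specialist_cluster_ids)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


def sample_generalist(generalist_cluster_ids, specialist_cluster_ids, n_samples, seed=0):
    """Return indices of generalist documents, sampled so that clusters appear
    roughly as often as they do in the specialist set.

    Assumption: both inputs are lists of precomputed cluster ids, one per document.
    """
    rng = random.Random(seed)

    # Group generalist document indices by their cluster id.
    members = defaultdict(list)
    for idx, cluster in enumerate(generalist_cluster_ids):
        members[cluster].append(idx)

    freqs = cluster_frequencies(specialist_cluster_ids)

    # Assumption: specialist clusters with no generalist documents are dropped;
    # random.choices treats the remaining weights as relative, so they are
    # effectively renormalized.
    clusters = [c for c in freqs if c in members]
    if not clusters:
        raise ValueError("no specialist cluster has any generalist documents")
    weights = [freqs[c] for c in clusters]

    # First pick a cluster by specialist frequency, then a document uniformly within it.
    chosen = rng.choices(clusters, weights=weights, k=n_samples)
    return [rng.choice(members[c]) for c in chosen]


if __name__ == "__main__":
    generalist = [0, 0, 1, 1, 2, 2]   # cluster id of each generalist document
    specialist = [1, 1, 1, 2]         # cluster id of each specialist document
    picked = sample_generalist(generalist, specialist, n_samples=10)
    print([generalist[i] for i in picked])
```

In this toy setup, cluster 0 never appears in the specialist set, so its generalist documents are never drawn, while cluster 1 is drawn about three times as often as cluster 2.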
