Need a Small Specialized Language Model? Plan Early! (2402.01093v2)

Published 2 Feb 2024 in cs.LG and cs.CL

Abstract: LLMs are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a specialized domain. This paper explores how to get good specialized small LLMs using a large, generic, pretraining set and a limited amount of specialized data. We consider two scenarios, depending on whether (i) one can afford pretraining a model for each specialization task, or (ii) one wants to cheaply adapt a single pretrained model for each task. In the first scenario, we propose an effective solution based on importance sampling: we resample the pretraining set to imitate the specialization data and train a small model on it. In the second scenario, we propose a novel architecture, projected networks (PN). PN is a large network whose parameters can be linearly projected into a small network for specialization. For both scenarios, we demonstrate the empirical effectiveness of our solutions across various domains, training set sizes, and training budgets.

Summary

The paper demonstrates that importance sampling (SLM-is) yields the lowest perplexity under limited pretraining budgets.
The paper finds that hyper-networks (SLM-hn) and mixtures of experts (SLM-mix) achieve better perplexity when a large pretraining budget is available.
The study offers practical guidance for deploying specialized language models effectively within constrained computational budgets.

Introduction

Specialized language models (LMs) are designed to understand and generate text in a specific domain or task at relatively low computational cost, which makes them practical for real-world applications with limited budgets. The paper considers several cost axes: the pretraining budget (training before the target domain is known), the specialization budget (training after the target domain is identified), the inference budget, and the size of the in-domain dataset. In this context, researchers at Apple Inc. introduced an analytical approach for evaluating different LM training methods under constrained computational budgets.

Training Cost and Methods

The paper presents a systematic comparison of several approaches to training specialized LMs. The approaches are designated by acronyms: SLM (small language model), LLM (large language model), SLM-is (importance sampling), SLM-mix (mixture of experts), and SLM-hn (hyper-networks). Which approach works best depends on the available pretraining and specialization budgets.

Large Pretraining Budget: With an ample pretraining budget, hyper-networks and mixtures of experts achieve the best perplexity. A hyper-network (SLM-hn) uses an overarching network to generate the parameters of a smaller network (expert), conditioned on the characteristics of the document's cluster. A mixture of experts (SLM-mix) instead partitions the pretraining dataset into clusters and trains a dedicated small model (expert) for each cluster.
Limited Pretraining Budget: When the pretraining budget is restricted, small models trained on importance-sampled datasets (SLM-is) are preferable. Importance sampling resamples the pretraining dataset so that it closely mimics the domain-specific data distribution, which attunes the trained model to the domain at a lower specialization cost.
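The resampling idea behind SLM-is can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes every document has already been assigned a cluster id (for example by clustering document embeddings), and the function names `importance_weights` and `resample` are hypothetical. Each generic document is weighted by the ratio of its cluster's frequency in the specialization data to its frequency in the generic data, and the pretraining set is then resampled with those weights.

```python
import random
from collections import Counter

def importance_weights(generic_clusters, specialized_clusters):
    """Per-cluster weight: specialization frequency / generic frequency."""
    generic_counts = Counter(generic_clusters)
    target_counts = Counter(specialized_clusters)
    n_generic = len(generic_clusters)
    n_target = len(specialized_clusters)
    weights = {}
    for cluster, count in generic_counts.items():
        p_generic = count / n_generic
        p_target = target_counts.get(cluster, 0) / n_target
        weights[cluster] = p_target / p_generic
    return weights

def resample(generic_docs, generic_clusters, specialized_clusters, n_samples, seed=0):
    """Draw generic documents (with replacement) to mimic the specialization data."""
    weights = importance_weights(generic_clusters, specialized_clusters)
    doc_weights = [weights[c] for c in generic_clusters]
    rng = random.Random(seed)
    return rng.choices(generic_docs, weights=doc_weights, k=n_samples)

# Toy data (hypothetical): ten generic documents in three clusters,
# and a small specialization set concentrated in cluster 1.
generic_docs = [f"doc{i}" for i in range(10)]
generic_clusters = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
specialized_clusters = [1, 1, 1, 2]

sample = resample(generic_docs, generic_clusters, specialized_clusters, n_samples=10_000)
cluster_of = dict(zip(generic_docs, generic_clusters))
share_cluster_1 = sum(cluster_of[d] == 1 for d in sample) / len(sample)
```

In this toy setup, cluster 0 never appears in the specialization data, so its documents receive zero weight, and roughly three quarters of the resampled documents come from cluster 1, matching its share of the specialization set. A small model would then be pretrained on the resampled set.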

Experimental Findings

A substantial portion of the paper is devoted to empirical results. The primary evaluation metric is perplexity, a measure of how well a model predicts the next token (lower is better). The results show a clear trend: as the size of the domain-specific dataset increases, the advantage of starting from domain-oriented pretraining (SLM-hn or SLM-mix) diminishes. Most notably, the importance sampling method (SLM-is) performs strongly, even outperforming larger models when domain data is scarce.
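For reference, perplexity is conventionally defined as the exponential of the average negative log-likelihood per token. The sketch below follows that standard definition; the function name `perplexity` and the toy inputs are illustrative and do not come from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities assigned by a model."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every observed token is, on average,
# as uncertain as a uniform choice among four tokens.
uniform_over_four = perplexity([math.log(0.25)] * 4)
```

Here `uniform_over_four` is 4 up to floating-point error, which is why lower perplexity indicates a model that concentrates more probability on the tokens that actually occur.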

Conclusions

In summary, when the specialization datasets are small, importance sampling is the most effective technique and achieves the lowest perplexity. If the budget allows for larger pretraining costs, however, investing in hyper-networks and mixtures of experts is advantageous, because they have a high parameter count during pretraining but an efficient, smaller size at inference. The paper also notes that distillation, a popular method for compressing a large model into a smaller one while retaining much of its capability, did not stand out in this setting.
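The contrast between pretraining-time and inference-time size can be made concrete for the mixture-of-experts case with a little arithmetic. The sketch below is only bookkeeping under stated assumptions: it assumes a single expert is selected and served for the target domain, and the class name `ClusterMixture`, the 16 clusters, and the 100M parameters per expert are hypothetical values chosen for illustration, not figures from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterMixture:
    """Hypothetical bookkeeping for a mixture of small per-cluster experts."""
    num_clusters: int
    params_per_expert: int

    def pretraining_params(self) -> int:
        # Every cluster trains its own expert, so all experts count.
        return self.num_clusters * self.params_per_expert

    def inference_params(self) -> int:
        # Assumption: only the one selected expert is served for the target domain.
        return self.params_per_expert

mixture = ClusterMixture(num_clusters=16, params_per_expert=100_000_000)
print(mixture.pretraining_params())  # 1600000000
print(mixture.inference_params())    # 100000000
```

Under these assumptions, the parameters trained during pretraining grow with the number of clusters, while the model served at inference stays the size of one small expert.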

The implications of this paper are twofold. First, it offers practical guidance for organizations that want to deploy specialized LMs within their computational budget constraints. Second, it contributes to the ongoing conversation in the machine learning community about how to balance training costs against inference budgets and performance for specialized models. Future research directions include extending the evaluation to larger models and additional domains, considering downstream applications, and refining hyper-networks with more diverse conditioning inputs.