This study by Apple Inc. compares specialized language model training approaches for computational efficiency.
It explores the trade-offs between pre-training and specialization costs for methods such as hyper-networks, mixtures of experts, and importance sampling.
With ample budgets, hyper-networks and mixtures of experts show better performance, while importance sampling is superior when pre-training budgets are limited.
For small domain-specific datasets, importance sampling achieves the best performance in terms of perplexity.
The research guides organizations in selecting LM training methods within budget limits and discusses how to balance training cost, inference cost, and model performance.
Specialized language models (LMs) are designed to understand and generate text in a specific domain or task at relatively low computational cost, which makes them practical for real-world applications with limited budgets. Research has explored several axes along which computational cost can be traded off: pretraining (training done before the target domain is known), specialization training (training done after the target domain is identified), inference, and the size of the in-domain dataset. In this context, Apple Inc. researchers introduced an analytical approach to evaluate different LM training methods under constrained computational budgets.
The study presents a systematic comparison of various specialized LM training approaches, designated by the acronyms SLM (Small Language Model), LLM (Large Language Model), SLM-is (Importance Sampling), SLM-mix (Mixture of Experts), and SLM-hn (Hyper-Networks). Each approach has its advantages depending on the available pretraining and specialization budgets.
Large Pretraining Budget: For ample pretraining budgets, hyper-network and mixture-of-experts strategies achieve the best perplexity. A hyper-network (SLM-hn) uses an overarching network that generates the parameters of a smaller network (the expert) according to the characteristics of the document's cluster. The mixture of experts (SLM-mix), by contrast, partitions the pretraining dataset into clusters and trains a dedicated small model (expert) for each cluster.
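The cluster-based routing shared by both approaches can be illustrated with a toy sketch. This is only an assumption-laden illustration, not the study's implementation: the function names (`nearest_centroid`, `partition_by_cluster`, `hyper_network`), the two-dimensional document vectors, and the linear hyper-network are all hypothetical stand-ins, and real systems would use learned embeddings and neural networks.

```python
import math

def nearest_centroid(vec, centroids):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

def partition_by_cluster(doc_vectors, centroids):
    """Group document indices by nearest centroid.

    In an SLM-mix style setup, each group would train its own small expert.
    """
    shards = {i: [] for i in range(len(centroids))}
    for doc_id, vec in enumerate(doc_vectors):
        shards[nearest_centroid(vec, centroids)].append(doc_id)
    return shards

def hyper_network(cluster_embedding, weight, bias):
    """Toy linear hyper-network (hypothetical): maps a cluster embedding
    to a flat parameter vector for a small model."""
    return [
        sum(w * x for w, x in zip(row, cluster_embedding)) + b
        for row, b in zip(weight, bias)
    ]

# Toy data: two cluster centroids and four document vectors (made up).
centroids = [[0.0, 0.0], [10.0, 10.0]]
doc_vectors = [[0.5, -0.2], [9.0, 11.0], [1.0, 1.0], [10.5, 9.5]]

# SLM-mix style: partition the pretraining documents, one expert per shard.
shards = partition_by_cluster(doc_vectors, centroids)

# At inference, a query is routed to a single expert by its cluster.
expert_for_query = nearest_centroid([8.0, 9.0], centroids)

# SLM-hn style: the cluster embedding is turned into small-model parameters.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
params = hyper_network(centroids[expert_for_query], weight, bias)
```

In both variants only one small model is active per query, which is why the inference cost stays low even though the total number of parameters trained during pretraining is large.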
Limited Pretraining Budget: When the pretraining budget is restricted, small models trained on importance-sampled datasets (SLM-is) are preferable. Importance sampling resamples the training dataset so that its distribution closely mimics that of the domain-specific data, which attunes the trained model to the domain at a lower specialization cost.
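A minimal sketch of how such resampling might work, assuming every document has already been assigned a cluster id, is shown here. The cluster-ratio weighting, the helper names, and the toy data are illustrative assumptions rather than details confirmed by this summary.

```python
import random
from collections import Counter

def cluster_histogram(cluster_ids, num_clusters):
    """Fraction of documents that fall in each cluster."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return [counts.get(c, 0) / total for c in range(num_clusters)]

def importance_weights(generic_clusters, domain_clusters, num_clusters):
    """Weight each generic document by p_domain(cluster) / p_generic(cluster)."""
    p_generic = cluster_histogram(generic_clusters, num_clusters)
    p_domain = cluster_histogram(domain_clusters, num_clusters)
    return [
        p_domain[c] / p_generic[c] if p_generic[c] > 0 else 0.0
        for c in generic_clusters
    ]

def resample(generic_docs, weights, k, seed=0):
    """Draw k generic documents with replacement, proportionally to weights."""
    rng = random.Random(seed)
    return rng.choices(generic_docs, weights=weights, k=k)

# Toy data (made up): 8 generic documents and 4 in-domain documents.
generic_docs = ["g0", "g1", "g2", "g3", "g4", "g5", "g6", "g7"]
generic_clusters = [0, 0, 0, 0, 1, 1, 2, 2]
domain_clusters = [1, 1, 1, 2]

weights = importance_weights(generic_clusters, domain_clusters, num_clusters=3)
sample = resample(generic_docs, weights, k=20)
```

Here cluster 0 never appears in the domain data, so its generic documents receive weight zero and are never drawn, while cluster 1 documents are drawn three times as often as cluster 2 documents.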
A substantial portion of the research focuses on empirical results. Perplexity, a measure of how well a model predicts the next word (lower is better), serves as the primary evaluation metric. The researchers observed a notable trend: as the size of the domain-specific dataset increases, the advantage of starting from a domain-oriented pretraining (SLM-hn or SLM-mix) lessens. Most notably, the importance sampling method (SLM-is) achieves remarkable performance, even besting larger models when domain data is scarce.
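For readers unfamiliar with the metric, perplexity is conventionally computed as the exponential of the average negative log-probability that the model assigns to each token. The sketch below uses that standard definition; the function name and the toy probabilities are illustrative only.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Computed as exp(-mean(log p)); lower values mean better predictions.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that spreads probability uniformly over 4 candidate tokens
# has a perplexity of about 4.
uniform_ppl = perplexity([math.log(0.25)] * 3)

# A model that gives each correct token probability 0.9 scores much lower.
confident_ppl = perplexity([math.log(0.9)] * 3)
```

A perplexity of about 4 can be read as the model being as uncertain as if it were choosing uniformly among four tokens at every step.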
In conclusion, when the specialization datasets are small, importance sampling is the most effective technique and achieves the lowest perplexity. However, if the budget can sustain larger pretraining costs, investing in hyper-networks and mixtures of experts is advantageous, because they have a high parameter count during pretraining but a small, efficient size at inference. The study points out that distillation, a popular method for compressing large models into smaller ones while retaining much of their capacity, did not stand out in the settings studied.
The implications of this study are twofold. First, it offers practical guidance for organizations that want to deploy specialized language models within their computational budget constraints. Second, it contributes to the ongoing conversation in the machine learning community about how to balance training costs against inference budgets and performance for specialized models. Future research directions include extending the evaluation to larger models and additional domains, considering downstream applications, and refining hyper-networks with more diverse conditioning inputs.