Abstract
The paper introduces a novel framework termed "model arithmetic" for tailoring LLMs to generate text with specific characteristics such as vocabulary, style, or tone, without retraining or specialized datasets. Model arithmetic composes multiple models and attributes into a single formula that modifies the LLM's token distribution at inference time. The method also subsumes prior controlled text generation (CTG) techniques by expressing them as simple formulas. An open-source implementation of the framework is made available.
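Concretely, such a formula linearly combines the token-level log-probabilities of several (prompted) models and renormalizes the result. The attribute names and weights below are our own illustration of the general form, not the paper's exact running example:

    P(x_t \mid x_{<t}) \propto \exp\big( \lambda_1 \log P_{\text{child}}(x_t \mid x_{<t}) - \lambda_2 \log P_{\text{adult}}(x_t \mid x_{<t}) + \lambda_3 \log P_{\text{magic}}(x_t \mid x_{<t}) \big)

Sampling then proceeds token by token from the renormalized composite distribution.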
Introduction
LLM customization is essential for applications that serve diverse audiences. The traditional techniques, prompting and fine-tuning, are respectively limited in the precision of control they offer or demanding in data and compute. Model arithmetic addresses both issues with an intuitive method for merging multiple LLMs and attribute models into composite models that precisely govern the attributes of the generated text. The methodology also subsumes many prior CTG techniques, expressing them within its formula-based system and thereby broadening both the scope and the precision of CTG.
Fine-Grained Control via Model Arithmetic
Model arithmetic provides a systematic way to finely control generated text by combining models that each reflect a different attribute. The paper illustrates the framework's flexibility with a running example that assembles models responsible for attributes such as "child," "adult," and "magic," together with a classifier for "formality." Each component's influence on the output can be weighted individually, a degree of control that direct prompting and fine-tuning cannot match; a minimal sketch of such a combination follows.
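The sketch below, in Python with PyTorch, shows one way such a weighted combination of token distributions could look. The function, attribute names, and weights are illustrative assumptions for exposition, not the paper's open-source API:

    import torch
    import torch.nn.functional as F

    def combine_token_distributions(logits: dict, weights: dict) -> torch.Tensor:
        """Linearly combine per-token log-probabilities from several
        attribute-conditioned models into one composite distribution.
        `logits` maps attribute names to [vocab_size] logit tensors."""
        combined = torch.zeros_like(next(iter(logits.values())))
        for name, weight in weights.items():
            # A negative weight steers generation away from that attribute.
            combined = combined + weight * F.log_softmax(logits[name], dim=-1)
        return F.log_softmax(combined, dim=-1)  # renormalize

    # Illustrative weighting: favor "child" and "magic", penalize "adult",
    # and nudge the output toward a "formality" classifier's token scores.
    weights = {"child": 1.0, "magic": 0.6, "adult": -0.4, "formal": 0.3}

At each decoding step, every component model is evaluated on the same context, their distributions are combined as above, and the next token is sampled from the result.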
Efficient Model Arithmetic via Generalized Speculative Sampling
A general challenge with CTG is increased inference time, since multiple models must be evaluated at every decoding step. Model arithmetic mitigates this through generalized speculative sampling, an extension of standard speculative sampling, a latency-reduction technique in which a cheap model drafts tokens that a more expensive model then validates. The generalized variant postpones the more expensive model calls within an arithmetic formula, validating tokens proposed from the cheaper terms. The result is efficient execution of composite formulas with only marginal overhead, even when several models are involved; the acceptance step is sketched below.
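The accept/resample step underlying speculative sampling can be sketched as follows. In the generalized version, p_cheap would come from the formula terms evaluated so far and p_full from the formula once the postponed, expensive terms are added in; the names and structure here are our illustration, not the paper's implementation:

    import torch

    def speculative_accept(token: int, p_cheap: torch.Tensor,
                           p_full: torch.Tensor) -> int:
        """Validate a token drawn from the cheap distribution against the
        full one; both arguments are [vocab_size] probability vectors."""
        # Accept the proposed token with probability min(1, p_full / p_cheap).
        if torch.rand(()) < torch.clamp(p_full[token] / p_cheap[token], max=1.0):
            return token
        # On rejection, resample from the normalized residual
        # max(0, p_full - p_cheap); the combined procedure is then an exact
        # sample from p_full, so no generation quality is lost.
        residual = torch.clamp(p_full - p_cheap, min=0.0)
        return int(torch.multinomial(residual / residual.sum(), 1))

Because the cheap terms usually approximate the full formula well, most proposed tokens are accepted and the expensive models need to be evaluated on only a fraction of the decoding steps.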
Evaluation
Empirical evaluations show that model arithmetic controls attributes more effectively than existing CTG methods while remaining expressive, particularly on the task of reducing toxicity in generated text. Speculative sampling further yields substantial computational savings, reducing model calls by up to 64%. Quantitative results across several tasks and comparisons against state-of-the-art methods confirm that this fine-grained control comes without a loss of fluency.