
Tuning Language Models by Proxy (2401.08565v4)

Published 16 Jan 2024 in cs.CL

Abstract: Despite the general capabilities of large pretrained LLMs, they consistently benefit from further adaptation to better achieve desired behaviors. However, tuning these models has become increasingly resource-intensive, or impossible when model weights are private. We introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of black-box LMs to achieve the same end as direct tuning, but by accessing only its predictions over the output vocabulary, not its parameters. Our method tunes a smaller LM, then applies the difference between the predictions of the small tuned and untuned LMs to shift the original predictions of the larger untuned model in the direction of tuning, while retaining the benefits of larger-scale pretraining. In experiments, when we apply proxy-tuning to Llama2-70B using proxies of only 7B size, we can close 88% of the gap between Llama2-70B and its truly-tuned chat version, when evaluated across knowledge, reasoning, and safety benchmarks. We then demonstrate the generality of proxy-tuning by applying it to domain adaptation on code, and task-specific finetuning on question-answering and math problems. Finally, we show how to proxy-tune a truly black-box LM, GPT-3.5, for temporal adaptation, increasing its knowledge about recent events. Our work demonstrates the promise of using small tuned LMs to efficiently customize large, potentially proprietary LMs through decoding-time guidance.

Overview of Proxy-Tuning

Proxy-tuning is a methodology for adapting LLMs to specific tasks or domains without directly modifying model parameters. Traditional fine-tuning has become resource-intensive, and it is impossible for proprietary models whose weights are inaccessible. Proxy-tuning instead adapts the prediction behavior of a large LM through a lightweight decoding-time algorithm that leverages smaller, more manageable models known as "proxies".

Methodology

The technique uses a small tuned model (the "expert") and its untuned counterpart (the "anti-expert") to adjust the prediction logits of a larger base LM. During decoding, for every output token, the base model's logits are shifted by the difference between the expert's and anti-expert's logits, steering generation in the direction of tuning. Experiments show that proxy-tuning can substantially close the performance gap between a base LM and a directly tuned version, indicating that the large model gains the benefits of fine-tuning while preserving the knowledge acquired during pretraining.
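The per-token logit arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names and toy vocabulary-sized vectors are invented for clarity, and real use requires a shared tokenizer across the three models plus actual forward passes at each decoding step.

```python
import math

def proxy_tuned_logits(base, expert, antiexpert):
    # Decoding-time shift from the paper's formulation:
    # s(x) = s_base(x) + s_expert(x) - s_antiexpert(x), per vocabulary token.
    return [b + e - a for b, e, a in zip(base, expert, antiexpert)]

def softmax(logits):
    # Numerically stable softmax to turn shifted logits into a distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [x / z for x in exps]

# Toy logits over a 4-token vocabulary (illustrative values only).
base       = [2.0, 1.0, 0.5, 0.25]   # large untuned model
expert     = [0.5, 3.0, 0.25, 0.125] # small tuned model
antiexpert = [1.5, 0.5, 0.25, 0.125] # small untuned model

shifted = proxy_tuned_logits(base, expert, antiexpert)  # → [1.0, 3.5, 0.5, 0.25]
probs = softmax(shifted)
next_token = max(range(len(shifted)), key=shifted.__getitem__)  # → 1
```

Note how the expert/anti-expert difference, not the expert alone, drives the shift: token 1 wins because tuning raised its logit relative to the untuned proxy, while the base model's large-scale pretraining knowledge still contributes through its own logits.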

Experimental Findings

Proxy-tuning has shown strong results across several settings. On instruction-following benchmarks, it closed up to 91% of the performance gap between base and directly tuned models. For domain adaptation on coding tasks, it yielded up to a 32% absolute improvement over base models, and for task-specific tuning on question-answering and math problems, an average absolute improvement of 31%. These outcomes indicate that proxy-tuning is not only effective but also a practical substitute for direct fine-tuning when compute budgets or model access are constrained.

Implications

This method is particularly advantageous for adapting large proprietary LMs to user-specific needs when only output probabilities are available. It demonstrates the promise of efficient, effective customization without direct modification of model parameters. Proxy-tuning may also preserve learned knowledge better than direct fine-tuning, which can be invasive and risks forgetting previously acquired information. Its resource efficiency and adaptability make it a compelling alternative to conventional tuning methods and open new avenues for leveraging LLMs across applications.

Authors (6)
  1. Alisa Liu
  2. Xiaochuang Han
  3. Yizhong Wang
  4. Yulia Tsvetkov
  5. Yejin Choi
  6. Noah A. Smith
Citations (32)