
Representation Tuning (2409.06927v4)

Published 11 Sep 2024 in cs.LG and cs.CL

Abstract: Activation engineering is becoming increasingly popular as a means of online control of LLMs. In this work, we extend the idea of inference-time steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, we identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, we demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, we show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss ("representation tuning"). Finally, we compare the generations in response to honesty-probing prompts from the resulting models to those from models fine-tuned with a token-based loss alone, and to those from the untuned model subjected to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity plus token loss showed a stronger effect than online steering, and generalized better than using the standard loss, suggesting the potential utility of this approach as a safety measure. Code and data are available at https://github.com/cma1114/representation_tuning. Tuned models are available at https://huggingface.co/collections/cackerman/representation-tuning-66da1e5ab41cd1b824687d9f.
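
As a concrete illustration of the online steering described in the abstract, the sketch below adds a multiple of a precomputed honesty vector to the residual stream during generation via a forward hook. It assumes a Hugging Face Llama-2 checkpoint; the file name `honesty_vector.pt`, the layer index, and the coefficient are illustrative stand-ins, not the paper's exact settings.

```python
# Minimal sketch of inference-time activation steering; the steered
# layer and coefficient are illustrative, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

honesty_vec = torch.load("honesty_vector.pt")  # hypothetical precomputed direction
layer_idx, coeff = 14, 4.0                     # illustrative choices

def steer(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # add a multiple of the behavioral direction to the residual stream.
    hidden = output[0]
    hidden = hidden + coeff * honesty_vec.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(steer)
ids = tokenizer("Is it ever acceptable to lie?", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Negating `coeff` steers the model in the opposite direction, which is how the abstract's "more or less honest" contrast is produced.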

Summary

  • The paper demonstrates that fine-tuning activation vectors into a model, using a combined cosine-similarity and token-based loss, durably enhances targeted behaviors such as honesty.
  • Related work achieves robust latent representation tuning via modality translation and fusion modules, maintaining performance even when an input modality is missing.
  • Parameter-efficient methods such as Representation Editing and Visual Prompt Tuning reduce trainable parameters while achieving competitive improvements across tasks.

"Representation Tuning" refers to methods for optimally adjusting the internal representations of neural networks in order to achieve desired outcomes. This topic covers a range of techniques in machine learning and deep learning, often focused on improving model performance, efficiency, and alignment with specific tasks or behaviors.

A key paper in this area explores tuning activation vectors in LLMs to elicit specific behavioral traits, such as honesty. By fine-tuning these vectors into the model using a combination of cosine similarity and a conventional token-based loss, the authors demonstrate enhanced control over the model's output. This method, termed "representation tuning," has been shown to outperform standard online steering and suggests practical safety applications for LLMs (2409.06927). A rough sketch of the combined objective follows.
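
This minimal sketch assumes the behavioral vector targets a single residual-stream layer; the layer index, the weighting `alpha`, and the variable names are illustrative rather than taken from the paper.

```python
# Hedged sketch of the dual loss: standard next-token cross-entropy plus a
# term pushing residual activations toward a target behavioral direction.
import torch
import torch.nn.functional as F

def representation_tuning_loss(logits, labels, hidden_states, target_vec,
                               layer_idx=14, alpha=1.0):
    # Standard token-based loss on shifted labels.
    token_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    # Encourage activations at the chosen layer to align with the
    # behavioral vector (maximize cosine similarity).
    acts = hidden_states[layer_idx]                        # (batch, seq, hidden)
    cos = F.cosine_similarity(acts, target_vec.view(1, 1, -1), dim=-1)
    cos_loss = (1.0 - cos).mean()
    return token_loss + alpha * cos_loss
```

The `alpha` term trades off fluency (the token loss) against how strongly activations are pulled toward the behavioral direction.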

Another significant approach is robust latent representation tuning, particularly in multimodal scenarios where one modality might be missing. By introducing a modality latent translation module and a fusion module, the method ensures robust performance even when one input type (e.g., image or text) is absent. This technique involves keeping the foundational models frozen to retain pre-trained capabilities while focusing on the interaction and correlation between different modalities (2406.06048).
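
The following is a hedged sketch of that idea: each modality's latent can be translated into the other's space so a missing input is imputed before fusion. Module shapes, names, and the fusion design here are assumptions, not the paper's exact architecture.

```python
# Sketch of latent translation plus fusion with a possibly missing modality.
import torch
import torch.nn as nn

class RobustFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.img_to_txt = nn.Linear(dim, dim)   # modality latent translation
        self.txt_to_img = nn.Linear(dim, dim)
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())

    def forward(self, img_lat=None, txt_lat=None):
        # Impute whichever modality is absent from the one that is present.
        if img_lat is None:
            img_lat = self.txt_to_img(txt_lat)
        if txt_lat is None:
            txt_lat = self.img_to_txt(img_lat)
        return self.fuse(torch.cat([img_lat, txt_lat], dim=-1))
```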

Parameter Efficient Fine-Tuning (PEFT) methods such as Representation Editing (RED) demonstrate the importance of tuning internal representations efficiently. RED operates by scaling and biasing the representation at each layer, thus greatly reducing the number of trainable parameters compared to traditional fine-tuning techniques, and has shown superior performance across various models and tasks (2402.15179).
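
A minimal sketch of RED-style editing under these assumptions: the base weights are frozen, and only a per-layer scale and bias on the hidden representation are trained. The exact placement of the edit within each layer is an assumption here.

```python
# Sketch of representation editing: learn only a scale and bias per layer.
import torch
import torch.nn as nn

class EditedLayer(nn.Module):
    def __init__(self, base_layer, hidden_dim):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad = False          # backbone stays frozen
        self.scale = nn.Parameter(torch.ones(hidden_dim))
        self.bias = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x):
        h = self.base(x)
        return h * self.scale + self.bias    # 2 * hidden_dim trainable params
```

Only `2 * hidden_dim` parameters per layer are trained, which is the source of the large reduction relative to full fine-tuning.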

Visual Prompt Tuning (VPT) is another innovative approach within representation tuning, specifically designed for vision transformers. By introducing minimal trainable parameters in the input space and keeping the large model backbone frozen, VPT achieves significant improvements in downstream tasks, competing effectively even with full fine-tuning approaches (2203.12119).
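
A hedged sketch of the shallow variant of this idea: learnable prompt tokens are prepended to the patch-embedding sequence of a frozen ViT. The dimensions and the assumption that the backbone accepts precomputed embeddings are illustrative.

```python
# Sketch of VPT-shallow: trainable prompt tokens, frozen ViT backbone.
import torch
import torch.nn as nn

class VisualPromptedViT(nn.Module):
    def __init__(self, vit, num_prompts=10, dim=768):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():
            p.requires_grad = False          # keep the backbone frozen
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, patch_embeds):         # (batch, seq, dim)
        b = patch_embeds.size(0)
        tokens = torch.cat([self.prompts.expand(b, -1, -1), patch_embeds], dim=1)
        return self.vit(tokens)              # assumes vit accepts embeddings
```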

Overall, representation tuning encompasses diverse strategies aimed at optimizing the internal state of models to improve performance, efficiency, and adaptability to specific tasks or constraints. These methods present promising avenues for enhancing the capabilities of large-scale models while maintaining computational efficiency.
