
Model Editing with Canonical Examples

Published Feb 9, 2024 in cs.CL


We introduce model editing with canonical examples, a setting in which (1) a single learning example is provided per desired behavior, (2) evaluation is performed exclusively out-of-distribution, and (3) deviation from an initial model is strictly limited. A canonical example is a simple instance of good behavior (e.g., The capital of Mauritius is Port Louis) or bad behavior (e.g., An aspect of researchers is coldhearted). The evaluation set contains more complex examples of each behavior (like a paragraph in which the capital of Mauritius is called for). We create three datasets and modify three more for model editing with canonical examples, covering knowledge-intensive improvements, social bias mitigation, and syntactic edge cases. In our experiments on Pythia language models, we find that LoRA outperforms full finetuning and MEMIT. We then turn to the Backpack language model architecture because it is intended to enable targeted improvement. The Backpack defines a large bank of sense vectors--a decomposition of the different uses of each word--which are weighted and summed to form the output logits of the model. We propose sense finetuning, which selects and finetunes a few ($\approx$ 10) sense vectors for each canonical example, and find that it outperforms other finetuning methods, e.g., 4.8% improvement vs 0.3%. Finally, we improve GPT-J-6B by an inference-time ensemble with just the changes from sense finetuning of a 35x smaller Backpack, in one setting outperforming editing GPT-J itself (4.1% vs 1.0%).


  • Language models are enhanced by learning from 'canonical examples', which are minimalist instances that exemplify desired or undesired behaviors, aimed at refining model performance without major deviations from original behavior.

  • Each canonical example comes with a loss function that indicates the direction of the desired change; the edited model must generalize from these simple instances to more complex evaluation examples while staying close to the initial model.

  • Extensive experiments with Pythia and Backpack models showed that 'sense finetuning', which focuses on a few sense vectors for each canonical example, significantly outperforms other methods, including full finetuning and MEMIT.

  • This approach can be applied to larger models like GPT-J-6B through an inference-time ensemble that combines logits from a pretrained and sense-finetuned smaller model, enhancing the larger model without direct alteration.

Model Editing with Canonical Examples

Language models offer capabilities ranging from simple text generation to complex reasoning. However, they also have flaws: they harbor social biases, propagate incorrect information, and struggle with syntactic edge cases. Addressing these issues traditionally involves either retraining the entire model, which is computationally expensive, or applying patches that can have unintended consequences. This summary covers an alternative approach: model editing with canonical examples, which refines a model from a single simple example per behavior while strictly limiting deviation from the original model.

Canonical Examples and Their Significance

The approach is built on "canonical examples": single, simple instances that exemplify a desired or undesired behavior. Each example serves as the only training signal for that behavior. The goal is to improve the model on more complex inputs that call for the same behavior while limiting how far the model moves from its initial state, so that it keeps its broad capabilities while the specific issue is corrected.

Canonical examples are coupled with a loss function that indicates the preferred direction of the modification, and success is measured by the model's performance on an evaluation set that is out-of-distribution relative to the training examples. This setup requires the model to generalize from the canonical instances to more intricate scenarios without deviating significantly from the initial model.
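
The setup above can be read as a constrained selection problem: an edit only counts if it stays close to the initial model. The following is a minimal sketch of that idea, not the paper's evaluation code. The field names, the use of loss on general text as the deviation measure, and the tolerance `epsilon` are all assumptions made for illustration.

```python
def pick_edit(candidates, base_general_loss, epsilon):
    """Pick the best edit among those that stay close to the initial model.

    candidates: list of dicts with hypothetical keys
        'name'          - label for the edited model
        'general_loss'  - loss of the edited model on general text
        'target_score'  - score on the out-of-distribution evaluation set
    base_general_loss: loss of the initial model on the same general text
    epsilon: assumed tolerance on the allowed increase in general loss
    """
    # Discard any edit that deviates too far from the initial model.
    allowed = [
        c for c in candidates
        if c["general_loss"] - base_general_loss <= epsilon
    ]
    if not allowed:
        return None
    # Among the remaining edits, prefer the one that generalizes best.
    return max(allowed, key=lambda c: c["target_score"])
```

Under this reading, an edit that scores very well on the evaluation set but degrades general behavior beyond the tolerance is rejected outright rather than traded off.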

Empirical Evaluations and Findings

Experiments on Pythia language models evaluate how well existing finetuning and editing methods work in this setting. Among the methods tested, LoRA (Low-Rank Adaptation) outperformed both full finetuning and MEMIT, a dedicated model editing technique. The authors then turn to the Backpack language model architecture, which defines a large bank of sense vectors (a decomposition of the different uses of each word) that are weighted and summed to form the model's output logits. They propose "sense finetuning", which selects and finetunes a few (~10) sense vectors for each canonical example, and report that it outperforms other finetuning methods, e.g., a 4.8% improvement vs. 0.3%.
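
As a rough illustration of the select-then-finetune idea, the toy sketch below picks a handful of sense vectors and applies a gradient step only to those, leaving every other vector untouched. It is not the authors' implementation: the selection rule (largest weight), the helper names, and the plain gradient step are assumptions chosen to keep the example small.

```python
def select_senses(weights, k):
    """Return indices of the k sense vectors with the largest weights.

    `weights` is a hypothetical per-sense importance score for one
    canonical example; the actual selection criterion may differ.
    """
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]


def masked_update(senses, grads, selected, lr):
    """Apply one gradient step to the selected sense vectors only."""
    chosen = set(selected)
    updated = []
    for i, (vec, grad) in enumerate(zip(senses, grads)):
        if i in chosen:
            # Finetuned sense vector: take a gradient step.
            updated.append([v - lr * g for v, g in zip(vec, grad)])
        else:
            # Frozen sense vector: copy unchanged.
            updated.append(list(vec))
    return updated
```

Because only the selected vectors change, the edit touches a very small part of the model, which is one way to keep deviation from the initial model limited.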

Applying Sense Finetuning Enhancements to Larger Models

The authors also transfer the improvements from sense finetuning a small Backpack model to a much larger existing model, GPT-J-6B. They use an inference-time ensemble that adds to GPT-J only the changes produced by sense finetuning a 35x smaller Backpack, that is, the difference between the sense-finetuned and pretrained Backpack, so the larger model is improved without its weights being modified. In one setting, this ensemble outperformed editing GPT-J itself (4.1% vs. 1.0%), which suggests that small, editable models can be used to correct much larger ones.
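
One plausible way to realize such an ensemble is to add the difference between the finetuned and pretrained small-model log-probabilities to the large model's log-probabilities and renormalize. The sketch below assumes this formulation, and the scaling factor `beta` and the function names are hypothetical. It operates on plain lists of logits over a shared vocabulary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def ensemble_log_probs(large_logits, small_pre_logits, small_ft_logits, beta=1.0):
    """Shift the large model by the change the small model learned.

    combined = log p_large + beta * (log p_small_ft - log p_small_pre),
    followed by renormalization. `beta` is an assumed scaling factor.
    """
    lp_large = log_softmax(large_logits)
    lp_pre = log_softmax(small_pre_logits)
    lp_ft = log_softmax(small_ft_logits)
    combined = [
        l + beta * (f - p)
        for l, f, p in zip(lp_large, lp_ft, lp_pre)
    ]
    return log_softmax(combined)
```

When the finetuned and pretrained small models agree, the difference term is zero and the large model's distribution is returned unchanged, so the ensemble only alters predictions where finetuning actually changed the small model.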

Theoretical Implications and Future Directions

The study has several theoretical implications for model architecture and the pursuit of model editability. The success of sense finetuning underscores the utility of incorporating architectural features that facilitate targeted improvement post hoc. This suggests a fruitful direction for future research: designing models not just for performance but also for their amenability to precise, post-training corrections.


Model editing with canonical examples is a promising way to fix specific deficiencies in language models without comprehensive retraining. By learning from a single simple example per behavior and using techniques such as sense finetuning, it is possible to make targeted improvements while keeping the model close to its original behavior. The results also point toward designing models that are easier to correct after training.
