Assessing Knowledge Editing in LLMs via Multi-Hop Questions
The paper under discussion presents a comprehensive evaluation framework designed to address shortcomings in current methodologies for knowledge editing in LLMs. These models store vast amounts of factual knowledge in their parameters, yet that knowledge becomes outdated quickly. The traditional remedy, retraining, is often infeasible, so techniques for directly amending specific facts in a model have emerged. Current evaluation paradigms for such edits are insufficient, however: they emphasize recall of the edited fact itself without considering the cascading implications of the edit.
This paper introduces a benchmark called MQuAKE (Multi-hop Question Answering for Knowledge Editing), which tests the effects of knowledge edits on multi-hop questions: questions that require chaining multiple factual assertions within the model to derive an answer. For instance, if the model's internal knowledge of who the UK Prime Minister is gets edited, cascading changes should follow, so that related queries (e.g., about the Prime Minister's spouse) are also answered consistently with the edit.
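To make the chaining concrete, here is a minimal Python sketch (not taken from the paper; the fact triples and helper names are illustrative) of how a single edited fact should propagate through a two-hop question.

```python
# Illustrative example (not from the paper): a toy knowledge base of
# (subject, relation) -> object facts, and a two-hop query that chains them.
# Editing the first hop should change the answer to the chained question.

facts = {
    ("United Kingdom", "head of government"): "Boris Johnson",
    ("Boris Johnson", "spouse"): "Carrie Johnson",
    ("Rishi Sunak", "spouse"): "Akshata Murty",
}

def answer_two_hop(kb, subject, rel1, rel2):
    """Answer 'What is the <rel2> of the <rel1> of <subject>?' by chaining two facts."""
    intermediate = kb[(subject, rel1)]
    return kb[(intermediate, rel2)]

# Before the edit, the chained answer follows the original first hop.
print(answer_two_hop(facts, "United Kingdom", "head of government", "spouse"))
# -> Carrie Johnson

# Apply a single-fact edit; a consistently edited model should now also
# change its answer to the multi-hop question.
facts[("United Kingdom", "head of government")] = "Rishi Sunak"
print(answer_two_hop(facts, "United Kingdom", "head of government", "spouse"))
# -> Akshata Murty
```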
The authors critically examine existing editing techniques, concluding that while these techniques perform adequately on direct fact-recall tests, they falter drastically on multi-hop questions. They propose an alternative approach named MeLLo, which stores edited facts in an external memory and iteratively prompts the LLM to decompose the question, check each tentative answer against the relevant stored edits, and revise it when they conflict, producing answers consistent with the new edits. MeLLo outperforms previous methodologies by a significant margin, especially when applied to very large models (e.g., up to 175B parameters).
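The paper presents MeLLo as a prompting-based method layered on a frozen LLM; the following Python sketch captures the general shape of such a memory-augmented loop. All names (`llm.decompose_and_answer`, `retriever.most_similar`, etc.) are illustrative placeholders, not the authors' API.

```python
# Illustrative sketch of a MeLLo-style loop (not the authors' code).
# Placeholders: `llm` generates text, `retriever` scores stored edits.

def answer_with_edits(question, edited_facts, llm, retriever, max_hops=4):
    """Answer a multi-hop question consistently with an external memory of edits."""
    context = question
    tentative = None
    for _ in range(max_hops):
        # 1. Ask the model for the next subquestion and a tentative answer.
        subq, tentative = llm.decompose_and_answer(context)
        if subq is None:  # the model signals that the final answer is ready
            return tentative
        # 2. Retrieve the edited fact most relevant to the subquestion.
        fact = retriever.most_similar(subq, edited_facts)
        # 3. If the retrieved edit contradicts the tentative answer, adopt the
        #    edited fact instead of the model's parametric answer.
        if fact is not None and llm.contradicts(fact, tentative):
            tentative = llm.answer_given_fact(subq, fact)
        # 4. Append the resolved subquestion/answer and continue.
        context += f"\nSubquestion: {subq}\nAnswer: {tentative}"
    return tentative
```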
The Experimental Landscape
The benchmark comprises two distinct datasets: one containing counterfactual edits (MQuAKE-CF) and the other featuring real-world temporal knowledge updates (MQuAKE-T). For the counterfactual dataset, chains of facts are sampled from Wikidata and filtered with a GPT-J model so that only facts the base model can already recall are retained. Questions are then generated automatically to produce diverse yet semantically correct phrasings for each chain of facts. The real-world dataset, by contrast, consists of genuine Wikidata updates, meant to assess the models against actual changes occurring over time.
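A sketch of the recallability filter described above might look like the following (illustrative Python, not the authors' pipeline; the model interface and prompt templates are assumptions).

```python
# Illustrative sketch (not the authors' pipeline): keep only fact chains whose
# every hop the base model can already recall, so that post-edit multi-hop
# failures cannot be blamed on missing pre-edit knowledge.

def recalls(model, subject, relation, obj, template):
    """Return True if the model's completion of a relation prompt mentions the object."""
    prompt = template.format(subject=subject)  # e.g. "The head of government of {subject} is"
    completion = model.generate(prompt, max_new_tokens=10)
    return obj.lower() in completion.lower()

def filter_chains(model, chains, templates):
    """Keep chains of (subject, relation, object) triples that are fully recallable."""
    return [
        chain for chain in chains
        if all(recalls(model, s, r, o, templates[r]) for (s, r, o) in chain)
    ]
```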
The editing methods assessed include gradient-based fine-tuning, ROME (which localizes and rewrites factual associations in specific layers), MEND (which trains auxiliary networks to turn fine-tuning gradients into targeted parameter edits), and MEMIT (which extends ROME to batches of edits spread across multiple layers). These techniques reveal substantial disparities when scrutinized under the lens of multi-hop evaluation: edit-wise success rates are high, yet multi-hop accuracy drops sharply, highlighting the limitations of existing techniques.
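The gap between the two measurements can be made precise with a sketch like the following (illustrative Python; the model interface and data fields are assumptions, not taken from the paper): edit-wise success probes each edited fact in isolation, while multi-hop accuracy requires the chained question to be answered consistently with all edits.

```python
# Illustrative metric sketch (not the authors' code; `model.query`, `e.prompt`,
# `inst.multi_hop_questions`, etc. are assumed field names).

def edit_wise_success(model, edits):
    """Fraction of individual edits the model reproduces when queried directly."""
    hits = sum(model.query(e.prompt) == e.new_object for e in edits)
    return hits / len(edits)

def multi_hop_accuracy(model, instances):
    """Fraction of instances where some phrasing of the multi-hop question
    yields the answer implied by the edited fact chain."""
    hits = sum(
        any(model.query(q) == inst.new_answer for q in inst.multi_hop_questions)
        for inst in instances
    )
    return hits / len(instances)
```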
Theoretical and Practical Implications
The findings elucidate critical limitations of current knowledge-editing techniques for LLMs. The negative results on multi-hop tests indicate that existing editing methods fall short of integrating new information into the broader semantic structure of the model; they are, in essence, patchwork solutions that fail to carry edited facts coherently across related queries. MeLLo, by contrast, with its external memory of edits consulted during answering, aligns more closely with the intricate, interdependent nature of factual knowledge and provides a more robust solution.
The implications for future research are manifold. Practically, there is a pressing need for methods that let LLMs integrate edited facts into their semantic networks seamlessly enough to stay consistent across related queries. Theoretically, these insights call for a reevaluation of the internal architectures of LLMs, possibly requiring hybrid models that blend static knowledge retention with dynamic, memory-based fact updates.
In conclusion, this paper provides a valuable framework for assessing knowledge editing methodologies and proposes an innovative solution with MeLLo, pushing the boundary towards realizing fully adaptable LLMs. The paper calls on the academic community to further investigate strategies for enabling LLMs to retain consistency and coherence in the face of evolving knowledge landscapes.