
Propagation and Pitfalls: Reasoning-based Assessment of Knowledge Editing through Counterfactual Tasks

Published 31 Jan 2024 in cs.CL, cs.AI, cs.LG, and stat.ME | (2401.17585v1)

Abstract: Current approaches to knowledge editing struggle to effectively propagate updates to interconnected facts. In this work, we delve into the barriers that hinder the appropriate propagation of updated knowledge within these models for accurate reasoning. To support our analysis, we introduce a novel reasoning-based benchmark -- ReCoE (Reasoning-based Counterfactual Editing dataset) -- which covers six common real-world reasoning schemes. We conduct a thorough analysis of existing knowledge editing techniques, including input augmentation, finetuning, and locate-and-edit. We find that all model editing methods show notably low performance on this dataset, especially within certain reasoning schemes. Our analysis of the chain-of-thought generations of edited models further uncovers key reasons behind the inadequacy of existing knowledge editing methods from a reasoning standpoint, spanning fact-wise editing, fact recall ability, and coherence in generation. We will make our benchmark publicly available.

References (30)
  1. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. arXiv preprint arXiv:2004.12651.
  2. Can we edit multimodal large language models? arXiv preprint arXiv:2310.08475.
  3. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696.
  4. Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164.
  5. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314.
  6. Calibrating factual knowledge in pretrained language models. arXiv preprint arXiv:2210.03329.
  7. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.
  8. T-rex: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  9. Beyond iid: three levels of generalization for question answering on knowledge bases. In Proceedings of the Web Conference 2021, pages 3477–3488.
  10. Do language models have beliefs? methods for detecting, updating, and visualizing model beliefs. arXiv preprint arXiv:2111.13654.
  11. Meta-learning online adaptation of language models. arXiv preprint arXiv:2305.15076.
  12. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.
  13. Freebaseqa: A new factoid qa data set matching trivia-style question-answer pairs with freebase. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 318–323.
  14. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
  15. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372.
  16. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229.
  17. Fast model editing at scale. arXiv preprint arXiv:2110.11309.
  18. Memory-based model editing at scale. In International Conference on Machine Learning, pages 15817–15831. PMLR.
  19. Entity cloze by date: What LMs know about unseen entities. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 693–702, Seattle, United States. Association for Computational Linguistics.
  20. Can lms learn new entities from descriptions? challenges in propagating injected knowledge. arXiv preprint arXiv:2305.01651.
  21. Yuval Pinter and Michael Elhadad. 2023. Emptying the ocean with a spoon: Should we edit models? arXiv preprint arXiv:2310.11958.
  22. Vipula Rawte, Amit Sheth, and Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922.
  23. Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. arXiv preprint arXiv:1803.06643.
  24. Freshllms: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214.
  25. Knowledge editing for large language models: A survey. arXiv preprint arXiv:2310.16218.
  26. How far can camels go? exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751.
  27. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  28. A comprehensive study of knowledge editing for large language models. arXiv preprint arXiv:2401.01286.
  29. Mquake: Assessing knowledge editing in language models via multi-hop questions. arXiv preprint arXiv:2305.14795.
  30. Modifying memories in transformer models. arXiv preprint arXiv:2012.00363.

Summary

  • The paper presents ReCoE, a benchmark that evaluates how effectively edited facts propagate through reasoning in LLMs.
  • It compares methods like input-augmentation, QLoRA, and MEMIT, showing that current approaches largely fail in coherent multi-step reasoning.
  • Findings emphasize the need for improved generalization and logical coherence in knowledge editing to support dynamic factual updates.

Reasoning-Based Assessment of Knowledge Editing: Insights from ReCoE

Introduction

The paper "Propagation and Pitfalls: Reasoning-based Assessment of Knowledge Editing through Counterfactual Tasks" (2401.17585) presents a comprehensive evaluation of knowledge editing methods in LLMs, focusing on their ability to propagate updates to interconnected facts and support coherent reasoning. The authors introduce ReCoE, a novel benchmark designed to assess counterfactual knowledge editing across six reasoning schemes, and systematically analyze the limitations of current editing approaches, including input-augmentation, finetuning (QLoRA), and locate-and-edit (MEMIT).

Motivation and Problem Statement

While LLMs encode vast factual knowledge, their ability to update and propagate new information remains limited, especially when reasoning over interconnected facts. Existing editing methods often succeed at direct fact recall but fail to support multi-step reasoning with edited knowledge, as illustrated in reasoning-based assessments.

Figure 1: Reasoning-based assessment reveals that existing methods answer edited facts but fail at reasoning with them.

The paper identifies three critical competencies for effective knowledge propagation post-editing: (1) fact-wise editing effectiveness, (2) fact recall accuracy, and (3) logical coherence in generation. These dimensions are essential for robust knowledge editing in real-world applications.

ReCoE Benchmark: Construction and Characteristics

ReCoE is constructed using a hybrid-synthetic approach, combining existing QA datasets and LLM-assisted data synthesis. It covers six reasoning schemes: superlative, comparative, sorting, counting, aggregation, and subtraction. Each datapoint includes a question, answer, supporting facts, counterfactual answer, and counterfactual facts, enabling rigorous evaluation of knowledge propagation.

Figure 2: Step-by-step demonstration of ReCoE dataset construction, including data sourcing, generation, and counterfactual creation.
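
The datapoint structure described above can be sketched as a JSON-like record. The field names and the example facts below are illustrative assumptions, not the released ReCoE schema:

```python
# Illustrative sketch of a ReCoE-style datapoint (field names and
# values are assumptions for demonstration, not the released schema).
datapoint = {
    "reasoning_scheme": "comparative",
    "question": "Which river is longer, the Nile or the Amazon?",
    "answer": "the Nile",
    "supporting_facts": [
        "The Nile is about 6,650 km long.",
        "The Amazon is about 6,400 km long.",
    ],
    # The counterfactual edit changes one fact, which flips the comparison.
    "counterfactual_answer": "the Amazon",
    "counterfactual_facts": [
        "The Nile is about 6,650 km long.",
        "The Amazon is about 7,000 km long.",
    ],
}

# An edited model should produce the counterfactual answer only if it
# propagates the new length fact through the comparison step.
print(datapoint["counterfactual_answer"])
```

Note how the supporting facts are free-form sentences rather than subject-relation-object triplets; this is the OpenIE-style representation the paper contrasts with MQuAKE below.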

Unlike prior benchmarks that rely on synthetic or triplet-based fact representations, ReCoE employs OpenIE-style facts, introducing greater complexity and ambiguity, which better reflects real-world scenarios.

Figure 3: Comparison of fact representations in MQuAKE (triplet-based) and ReCoE (OpenIE-style), highlighting increased complexity in ReCoE.

Experimental Setup

The authors evaluate three representative knowledge editing methods on the Tülu series (Llama-based instruction-tuned models):

  • Input-augmentation: Appends counterfactual facts to the prompt at inference time, serving as an approximate upper bound for editing performance.
  • Finetuning (QLoRA): Parameter-efficient finetuning on new facts.
  • Locate-and-edit (MEMIT): Directly edits feedforward modules in transformer layers to insert new facts.
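
The input-augmentation baseline can be sketched as simple prompt construction. The template below is an assumption for illustration; the paper's exact prompt format may differ:

```python
def build_augmented_prompt(question, counterfactual_facts):
    """Prepend counterfactual facts to the question at inference time.

    Sketch of the input-augmentation baseline; the exact template used
    in the paper is an assumption here.
    """
    facts = "\n".join(f"- {f}" for f in counterfactual_facts)
    return (
        "Assume the following facts are true:\n"
        f"{facts}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_augmented_prompt(
    "Which river is longer, the Nile or the Amazon?",
    ["The Amazon is about 7,000 km long.",
     "The Nile is about 6,650 km long."],
)
print(prompt)
```

Because the edited facts sit directly in context, this baseline bypasses parametric editing entirely, which is why it serves as an upper bound for the parameter-modifying methods.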

Performance is measured using the correct_flip metric (percentage of predictions that transition from the original to the counterfactual answer) and is further analyzed via chain-of-thought (CoT) prompting.
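
The correct_flip metric can be sketched as follows. Exact-match comparison is an assumption; the paper's answer-matching rules may be more lenient:

```python
def correct_flip(predictions, original_answers, counterfactual_answers):
    """Fraction of examples whose prediction moved from the original
    answer to the counterfactual one after editing.

    Sketch of the correct_flip metric; exact-match comparison is an
    assumption, and the paper's matching rules may differ.
    """
    flips = sum(
        pred == cf and pred != orig
        for pred, orig, cf in zip(predictions,
                                  original_answers,
                                  counterfactual_answers)
    )
    return flips / len(predictions)

preds = ["the Amazon", "the Nile", "the Amazon"]
origs = ["the Nile"] * 3
cfs = ["the Amazon"] * 3
print(correct_flip(preds, origs, cfs))  # 2 of 3 predictions flipped
```

Requiring the prediction to both match the counterfactual answer and differ from the original rules out cases where the two answers coincide.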

Results and Analysis

Knowledge Probing and Editing Performance

Both model scaling and CoT prompting improve baseline QA accuracy. However, after editing, input-augmentation remains the most effective, while QLoRA and MEMIT exhibit substantial deficits in propagating knowledge for reasoning tasks.

  • Input-augmentation achieves reasonable performance, but struggles with aggregation and subtraction (<50% accuracy).
  • QLoRA shows moderate improvement with CoT and scaling, but overall performance is significantly lower than input-augmentation.
  • MEMIT consistently underperforms, with near-zero accuracy in several reasoning schemes and severe degradation in generation coherence.

Fact-wise Editing Effectiveness

Fact-wise editing is assessed via perplexity over factual and counterfactual sentences. QLoRA demonstrates effective editing (lower PPL for counterfactuals post-edit), while MEMIT increases overall perplexity, indicating ineffective edits.

Figure 4: Fact-wise perplexity comparison before and after editing with QLoRA and MEMIT (7B). QLoRA achieves effective edits; MEMIT does not.
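
The perplexity comparison reduces to exponentiating the average negative log-likelihood of each sentence's tokens. A minimal sketch, with hypothetical per-token log-probabilities standing in for real model outputs:

```python
import math

def perplexity(token_logprobs):
    """Sentence perplexity from per-token log-probabilities (natural log).

    A lower perplexity on a counterfactual sentence after editing
    indicates the edit took hold. The log-prob values below are
    hypothetical, standing in for real model outputs.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical log-probs for a counterfactual sentence before and
# after an effective (QLoRA-style) edit.
before = [-3.2, -2.8, -4.1, -3.5]
after = [-1.1, -0.9, -1.4, -1.2]
assert perplexity(after) < perplexity(before)  # effective edit lowers PPL
print(round(perplexity(before), 2), round(perplexity(after), 2))
```

An ineffective edit, as observed for MEMIT, would show the opposite pattern: perplexity rising on both factual and counterfactual sentences.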

Fact Recall and Consistency

Fact recall is measured by the relatedness and consistency of generated facts in CoT responses. QLoRA maintains reasonable relatedness but low consistency, indicating memorization without generalization. MEMIT further degrades both metrics, especially in complex reasoning schemes.
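
A consistency check of this kind can be approximated with token-overlap F1 between a fact recalled in the CoT response and the injected counterfactual fact. This is a simple proxy only; the paper's relatedness and consistency metrics may be defined differently:

```python
def fact_consistency(generated_fact, target_fact):
    """Token-overlap F1 between a recalled fact and the injected fact.

    A simple proxy for consistency scoring; the paper's actual
    relatedness/consistency metrics may be defined differently.
    """
    gen = generated_fact.lower().split()
    tgt = target_fact.lower().split()
    common = sum(min(gen.count(t), tgt.count(t)) for t in set(gen))
    if common == 0:
        return 0.0
    precision = common / len(gen)
    recall = common / len(tgt)
    return 2 * precision * recall / (precision + recall)

score = fact_consistency(
    "The Amazon is about 7,000 km long",   # fact recalled in CoT
    "The Amazon is roughly 7,000 km long", # injected counterfactual fact
)
print(round(score, 2))  # high overlap, one differing token
```

Under a metric like this, QLoRA's pattern of high relatedness but low consistency would show up as recalled facts that are on-topic yet diverge from the injected counterfactuals.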

Logical Coherence

Coherence of CoT responses is critical for reasoning. QLoRA-edited models show a slight decrease in coherence, while MEMIT-edited models suffer catastrophic loss of coherence, undermining their fundamental language modeling capabilities.

Discussion

QLoRA vs. MEMIT

QLoRA supports effective fact-wise editing and preserves logical coherence but fails to generalize edited knowledge for retrieval. MEMIT is inadequate for real-world factual knowledge, especially with complex subjects and relations, challenging the notion that edited neurons function solely as fact storage.

Model Scaling

Model scaling improves baseline knowledge and input-augmentation performance but does not enhance editing efficacy in QLoRA or MEMIT. Larger models do not exhibit increased factual effectiveness, fact retrieval, or coherence post-editing.

Implications and Future Directions

The findings highlight significant limitations in current knowledge editing methods, particularly in propagating updates for reasoning tasks. The inability to generalize and coherently reason with edited knowledge restricts practical deployment in dynamic knowledge environments. Future research should focus on:

  • Enhancing generalization and retrieval of edited knowledge.
  • Developing editing methods robust to complex, OpenIE-style facts.
  • Integrating richer context during finetuning to improve recall.
  • Addressing catastrophic forgetting in locate-and-edit approaches.

ReCoE provides a challenging benchmark for advancing knowledge editing research, emphasizing the need for methods that support coherent reasoning and robust propagation of updates.

Conclusion

This work introduces ReCoE, a reasoning-based benchmark for evaluating knowledge editing in LLMs, and demonstrates that existing methods fail to propagate updates for coherent reasoning. The analysis reveals critical deficiencies in fact recall and generation coherence, especially for locate-and-edit approaches. These insights establish a foundation for future research aimed at developing more effective and reliable knowledge editing techniques for LLMs.
