Can We Edit Multimodal Large Language Models? (2310.08475v5)

Published 12 Oct 2023 in cs.CL, cs.AI, cs.CV, cs.LG, and cs.MM
Abstract: In this paper, we focus on editing Multimodal LLMs (MLLMs). Compared to editing single-modal LLMs, multimodal model editing is more challenging, which demands a higher level of scrutiny and careful consideration in the editing process. To facilitate research in this area, we construct a new benchmark, dubbed MMEdit, for editing multimodal LLMs and establishing a suite of innovative metrics for evaluation. We conduct comprehensive experiments involving various model editing baselines and analyze the impact of editing different components for multimodal LLMs. Empirically, we notice that previous baselines can implement editing multimodal LLMs to some extent, but the effect is still barely satisfactory, indicating the potential difficulty of this task. We hope that our work can provide the NLP community with insights. Code and dataset are available in https://github.com/zjunlp/EasyEdit.

An Analysis of Multimodal LLM Editing

The paper "Can We Edit Multimodal LLMs?" addresses the burgeoning need to refine and adapt Multimodal LLMs (MLLMs). With the increasing deployment of LLMs, these models must maintain accurate and current knowledge without extensive retraining. Editing MLLMs is inherently complex due to their integration of multiple data modalities. This paper proposes a benchmark, MMEdit, to facilitate research in this domain and evaluates the efficacy of various model editing approaches.

Research Contributions and Methodology

The paper's central contribution is MMEdit, a benchmark specifically designed to evaluate the editing capabilities of MLLMs. MMEdit comprises two tasks: Editing Visual Question Answering (E-VQA) and Editing Image Captioning (E-IC). The researchers built the dataset by collecting entries on which base models underperform, yielding a robust framework for evaluating how efficiently an MLLM's knowledge can be updated.
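To make the setup concrete, here is a minimal sketch of what a single MMEdit-style evaluation record could contain. The field names are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MMEditRecord:
    """Hypothetical shape of one editing example (field names are illustrative)."""
    image_path: str                   # input image (E-VQA or E-IC)
    prompt: str                       # question (E-VQA) or caption request (E-IC)
    wrong_output: str                 # the model's original, incorrect output
    edit_target: str                  # the corrected answer/caption to inject
    rephrased_prompts: List[str] = field(default_factory=list)  # text-generality probes
    rephrased_images: List[str] = field(default_factory=list)   # image-generality probes
    locality_prompt: str = ""         # unrelated text query (T-Locality probe)
    locality_image: str = ""          # unrelated image query (M-Locality probe)
```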

Three metrics are defined to measure the success of an edit: reliability, locality, and generality. Together, they assess whether the model adopts the updated knowledge, avoids unintended side effects on unrelated inputs, and generalizes the edit across rephrased inputs.
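As a rough illustration, all three metrics reduce to agreement rates over model outputs. The sketch below uses toy strings standing in for real model generations; the paper's exact formulations may differ:

```python
def exact_match_rate(outputs, targets):
    """Fraction of generated outputs that exactly match the reference."""
    return sum(o == t for o, t in zip(outputs, targets)) / max(len(targets), 1)

# Toy outputs standing in for real model generations.
edit_targets      = ["a red bus", "two cats"]   # corrected outputs we injected
post_edit_outputs = ["a red bus", "two cats"]   # edited inputs, after editing
rephrase_outputs  = ["a red bus", "a dog"]      # rephrased prompts, after editing
pre_unrelated     = ["blue", "yes"]             # unrelated probes, before editing
post_unrelated    = ["blue", "no"]              # same probes, after editing

reliability = exact_match_rate(post_edit_outputs, edit_targets)  # 1.0
generality  = exact_match_rate(rephrase_outputs, edit_targets)   # 0.5
# Locality compares the edited model against its own pre-edit behavior.
locality    = exact_match_rate(post_unrelated, pre_unrelated)    # 0.5: a side effect
print(reliability, generality, locality)
```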

Experimentation and Results

The researchers conducted extensive experiments on notable MLLMs such as BLIP-2 OPT and MiniGPT-4, evaluating several editing methods, including MEND, Knowledge Editor, SERAC, and In-Context Knowledge Editing, alongside fine-tuning baselines.

Reliability: Dedicated editing methods significantly outperformed the baselines. Notably, In-Context Editing and SERAC achieved high success rates in correcting erroneous outputs, whereas fine-tuning approaches struggled, largely because they fail to adequately capture task-specific multimodal characteristics.
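In-context editing owes much of its reliability to injecting the corrected fact directly into the prompt rather than touching model weights. Below is a minimal sketch of this style of prompting; the template and helper function are illustrative assumptions, not the paper's exact format:

```python
def build_ike_prompt(new_fact: str, demos: list, query: str) -> str:
    """Prepend the corrected fact (plus optional demonstrations) so the frozen
    model conditions on it instead of its stale parametric knowledge."""
    lines = [f"New fact: {new_fact}"]
    for q, a in demos:
        lines.append(f"Q: {q}\nA: {a}")
    lines.append(f"Q: {query}\nA:")
    return "\n".join(lines)

prompt = build_ike_prompt(
    new_fact="The animal in the image is a red panda, not a raccoon.",
    demos=[("What animal is shown in the image?", "A red panda.")],
    query="What animal is shown in the image?",
)
print(prompt)  # fed, together with the image, to the frozen MLLM
```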

Locality: Serious challenges were noted in preserving model stability, especially for the vision module. While most methods preserved textual locality (T-Locality) well, maintaining stability on unrelated multimodal inputs (M-Locality) proved difficult. Memory-based approaches like SERAC showed the most promise but were hampered by inadequate constraints on M-Locality.
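Memory-based editing in the SERAC style keeps the base model frozen and routes queries that fall within an edit's scope to a stored answer (or a small counterfactual model). The schematic below uses crude token overlap as a stand-in for the learned scope classifier, and is not SERAC's actual implementation; an overly permissive classifier firing on unrelated inputs is exactly the kind of M-Locality failure described above:

```python
class MemoryBasedEditor:
    """Schematic SERAC-style router (illustrative, not the real method)."""

    def __init__(self, base_model):
        self.base_model = base_model
        self.memory = []  # list of (edit_query, edit_answer) pairs

    def add_edit(self, query, answer):
        self.memory.append((query, answer))

    def in_scope(self, query, edit_query):
        # Stand-in for a learned scope classifier: Jaccard token overlap.
        q, e = set(query.lower().split()), set(edit_query.lower().split())
        return len(q & e) / max(len(q | e), 1) > 0.5

    def answer(self, query):
        for edit_query, edit_answer in self.memory:
            if self.in_scope(query, edit_query):
                return edit_answer           # serve the edit from memory
        return self.base_model(query)        # untouched base model (locality)

editor = MemoryBasedEditor(base_model=lambda q: "base answer")
editor.add_edit("What animal is shown in the image?", "A red panda.")
print(editor.answer("What animal is shown in the image?"))  # edited answer
print(editor.answer("What is the capital of France?"))      # falls through to base
```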

Generality: Image generalization lagged behind text generalization, a consistent theme across experiments. While memory-enhanced editing methods demonstrated strong generality, their lower locality scores highlighted a key area for future research.
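Part of why image generality lags is that rephrased images are harder to construct and control than rephrased text. One way such image rephrasings can be produced is by regenerating images from their captions with a text-to-image model; the checkpoint and pipeline below are assumptions for illustration, not necessarily the paper's exact setup:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load an off-the-shelf text-to-image model (GPU assumed).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "a red panda sitting on a tree branch"
# Each sample is semantically equivalent but visually different, usable as
# an image-generality probe for the same edit.
rephrased_images = pipe([caption] * 4).images
rephrased_images[0].save("rephrase_0.png")
```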

Implications and Future Directions

The implications of these findings are multi-faceted. Practically, the results underscore the importance of targeted model editing techniques that respect the preservation of broader model knowledge. Theoretically, the paper invites further inquiry into efficient multimodal model editing strategies that account for the inherent complexity of these systems.

Future work could explore editing paradigms that co-edit across modalities, leveraging insights from both visual and textual data to improve model performance. Developing methods with stronger vision-side editing capabilities will also be crucial for addressing current limitations.

In conclusion, this paper sets a foundational tone for subsequent research in MLLM editing, contributing valuable insights and benchmarks to the NLP community. As multimodal models continue to expand in complexity and scope, refining our approaches to knowledge editing will remain a vital frontier in AI research.

Authors: Siyuan Cheng, Bozhong Tian, Qingbin Liu, Xi Chen, Yongheng Wang, Huajun Chen, Ningyu Zhang