MyVLM: Personalizing VLMs for User-Specific Queries (2403.14599v1)

Published 21 Mar 2024 in cs.CV

Abstract: Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability for personalized visual question-answering. Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs.

Personalized Vision-Language Models

The paper introduces MyVLM, a methodology for extending existing vision-language models (VLMs) so that they can handle personalized queries about user-specific concepts. This work addresses a limitation of current VLMs, which hold only generic knowledge and cannot comprehend or integrate individual user contexts. MyVLM targets two main tasks: personalized image captioning and personalized visual question-answering.

Methodology Overview

MyVLM operates without altering the core weights of pretrained VLMs, preserving their innate visual and linguistic capabilities. It employs two primary strategies:

  1. Concept Heads: To recognize personalized content, the approach augments the VLM with external concept heads. These heads are lightweight binary classifiers that detect the presence of a specific user-defined concept in an image. For people, a pretrained face recognition model identifies the individual, while for objects, a linear classifier trained on extracted CLIP embeddings is employed (see the sketch following this list).
  2. Embedding Vectors: MyVLM introduces concept embeddings within the VLM’s intermediate feature space. These vectors guide the LLM to incorporate the personalized concept naturally into its output, aligning it with the provided image input. The optimization leverages a small set of examples, where augmentations and regularization techniques enhance generalization and mitigate context leakage during personalization.
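
To make the object concept head concrete, the following is a minimal sketch of a linear probe trained on precomputed, frozen CLIP image embeddings. The class and function names (ConceptHead, train_concept_head) and the toy data are illustrative assumptions, not the authors' implementation.

```python
# Sketch of an object "concept head": a linear probe over frozen CLIP image
# embeddings. Names and the toy training loop are illustrative, not the paper's code.
import torch
import torch.nn as nn

class ConceptHead(nn.Module):
    """Binary classifier that flags whether a user-specific concept appears in an image."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.linear = nn.Linear(embed_dim, 1)

    def forward(self, clip_embeds: torch.Tensor) -> torch.Tensor:
        # clip_embeds: (batch, embed_dim) precomputed, frozen CLIP image features
        return self.linear(clip_embeds).squeeze(-1)  # logits; > 0 means "concept present"

def train_concept_head(pos_embeds, neg_embeds, epochs=100, lr=1e-3):
    head = ConceptHead(pos_embeds.shape[-1])
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    feats = torch.cat([pos_embeds, neg_embeds])
    labels = torch.cat([torch.ones(len(pos_embeds)), torch.zeros(len(neg_embeds))])
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(head(feats), labels)
        loss.backward()
        opt.step()
    return head

# Toy usage with random stand-ins for precomputed CLIP features.
pos = torch.randn(5, 768)    # a handful of positive images of the concept
neg = torch.randn(100, 768)  # negatives drawn from unrelated images
head = train_concept_head(pos, neg)
print(head(torch.randn(4, 768)) > 0)  # boolean "concept present" predictions
```

At inference time, each concept head acts as a toggle: only when a head fires is the corresponding concept embedding injected into the VLM.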

Experimental Implementation

MyVLM was tested on two prominent VLM architectures, BLIP-2 and LLaVA, demonstrating that the approach transfers across distinct VLM backbones. The personalization pipeline is trained with only a handful of images (3-5) per concept, underscoring its data efficiency and adaptability; a simplified version of this training loop is sketched below.
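
The snippet below is a minimal, runnable sketch of the core training idea under stated assumptions: all VLM weights stay frozen and only a single concept embedding in the model's intermediate feature space receives gradients. TinyCaptioner is a toy stand-in for the frozen BLIP-2/LLaVA stack, and the norm penalty is a generic placeholder for the paper's augmentation and regularization choices.

```python
# Minimal sketch: the pretrained "VLM" is frozen; only one concept embedding is optimized.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Toy stand-in for a frozen VLM: maps [visual tokens ; concept token] to caption logits."""
    def __init__(self, dim=64, vocab=1000, num_tokens=5):
        super().__init__()
        self.proj = nn.Linear(dim * (num_tokens + 1), vocab)

    def forward(self, visual_tokens, concept_token):
        batch = visual_tokens.size(0)
        x = torch.cat([visual_tokens.flatten(1), concept_token.expand(batch, -1)], dim=1)
        return self.proj(x)  # (batch, vocab) logits for the next caption token

dim, vocab = 64, 1000
vlm = TinyCaptioner(dim, vocab)
for p in vlm.parameters():
    p.requires_grad_(False)                          # the pretrained model is never updated

concept = nn.Parameter(torch.randn(1, dim) * 0.02)   # the only trainable tensor
opt = torch.optim.AdamW([concept], lr=1e-2)

# 3-5 training images per concept, each paired with a caption containing the concept
# identifier; random tensors stand in for real visual features and caption token ids.
visual_tokens = torch.randn(4, 5, dim)
target_token_ids = torch.randint(0, vocab, (4,))

for _ in range(200):
    opt.zero_grad()
    logits = vlm(visual_tokens, concept)
    loss = nn.functional.cross_entropy(logits, target_token_ids)
    loss = loss + 0.05 * concept.norm()              # placeholder regularization against context leakage
    loss.backward()
    opt.step()
```

Because only the concept embedding is optimized, personalization is cheap, and the VLM's behavior on unrelated inputs is left untouched.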

Results

The effectiveness of MyVLM is illustrated through quantitative and qualitative evaluations, which show clear improvements over the underlying VLMs in recalling and integrating user-specific concepts within captions. The model reliably incorporates unique concepts, such as individual names or personal objects, into generated captions and answers to visual queries, and it delivers consistent results across the two VLM architectures.

Quantitative Metrics

The model achieves high recall and strong image alignment in captioning tasks, surpassing baseline methods such as keyword-based replacements and LLM-guided interventions. Across both BLIP-2 and LLaVA, MyVLM attains high recall of concept identifiers and improved textual similarity to ground-truth captions.
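
As an illustration, the snippet below computes two metrics of the kind reported: recall of the concept identifier in generated captions and a sentence-embedding similarity against ground-truth captions. The sentence-transformers model name is an arbitrary choice for this sketch, not the paper's exact evaluation setup.

```python
# Sketch of caption evaluation: concept recall and text similarity.
# Assumes the sentence-transformers package; the model name below is illustrative.
from sentence_transformers import SentenceTransformer, util

def concept_recall(captions, identifier):
    """Fraction of generated captions that mention the concept identifier."""
    return sum(identifier.lower() in c.lower() for c in captions) / len(captions)

def caption_similarity(generated, references, model_name="all-MiniLM-L6-v2"):
    """Mean cosine similarity between paired generated and ground-truth captions."""
    model = SentenceTransformer(model_name)
    gen_emb = model.encode(generated, convert_to_tensor=True)
    ref_emb = model.encode(references, convert_to_tensor=True)
    return util.cos_sim(gen_emb, ref_emb).diagonal().mean().item()

captions = ["Fluffy is sleeping on the couch", "A cat sits by the window"]
print(concept_recall(captions, "Fluffy"))  # 0.5
```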

Implications and Future Directions

The approach marks a step toward more personalized and meaningful human-computer interaction with VLMs. By allowing models to understand user-specific contexts, MyVLM can benefit personalized content creation, digital assistance, and more nuanced AI interactions. The use of external heads also keeps the approach scalable: new concepts can be added by training additional heads and embeddings without retraining the underlying VLM.

Future work may explore further optimization of concept embeddings and broader datasets for deeper personalization. Integrating insights from attention mechanisms could also improve robustness against context leakage and ease adaptation to newer VLM architectures. Moreover, ethical considerations concerning privacy and data security should remain focal areas as personalization technology advances.

Overall, MyVLM represents a significant technical step towards individual-centric AI models, providing both methodological contributions and paving the way for further research into adaptive vision-language understanding.

Authors (5)
  1. Yuval Alaluf (22 papers)
  2. Elad Richardson (18 papers)
  3. Sergey Tulyakov (108 papers)
  4. Kfir Aberman (46 papers)
  5. Daniel Cohen-Or (172 papers)