
Understanding Visual Concepts Across Models (2406.07506v1)

Published 11 Jun 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after fine-tuning just a single word embedding. Do models learn similar words for the same concepts (e.g. <orange-cat> = orange + cat)? We conduct a large-scale analysis of three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that new word embeddings are model-specific and non-transferable. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an $\epsilon$-ball of any prior embedding that generate, detect, and classify an arbitrary concept. When these new embeddings are spliced into other models, the fine-tuning tailored to the original model is lost. We show that popular soft prompt-tuning approaches find these perturbative solutions when applied to visual concept learning tasks, and that embeddings for visual concepts are not transferable. Code for reproducing our work is available at: https://visual-words.github.io.
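
To make the $\epsilon$-ball claim concrete, below is a minimal PyTorch-style sketch of what fine-tuning a single word embedding under such a constraint could look like: projected gradient descent that keeps the new embedding within an L2 ball around an arbitrary prior token's embedding. This is not the authors' code; `concept_loss`, the embedding dimension, the radius, and all hyperparameters are hypothetical placeholders standing in for the real multimodal task objective.

```python
import torch
import torch.nn.functional as F

def project_to_epsilon_ball(embedding: torch.Tensor,
                            anchor: torch.Tensor,
                            epsilon: float) -> torch.Tensor:
    """Project `embedding` into the L2 ball of radius `epsilon` around `anchor`."""
    delta = embedding - anchor
    norm = delta.norm()
    if norm > epsilon:
        delta = delta * (epsilon / norm)
    return anchor + delta

# Hypothetical stand-in for the real task loss (text-to-image generation,
# open-set detection, or zero-shot classification); not the paper's objective.
target = torch.randn(768)
def concept_loss(e: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(e, target)

# Anchor at the embedding of some prior token (e.g. "cat"), then fine-tune
# a single new word embedding while constraining it to the epsilon-ball.
anchor = torch.randn(768)               # placeholder prior-token embedding
embedding = anchor.clone().requires_grad_(True)
optimizer = torch.optim.Adam([embedding], lr=1e-3)
epsilon = 0.1                           # hypothetical perturbation radius

for step in range(1000):
    optimizer.zero_grad()
    loss = concept_loss(embedding)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        embedding.copy_(project_to_epsilon_ball(embedding, anchor, epsilon))
```

The projection after each optimizer step is what keeps the learned embedding an $\epsilon$-perturbation of the anchor rather than a free-floating new vector, which mirrors the perturbative solutions the abstract says soft prompt-tuning tends to find.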

Authors (4)
  1. Brandon Trabucco (13 papers)
  2. Max Gurinas (3 papers)
  3. Kyle Doherty (3 papers)
  4. Ruslan Salakhutdinov (248 papers)

