Multimodal Neurons in Pretrained Text-Only Transformers (2308.01544v2)
Abstract: Language models demonstrate a remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. Outputs of the projection layer are not immediately decodable into language describing image content; instead, we find that translation between modalities occurs deeper within the transformer. We introduce a procedure for identifying "multimodal neurons" that convert visual representations into corresponding text, and decoding the concepts they inject into the model's residual stream. In a series of experiments, we show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
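A minimal sketch of the setup the abstract describes, not the authors' released code: a frozen text-only transformer receives image features through a single trainable linear projection, prepended as a soft prefix via `inputs_embeds`. Here GPT-2 stands in for the paper's larger frozen language model, and random tensors stand in for a real self-supervised image encoder; the final helper illustrates one common, logit-lens-style way to decode which tokens an MLP neuron writes into the residual stream, which may differ from the paper's exact procedure. Model choice, dimensions, and the `decode_neuron` helper are illustrative assumptions.

```python
# Sketch (assumptions noted above): frozen LM + trainable linear projection,
# plus a logit-lens-style read-out of a single MLP neuron's output direction.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()   # stand-in for the frozen text transformer
tok = GPT2Tokenizer.from_pretrained("gpt2")
for p in lm.parameters():                              # the language model stays frozen
    p.requires_grad_(False)

D_VIS, N_PATCHES, D_LM = 768, 16, lm.config.n_embd     # assumed visual feature size / patch count


class VisualPrefixLM(nn.Module):
    """Image patch features -> linear projection -> soft prompt for the frozen LM."""

    def __init__(self, lm, d_vis, d_lm):
        super().__init__()
        self.lm = lm
        self.proj = nn.Linear(d_vis, d_lm)              # the only trainable component

    def forward(self, patch_feats, input_ids, labels=None):
        vis_embeds = self.proj(patch_feats)             # (B, P, D_LM)
        txt_embeds = self.lm.transformer.wte(input_ids) # (B, T, D_LM)
        embeds = torch.cat([vis_embeds, txt_embeds], dim=1)
        if labels is not None:
            # no language-modeling loss on the visual prefix positions
            prefix_ignore = torch.full(patch_feats.shape[:2], -100,
                                       dtype=torch.long, device=labels.device)
            labels = torch.cat([prefix_ignore, labels], dim=1)
        return self.lm(inputs_embeds=embeds, labels=labels)


model = VisualPrefixLM(lm, D_VIS, D_LM)

# Dummy captioning step: random "patch features" stand in for a real image encoder.
feats = torch.randn(1, N_PATCHES, D_VIS)
ids = tok("a photo of a dog on the beach", return_tensors="pt").input_ids
out = model(feats, ids, labels=ids)
print("caption loss:", out.loss.item())


def decode_neuron(lm, tok, layer, unit, k=10):
    """Project one MLP neuron's output-weight vector through the unembedding matrix."""
    w_out = lm.transformer.h[layer].mlp.c_proj.weight[unit]  # (D_LM,) direction this unit writes
    logits = lm.lm_head.weight @ w_out                       # (vocab,)
    return [tok.decode([i]) for i in logits.topk(k).indices.tolist()]


print(decode_neuron(lm, tok, layer=6, unit=123))
```

In practice one would replace the random patch features with a real self-supervised encoder (e.g. a ViT-style model) and train only `proj` on an image-to-text objective, keeping both the encoder and the language model frozen, as described in the abstract.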
- Sarah Schwettmann
- Neil Chowdhury
- Samuel Klein
- David Bau
- Antonio Torralba