Multimodal Neurons in Pretrained Text-Only Transformers (2308.01544v2)

Published 3 Aug 2023 in cs.CV and cs.CL

Abstract: LLMs demonstrate remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. Outputs of the projection layer are not immediately decodable into language describing image content; instead, we find that translation between modalities occurs deeper within the transformer. We introduce a procedure for identifying "multimodal neurons" that convert visual representations into corresponding text, and decoding the concepts they inject into the model's residual stream. In a series of experiments, we show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
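To make the setup concrete, below is a minimal sketch of the pipeline the abstract describes: image features from a self-supervised visual encoder pass through a single learned linear projection into a frozen language model, and a candidate "multimodal neuron" in an MLP layer is decoded into vocabulary space via the unembedding matrix. This is not the authors' released code; the dimensions, module names, and the activation-times-weight-norm scoring rule are illustrative assumptions (the paper uses a gradient-based attribution score).

import torch
import torch.nn as nn

d_img, d_model, d_mlp, vocab = 1024, 768, 3072, 50257   # assumed sizes

# Patch features from a self-supervised visual encoder (placeholder tensor).
image_features = torch.randn(1, 196, d_img)

# The single trainable linear projection into the LM's embedding space;
# its outputs become soft prompts prepended to the frozen transformer.
project = nn.Linear(d_img, d_model)
soft_prompts = project(image_features)

# One frozen MLP block: neuron i reads with W_in[:, i] and writes W_out[i, :]
# into the residual stream. Random weights stand in for the pretrained LM.
W_in = torch.randn(d_model, d_mlp)
W_out = torch.randn(d_mlp, d_model)
unembed = torch.randn(d_model, vocab)                    # LM head / unembedding

hidden = soft_prompts.mean(dim=1)                        # stand-in residual-stream state
activations = torch.relu(hidden @ W_in)                  # per-neuron activations, shape (1, d_mlp)

# Assumed proxy score: activation magnitude times output-weight norm,
# a rough stand-in for a neuron's contribution on this image.
scores = activations.squeeze(0) * W_out.norm(dim=1)
top_neuron = scores.argmax().item()

# Decode what that neuron injects: project its output weights into
# vocabulary space ("logit lens" style) and read off the promoted tokens.
logits = W_out[top_neuron] @ unembed
top_tokens = logits.topk(5).indices.tolist()
print(f"neuron {top_neuron} promotes token ids {top_tokens}")

In the paper's actual setting the projected image embeddings are not directly decodable into captions; it is neurons deeper in the frozen transformer, found by ranking with an attribution score like the proxy above, whose decoded tokens describe the image content.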

Authors (5)
  1. Sarah Schwettmann (12 papers)
  2. Neil Chowdhury (7 papers)
  3. Samuel Klein (15 papers)
  4. David Bau (62 papers)
  5. Antonio Torralba (178 papers)
Citations (23)