Early fusion of text into visual encoders
Develop practical and effective mechanisms for fusing natural language into the feature-extraction layers of visual encoders (such as the Vision Transformers used in multimodal vision–language models), so that language can influence visual processing during inference rather than being combined only through late-fusion mechanisms.
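As a minimal sketch of what such a mechanism could look like (all names and design choices here are illustrative assumptions, not taken from the referenced paper), one option is to let visual tokens cross-attend to text-token embeddings inside an encoder layer, so the text modulates visual features during extraction rather than after it:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text, Wq, Wk, Wv):
    """Single-head cross-attention: visual tokens query text tokens.

    visual: (Nv, d) visual token features
    text:   (Nt, d) text token embeddings
    """
    q = visual @ Wq          # queries from the visual stream
    k = text @ Wk            # keys from the language stream
    v = text @ Wv            # values from the language stream
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v          # (Nv, d): text-conditioned update

def early_fusion_layer(visual, text, params):
    """Hypothetical early-fusion step inside a visual encoder layer:
    inject language via a residual cross-attention update, before the
    layer's usual self-attention / MLP would run."""
    Wq, Wk, Wv = params
    return visual + cross_attention(visual, text, Wq, Wk, Wv)

# Toy usage with random features
rng = np.random.default_rng(0)
d = 8
visual = rng.standard_normal((4, d))   # 4 visual patch tokens
text = rng.standard_normal((3, d))     # 3 text tokens
params = tuple(rng.standard_normal((d, d)) for _ in range(3))
fused = early_fusion_layer(visual, text, params)
```

By contrast, late fusion would run the visual encoder to completion unconditionally and only then combine its output with text features; in the sketch above the visual representation itself is already text-dependent at this layer.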
References
It is not clear how to early-fuse text into the visual encoding process, and it is far easier to late-fuse the two modalities (for example, as in Multimodal LLMs (MLLMs)).
— Steerable Visual Representations
(2604.02327 - Ruthardt et al., 2 Apr 2026) in Section 1 (Introduction)