Early fusion of text into visual encoders

Develop practical, effective mechanisms for early fusion of natural language into the feature-extraction layers of visual encoders (such as the Vision Transformers used in multimodal vision–language models), so that language can influence visual processing during inference rather than being combined only through late-fusion mechanisms.

Background

The paper contrasts early vision–language fusion with the predominant late-fusion paradigm in multimodal systems, noting that most existing architectures encode images and text separately and only combine them after visual feature extraction.
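The late-fusion pattern described above can be sketched minimally as follows. The encoders here are hypothetical stand-ins (a real system would use a ViT and a text transformer); the point is only that each modality is encoded independently and the features meet only after visual feature extraction is complete.

```python
import torch
import torch.nn as nn

# Stand-in unimodal encoders (hypothetical; real systems would use a
# pretrained ViT and a text transformer).
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
text_encoder = nn.Embedding(1000, 256)

def late_fusion(image, token_ids):
    # Each modality is encoded separately; text has no influence on how
    # visual features are computed.
    img_feat = image_encoder(image)                  # (B, 256)
    txt_feat = text_encoder(token_ids).mean(dim=1)   # (B, 256), mean-pooled
    # Features are combined only *after* visual feature extraction.
    return torch.cat([img_feat, txt_feat], dim=-1)   # (B, 512)

image = torch.randn(4, 3, 32, 32)
token_ids = torch.randint(0, 1000, (4, 12))
fused = late_fusion(image, token_ids)
print(fused.shape)  # torch.Size([4, 512])
```

Early fusion, by contrast, would let the text representation participate inside `image_encoder` itself, which is the gap the paper targets.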

The authors argue that early fusion is architecturally desirable because it allows text to steer the visual encoding process, yet acknowledge that the community lacks clear methods for doing so, largely due to the prevalence of unimodal text pretraining and the practical ease of late fusion. The proposed SteerViT aims to address this gap by injecting text via cross-attention into frozen ViT blocks.
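One way such an injection could look is sketched below, assuming the general recipe the paper names (cross-attention from visual tokens to text tokens, wrapped residually around a frozen ViT block). All module names, dimensions, and the zero-initialized gate are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TextCrossAttentionAdapter(nn.Module):
    """Hypothetical adapter: visual tokens attend to text tokens via
    cross-attention, added residually so it can sit alongside a frozen
    ViT block without modifying its weights."""

    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.text_proj = nn.Linear(text_dim, dim)  # map text into visual width
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: at initialization the frozen ViT's
        # behavior is exactly preserved, and text influence is learned.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual_tokens, text_tokens):
        q = self.norm(visual_tokens)            # queries: visual tokens
        kv = self.text_proj(text_tokens)        # keys/values: text tokens
        attended, _ = self.cross_attn(q, kv, kv)
        return visual_tokens + self.gate * attended

# Toy usage: batch of 2, 197 patch tokens (dim 768), 12 text tokens (dim 512).
adapter = TextCrossAttentionAdapter(dim=768, text_dim=512)
v = torch.randn(2, 197, 768)
t = torch.randn(2, 12, 512)
out = adapter(v, t)
print(out.shape)  # torch.Size([2, 197, 768])
# With the zero-init gate, the output equals the input at initialization.
print(torch.allclose(out, v))  # True
```

The zero-initialized gate is one common way to insert a new pathway into a frozen backbone without perturbing its pretrained behavior at the start of training.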

References

It is not clear how to early-fuse text into the visual encoding process, and it is far easier to late-fuse the two modalities, for example as in multimodal LLMs (MLLMs).

Steerable Visual Representations  (2604.02327 - Ruthardt et al., 2 Apr 2026) in Section 1 (Introduction)