Necessity of Early Fusion When Using Source–Target Attention in Multimodal Learning
Determine whether early fusion via layer-wise concatenation is necessary to obtain a joint representation when a source–target attention (STA, i.e., cross-attention) Transformer encoder already models the relationships between the modalities. The setting is a multimodal Yelp rating-prediction model that combines BERT-derived textual review embeddings with tabular variables (user profiles and location attributes).
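To make the two designs under comparison concrete, below is a minimal PyTorch sketch of each, assuming a pooled-query formulation. The module names, dimensions, the use of `nn.MultiheadAttention` as a stand-in for the STA encoder, and the single-step concatenation (a simplification of the paper's layer-wise concatenation) are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """STA-only variant (hypothetical sketch): the projected tabular vector
    queries the BERT token embeddings; no concatenation is performed."""

    def __init__(self, d_model=768, n_heads=8, n_tabular=16):
        super().__init__()
        self.tab_proj = nn.Linear(n_tabular, d_model)  # lift tabular vector to d_model
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)              # rating-regression head

    def forward(self, text_emb, tabular):
        # text_emb: (B, seq_len, d_model) BERT token embeddings
        # tabular:  (B, n_tabular) user-profile / location attributes
        query = self.tab_proj(tabular).unsqueeze(1)            # (B, 1, d_model)
        joint, _ = self.cross_attn(query, text_emb, text_emb)  # Q=tabular, K/V=text
        return self.head(joint.squeeze(1))


class EarlyFusionConcat(nn.Module):
    """Early-fusion variant (hypothetical sketch): pooled text and tabular
    features are concatenated before a joint feed-forward encoder."""

    def __init__(self, d_model=768, n_tabular=16, d_hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model + n_tabular, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, text_emb, tabular):
        pooled = text_emb.mean(dim=1)                 # (B, d_model) mean-pooled text
        joint = torch.cat([pooled, tabular], dim=-1)  # concatenation-based fusion
        return self.mlp(joint)


# Usage: both variants map the same inputs to a scalar rating prediction.
B, L, D, T = 4, 128, 768, 16
text = torch.randn(B, L, D)
tab = torch.randn(B, T)
print(CrossAttentionFusion(D, 8, T)(text, tab).shape)  # torch.Size([4, 1])
print(EarlyFusionConcat(D, T)(text, tab).shape)        # torch.Size([4, 1])
```

If the cross-attention variant matches the concatenation variant in accuracy, that would suggest the STA mechanism alone suffices to produce the joint representation; the open question is whether this holds in the paper's setting.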
References
While prior studies have utilized both STA and layer-wise concatenation, it is uncertain whether early fusion is necessary to obtain a joint representation if the STA mechanism alone adequately captures the features of the two modalities.
— An Efficient Multimodal Learning Framework to Comprehend Consumer Preferences Using BERT and Cross-Attention
(Niimi, arXiv:2405.07435, 13 May 2024), Section 3.1 (Architecture)