Necessity of Early Fusion When Using Source–Target Attention in Multimodal Learning

Determine whether early fusion via layer-wise concatenation is necessary to obtain a joint representation when a source–target attention (cross-attention) Transformer encoder is used to model relationships between BERT-derived textual review embeddings and tabular variables (user profiles and location attributes) in a multimodal learning model for Yelp rating prediction.

Background

The paper proposes a context-aware multimodal model that processes textual reviews with BERT and combines them with tabular features (user and location data) using cross-attention (source–target attention) in the output subsystem. Traditional multimodal approaches often use early fusion through layer-wise concatenation to form a joint representation.
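
As a concrete illustration (not the paper's exact implementation), the source–target attention design can be sketched in PyTorch: projected tabular features act as the query and attend over the BERT token embeddings, which serve as keys and values. All dimensions, layer names, and the choice of attention direction below are assumptions made for the sketch.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Source-target (cross) attention between BERT text embeddings and tabular features.

    Illustrative sketch only: dimensions, attention direction, and layer names
    are assumptions, not the paper's exact architecture.
    """
    def __init__(self, text_dim=768, tab_dim=32, n_heads=8):
        super().__init__()
        self.tab_proj = nn.Linear(tab_dim, text_dim)           # embed tabular variables into the BERT space
        self.cross_attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.head = nn.Linear(text_dim, 1)                      # rating regression head

    def forward(self, text_tokens, tab_feats):
        # text_tokens: (batch, seq_len, text_dim) BERT token embeddings of a review
        # tab_feats:   (batch, tab_dim) user profile / location attributes
        query = self.tab_proj(tab_feats).unsqueeze(1)           # (batch, 1, text_dim)
        fused, _ = self.cross_attn(query, text_tokens, text_tokens)
        return self.head(fused.squeeze(1))                      # predicted rating

# smoke test with the assumed shapes
model = CrossAttentionFusion()
tokens, tab = torch.randn(4, 128, 768), torch.randn(4, 32)
print(model(tokens, tab).shape)  # torch.Size([4, 1])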

The authors note that, if the source–target attention mechanism adequately captures cross-modal relationships, it may be unnecessary to also perform early fusion. They therefore adopt a cross-attention Transformer encoder without feature fusion, but explicitly flag uncertainty about whether early fusion is needed when attention already captures modality interactions.
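
For contrast, a minimal early-fusion baseline under the same assumed dimensions would simply concatenate a pooled BERT embedding with the tabular features to form the joint representation; the open question is whether such concatenation is still needed once cross-attention already relates the two modalities. This is likewise an illustrative sketch, not the paper's architecture.

import torch
import torch.nn as nn

class EarlyFusionBaseline(nn.Module):
    """Early fusion by layer-wise concatenation, the design the question asks about.

    Sketch under the same assumed dimensions as the cross-attention example;
    not the paper's implementation.
    """
    def __init__(self, text_dim=768, tab_dim=32, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + tab_dim, hidden_dim),    # joint representation built by concatenation
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),                     # rating regression head
        )

    def forward(self, text_tokens, tab_feats):
        pooled = text_tokens.mean(dim=1)                  # mean-pool BERT token embeddings
        joint = torch.cat([pooled, tab_feats], dim=-1)    # early fusion: concatenate the modalities
        return self.mlp(joint)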

References

While prior studies have utilized both STA and layer-wise concatenation, if the STA mechanism adequately captures the features of two modalities, it is uncertain whether early fusion is necessary to obtain a joint representation.