Effective multimodal processing in Transformer attention
Determine effective strategies for processing information from multiple modalities—such as visual and textual tokens—within Transformer attention mechanisms, so that modality-specific attention patterns can be jointly modeled and leveraged without degrading performance across tasks.
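To make concrete what "processing multiple modalities within attention" involves, the sketch below runs standard multi-head self-attention over a concatenated sequence of visual and textual tokens, with a learned modality embedding so that heads can in principle develop modality-specific attention patterns. This is a minimal, assumption-laden illustration of the setting, not the MoH mechanism from the cited paper; the module name, dimensions, and modality-embedding choice are hypothetical.

```python
# Minimal sketch (assumptions, not the MoH method): joint self-attention over
# concatenated visual and textual tokens with learned modality embeddings.
import torch
import torch.nn as nn


class JointModalityAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Standard multi-head self-attention shared by both modalities.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One learned embedding per modality (0 = visual, 1 = textual),
        # added to token features before attention.
        self.modality_embed = nn.Embedding(2, dim)

    def forward(self, visual: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
        # visual: (batch, n_vis, dim), textual: (batch, n_txt, dim)
        tokens = torch.cat([visual, textual], dim=1)
        modality_ids = torch.cat([
            torch.zeros(visual.size(1), dtype=torch.long, device=tokens.device),
            torch.ones(textual.size(1), dtype=torch.long, device=tokens.device),
        ])
        tokens = tokens + self.modality_embed(modality_ids)
        # Every token attends to every other token regardless of modality;
        # per-head attention weights determine how much cross-modal mixing occurs.
        out, _ = self.attn(tokens, tokens, tokens, need_weights=False)
        return out


if __name__ == "__main__":
    layer = JointModalityAttention()
    vis = torch.randn(2, 49, 256)   # e.g. 7x7 image patch tokens
    txt = torch.randn(2, 16, 256)   # e.g. 16 word tokens
    print(layer(vis, txt).shape)    # torch.Size([2, 65, 256])
```

The open question is whether such joint modeling can be done so that heads specialize per modality where useful while still sharing capacity, without degrading performance on either unimodal or multimodal tasks.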
References
Effectively processing information from multiple modalities in the attention mechanism remains an open question.
— MoH: Multi-Head Attention as Mixture-of-Head Attention
(arXiv:2410.11842, Jin et al., 15 Oct 2024), Appendix: Additional Discussions, Limitations and Future Work, Subsection "Multimodal Inputs"