Effective multimodal processing in Transformer attention

Determine effective strategies for processing information from multiple modalities—such as visual and textual tokens—within Transformer attention mechanisms, so that modality-specific attention patterns can be jointly modeled and leveraged without degrading performance across tasks.

Background

The paper proposes Mixture-of-Head attention (MoH), which treats attention heads as experts with dynamic routing, aiming to reduce redundant head activation and improve efficiency without increasing parameters. While MoH demonstrates strong performance across ViT, DiT, and LLMs, handling multimodal inputs poses distinct challenges because prior work has observed that visual and textual tokens exhibit different attention patterns.
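
The routing idea behind MoH can be made concrete with a short sketch. The snippet below is a minimal, hypothetical PyTorch illustration rather than the authors' implementation: a few shared heads stay always active, a learned router selects the Top-K of the remaining heads per token, and head outputs are combined as a weighted sum instead of a uniform concatenation. Names such as `MoHAttention`, `num_shared_heads`, and `top_k` are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MoHAttention(nn.Module):
    """Minimal sketch of Mixture-of-Head (MoH) style attention.

    Shared heads are always active; a per-token router picks Top-K of the
    remaining ("routed") heads, and all head outputs are combined as a
    weighted sum rather than a uniform concatenation.
    """

    def __init__(self, dim, num_heads=8, num_shared_heads=2, top_k=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.num_shared = num_shared_heads
        self.num_routed = num_heads - num_shared_heads
        self.top_k = top_k
        self.head_dim = dim // num_heads

        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Router scores the routed heads for every token.
        self.router = nn.Linear(dim, self.num_routed)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)

        # Standard scaled dot-product attention per head.
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        heads = (attn @ v).permute(0, 2, 1, 3)         # (B, N, heads, head_dim)

        # Shared heads get weight 1; routed heads get sparse softmax weights.
        logits = self.router(x)                        # (B, N, num_routed)
        topk_val, topk_idx = logits.topk(self.top_k, dim=-1)
        routed_w = torch.zeros_like(logits)
        routed_w.scatter_(-1, topk_idx, topk_val.softmax(dim=-1))
        shared_w = torch.ones(B, N, self.num_shared, dtype=x.dtype, device=x.device)
        weights = torch.cat([shared_w, routed_w], dim=-1)  # (B, N, heads)

        out = (heads * weights.unsqueeze(-1)).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)                         # (batch, tokens, dim)
    print(MoHAttention(dim=64)(x).shape)               # torch.Size([2, 16, 64])
```

For multimodal inputs, the open question is how such a router and the heads themselves should treat visual versus textual tokens, for example whether routing should be modality-aware, which this sketch does not address.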

In the Limitations and Future Work section, the authors explicitly note that effectively processing information from multiple modalities within attention mechanisms remains unresolved, motivating future research into multimodal attention design and routing within Transformer-based architectures.

References

Effectively processing information from multiple modalities in the attention mechanism remains an open question.

MoH: Multi-Head Attention as Mixture-of-Head Attention (arXiv:2410.11842, Jin et al., 15 Oct 2024), Appendix: Additional Discussions, Limitations and Future Work, Subsection "Multimodal Inputs"