
Attention Bottlenecks for Multimodal Fusion (2107.00135v3)

Published 30 Jun 2021 in cs.CV

Abstract: Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.

Analysis of “Attention Bottlenecks for Multimodal Fusion”

The paper "Attention Bottlenecks for Multimodal Fusion" introduces a novel transformer-based architecture for fusing multimodal data, such as vision and audio, in machine perception models. The problem matters because conventional approaches typically rely on late-stage fusion of final representations or predictions, which may fail to exploit relationships between modalities earlier in processing.

Key Contributions

The central contribution of this paper is the Multimodal Bottleneck Transformer (MBT). Rather than allowing full pairwise attention between modalities, MBT forces cross-modal information to pass through a small number of learned bottleneck tokens, requiring each modality to condense and share only its most relevant information while reducing computational overhead.
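
To make the mechanism concrete, the sketch below shows what a single fusion layer might look like in PyTorch: each modality attends only over its own tokens plus a shared set of bottleneck tokens, and the two updated copies of the bottleneck are then merged. This is an illustrative reconstruction from the paper's description, not the authors' released code; the class name BottleneckFusionLayer, the use of separate per-modality encoder layers, and the simple averaging rule are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One fusion layer: each modality attends only to its own tokens plus a
    small shared set of bottleneck tokens (illustrative sketch)."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Separate encoder layers per modality (sharing weights is an alternative choice).
        self.audio_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens, bottleneck_tokens):
        # audio_tokens: (B, Na, D), video_tokens: (B, Nv, D), bottleneck_tokens: (B, Nb, D)
        nb = bottleneck_tokens.shape[1]

        # Each modality sees its own tokens concatenated with the shared bottleneck,
        # so cross-modal information can only flow through those Nb tokens.
        audio_out = self.audio_layer(torch.cat([audio_tokens, bottleneck_tokens], dim=1))
        video_out = self.video_layer(torch.cat([video_tokens, bottleneck_tokens], dim=1))

        audio_tokens, audio_bn = audio_out[:, :-nb], audio_out[:, -nb:]
        video_tokens, video_bn = video_out[:, :-nb], video_out[:, -nb:]

        # Merge the two modality-specific updates of the bottleneck (simple average here).
        bottleneck_tokens = 0.5 * (audio_bn + video_bn)
        return audio_tokens, video_tokens, bottleneck_tokens
```

Because each attention call covers only one modality's tokens plus a handful of bottleneck tokens, the quadratic cost of attending over the full concatenated audio-visual sequence is avoided.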

Methodology

The MBT builds on the standard transformer, combining within-modality self-attention with a tightly restricted form of cross-modal exchange. The architecture includes:

  1. Multi-layer Fusion Bottlenecks: Instead of full modality-to-modality attention, the model channels all cross-modal interaction through a small set of dedicated bottleneck latent units, which reduces computation and limits sharing to the most relevant information.
  2. Early, Mid, and Late Fusion Strategies: The authors compare fusing at different depths and find that mid-layer fusion performs best; early layers specialize in unimodal features before cross-modal exchange begins (a layer-schedule sketch follows this list).
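
Continuing the sketch above (and reusing the hypothetical BottleneckFusionLayer and its imports), a mid-fusion schedule can be expressed as modality-specific layers up to a chosen fusion depth, followed by bottleneck-fusion layers. The layer counts, fusion depth, and number of bottleneck tokens below are placeholder values, not the paper's tuned settings.

```python
class MidFusionEncoder(nn.Module):
    """Unimodal layers up to `fusion_layer`, bottleneck fusion afterwards (sketch)."""

    def __init__(self, dim=768, num_layers=12, fusion_layer=8, num_bottleneck_tokens=4):
        super().__init__()
        # Stage 1: modality-specific layers with no cross-modal interaction.
        self.audio_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, 8, batch_first=True) for _ in range(fusion_layer)])
        self.video_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, 8, batch_first=True) for _ in range(fusion_layer)])
        # Stage 2: remaining layers exchange information only via bottleneck tokens.
        self.fusion_layers = nn.ModuleList(
            [BottleneckFusionLayer(dim) for _ in range(num_layers - fusion_layer)])
        # Learned bottleneck tokens, broadcast across the batch at forward time.
        self.bottleneck = nn.Parameter(0.02 * torch.randn(1, num_bottleneck_tokens, dim))

    def forward(self, audio_tokens, video_tokens):
        for a_layer, v_layer in zip(self.audio_layers, self.video_layers):
            audio_tokens = a_layer(audio_tokens)
            video_tokens = v_layer(video_tokens)
        bottleneck = self.bottleneck.expand(audio_tokens.shape[0], -1, -1)
        for layer in self.fusion_layers:
            audio_tokens, video_tokens, bottleneck = layer(audio_tokens, video_tokens, bottleneck)
        return audio_tokens, video_tokens
```

In this sketch, setting fusion_layer = 0 corresponds to early fusion, while fusion_layer = num_layers keeps the two streams separate throughout (late fusion); varying this depth is the axis explored in the fusion-strategy ablations.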

The authors evaluate MBT across diverse datasets, achieving state-of-the-art results on benchmarks including AudioSet, Epic-Kitchens, and VGGSound. Comprehensive ablation studies verify the contribution of individual model components and quantify the influence of design parameters such as the position of the fusion layer and the number of bottleneck tokens.

Implications

The MBT introduces a more systematic approach to multimodal fusion, potentially influencing the design of future AI systems that rely on multiple modalities. The strategic placement of fusion layers and the deliberate restriction of modality interactions may inspire new methods for reducing complexity in other AI applications.

Moreover, the model’s improved efficacy on benchmarks suggests broader applicability across tasks that require synchronized processing of varied data types, perhaps extending to text and further modalities in future studies.

Future Directions

This paper opens several avenues for future research. Extending the architecture to accommodate additional modalities (e.g., text or optical flow) could provide insights into more universal multimodal models. Furthermore, exploring the application of MBT in self-supervised or unsupervised scenarios could unveil new capabilities and efficiencies.

Conclusion

The detailed exploration of attention bottlenecks for multimodal fusion marks a significant step in refining the way AI systems process and integrate data. While the model demonstrates substantial improvements over conventional late-stage fusion methods, the adaptability of the approach could lead to broader adoption and continued innovation in multimodal AI research. This work stands as a valuable contribution to the field of AI, offering robust solutions to complex multimodal integration challenges.

Authors (6)
  1. Arsha Nagrani (62 papers)
  2. Shan Yang (58 papers)
  3. Anurag Arnab (56 papers)
  4. Aren Jansen (25 papers)
  5. Cordelia Schmid (206 papers)
  6. Chen Sun (187 papers)
Citations (491)