
Zorro: the masked multimodal transformer (2301.09595v2)

Published 23 Jan 2023 in cs.CV

Abstract: Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.

Overview of Zorro

Zorro addresses a key limitation of prior multimodal methods: fully entangled representations. It allows a single backbone Transformer to process multiple sensory modalities, such as audio and video, while supporting both unimodal and multimodal processing.

Methodology

Zorro uses attention masks to control how inputs from each modality are routed inside the Transformer: modality-specific portions of the representation stay pure, while a separate fusion portion can attend to all modalities. The paper applies this masking to three prominent transformer-based architectures, namely ViT, Swin, and HiP. Because Zorro produces both multimodal and modality-specific outputs, it also enables contrastive audio-visual pre-training, which requires independent audio and visual features to avoid collapse.
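The routing described above can be expressed as a binary attention mask over a token sequence laid out as [audio | video | fusion]. The following is an illustrative sketch of that idea, not the authors' exact implementation; the token layout and function name are assumptions:

```python
import numpy as np

def zorro_attention_mask(n_audio, n_video, n_fusion):
    """Sketch of a Zorro-style attention mask (True = attention allowed).

    Tokens are assumed ordered as [audio | video | fusion]. Audio tokens
    attend only to audio and video tokens only to video, keeping those
    streams modality-pure, while fusion tokens may attend to everything.
    """
    n = n_audio + n_video + n_fusion
    mask = np.zeros((n, n), dtype=bool)
    a = slice(0, n_audio)
    v = slice(n_audio, n_audio + n_video)
    f = slice(n_audio + n_video, n)
    mask[a, a] = True  # audio queries see only audio keys
    mask[v, v] = True  # video queries see only video keys
    mask[f, :] = True  # fusion queries see all keys
    return mask
```

Because information flows from the unimodal streams into the fusion tokens but never back, the audio and video streams remain valid unimodal representations at every layer, which is what makes unimodal inference possible after multimodal training.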

Results

With contrastive pre-training, Zorro yields state-of-the-art results across several multimodal benchmarks, including AudioSet and VGGSound. Because parts of the representation remain modality-pure, the same models can also perform unimodal inference on video and audio benchmarks such as Kinetics-400 and ESC-50, a testament to the approach's versatility.
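The contrastive pre-training relies on Zorro's modality-pure audio and video outputs. A minimal symmetric InfoNCE-style sketch of such an objective is shown below; the function name, temperature value, and exact loss form are assumptions, not the paper's specification:

```python
import numpy as np

def audio_visual_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """InfoNCE-style loss between audio and video embeddings.

    Inputs are (batch, dim) arrays; matching audio/video pairs share a
    row index. Embeddings are L2-normalized, all pairwise similarities
    are computed, and cross-entropy pulls matching pairs together in
    both directions (audio->video and video->audio).
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature
    idx = np.arange(len(a))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()         # diagonal = positives

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

A loss like this would collapse if the audio and video features were entangled from the start, which is precisely the failure mode Zorro's masking avoids.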

Contributions and Implications

The paper makes four key contributions: (1) novel multimodal Transformer architectures suitable for both supervised and self-supervised training; (2) a demonstration that Zorro-modified architectures outperform their vanilla counterparts; (3) evidence of efficient pre-training on large-scale audio-visual datasets; and (4) strong benchmark performance with the added benefit of unimodal inference. Together, these position Zorro as a practical tool for multimodal AI systems that must integrate different types of sensory data with minimal fusion engineering.

Authors (11)
  1. Adrià Recasens (19 papers)
  2. Jason Lin (8 papers)
  3. Drew Jaegle (1 paper)
  4. Luyu Wang (19 papers)
  5. Pauline Luc (13 papers)
  6. Antoine Miech (23 papers)
  7. Lucas Smaira (9 papers)
  8. Ross Hemsley (8 papers)
  9. Andrew Zisserman (248 papers)
  10. Jean-Baptiste Alayrac (38 papers)
  11. João Carreira (2 papers)
Citations (18)