Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception (2305.06324v2)
Abstract: We present Integrated Multimodal Perception (IMP), a simple and scalable approach to multimodal, multi-task training and modeling. IMP integrates multimodal inputs, including image, video, text, and audio, into a single Transformer encoder with minimal modality-specific components. IMP uses a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) Alternating gradient-descent updates across diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) Sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers, and greatly mitigates conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks, including video classification, image classification, and image-text and video-text retrieval. Most notably, we train a sparse IMP-MoE-L variant focused on video tasks that sets a new state of the art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving on the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of the previous methods' total training compute.
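To make the alternating-update idea concrete, here is a minimal sketch in JAX (which the paper's tooling references suggest is part of its training stack). The toy shared encoder, the two stand-in losses, the round-robin schedule, and all shapes are illustrative assumptions, not the authors' implementation; the point is only that each step draws one task with its own objective and batch shape and applies a gradient update to the same shared parameters.

```python
# A minimal AGD sketch, assuming toy objectives and a toy shared encoder;
# `encode`, both losses, and the round-robin schedule are hypothetical.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
params = {"w": jax.random.normal(key, (8, 4))}  # single shared encoder

def encode(params, x):
    # Modality-agnostic encoder: one linear map in this toy example.
    return x @ params["w"]

def contrastive_loss(params, x, y):
    # Toy stand-in for an image-text / video-text contrastive objective.
    logits = encode(params, x) @ y.T
    diag = jnp.arange(x.shape[0])
    return -jnp.mean(jax.nn.log_softmax(logits)[diag, diag])

def classification_loss(params, x, labels):
    # Toy stand-in for a supervised classification objective.
    logp = jax.nn.log_softmax(encode(params, x))
    return -jnp.mean(logp[jnp.arange(x.shape[0]), labels])

def make_update(loss_fn, lr=0.1):
    # One jit-compiled SGD step per task; each task compiles against its
    # own batch shapes, so heterogeneous resolutions can interleave.
    @jax.jit
    def update(params, *batch):
        loss, grads = jax.value_and_grad(loss_fn)(params, *batch)
        new_params = jax.tree_util.tree_map(
            lambda p, g: p - lr * g, params, grads)
        return new_params, loss
    return update

k1, k2, k3 = jax.random.split(key, 3)
tasks = [
    (contrastive_loss,
     (jax.random.normal(k1, (4, 8)), jax.random.normal(k2, (4, 4)))),
    (classification_loss,
     (jax.random.normal(k3, (6, 8)), jnp.array([0, 1, 2, 3, 0, 1]))),
]
updates = [make_update(fn) for fn, _ in tasks]

for step in range(10):
    i = step % len(tasks)  # alternate tasks/objectives across steps
    params, loss = updates[i](params, *tasks[i][1])
    print(f"step {step}: task {i}, loss {float(loss):.3f}")
```

Because each task gets its own compiled update, batches of different modalities and input resolutions alternate freely without being padded to a common shape, which is the property the abstract's first insight relies on.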