
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities (2311.05698v3)

Published 9 Nov 2023 in cs.CV

Abstract: One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models, processing the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential. To address the long-sequences of the video-audio inputs, we propose to further partition the video and audio sequences in consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features producing compact but expressive representations per snippet. Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by both learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.

Authors (6)
  1. AJ Piergiovanni (40 papers)
  2. Isaac Noble (5 papers)
  3. Dahun Kim (31 papers)
  4. Michael S. Ryoo (75 papers)
  5. Victor Gomes (1 paper)
  6. Anelia Angelova (61 papers)
Citations (17)

Summary

Overview of Mirasol3B: A Multimodal Autoregressive Model for Video, Audio, and Text Synchronization

Multimodal learning, particularly across video, audio, and text, presents significant challenges due to the heterogeneity of the modalities and their asynchronous characteristics. The paper proposes Mirasol3B, a framework designed to address these complexities by decoupling the modeling into separate autoregressive components: one for the time-aligned modalities (video and audio) and one for the non-time-aligned context modalities (text). This decoupling lets the model handle high-volume, high-frequency inputs more efficiently and capture both short- and long-range dependencies.

Introduction

Multimodal models have gained traction due to their broad applicability to tasks involving multiple types of data. However, combining modalities such as video, audio, and text is non-trivial because of differences in sampling rates, data volumes, and alignment. Mirasol3B sidesteps these issues by separating the modeling: time-synchronized modalities are handled by one autoregressive component, while context modalities such as text, which are sequential but not aligned in time, are managed by another.

Architectural Innovations

The Mirasol3B framework is composed of two primary components: one tailored to the time-aligned modalities (video and audio) and another dedicated to the non-time-aligned modalities (text). The two work in concert through cross-attention mechanisms, which allow the model's parameters to be partitioned between the components while information still flows efficiently from the media stream to the text model.
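To make the layout concrete, here is a minimal PyTorch sketch of the two-component design. It assumes standard Transformer blocks; the module names, dimensions, and layer counts are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TwoComponentSketch(nn.Module):
    """Hypothetical layout: an autoregressive model over time-aligned
    audio/video latents and an autoregressive text model that
    cross-attends to it. Names and sizes are illustrative."""

    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        av_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.av_model = nn.TransformerEncoder(av_layer, n_layers)
        txt_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.text_model = nn.TransformerDecoder(txt_layer, n_layers)

    def forward(self, av_latents, text_embeds):
        # Causal masks keep both components autoregressive.
        av_mask = nn.Transformer.generate_square_subsequent_mask(av_latents.size(1))
        txt_mask = nn.Transformer.generate_square_subsequent_mask(text_embeds.size(1))
        av_out = self.av_model(av_latents, mask=av_mask)
        # Cross-attention: text queries attend to the audio/video states.
        return self.text_model(text_embeds, av_out, tgt_mask=txt_mask)
```

In the full model, `av_latents` would be the compact per-snippet representations produced by the Combiner described next.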

Time-Aligned Video/Audio Autoregressive Modeling

The model addresses long-sequence video and audio inputs by partitioning them into snippets before applying a Combiner mechanism. The Combiner is responsible for joint feature extraction, merging video and audio features into a compact, expressive representation.
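The following is a rough sketch of this partition-then-combine step, assuming a Transformer-based Combiner that reads a snippet's concatenated audio and video features plus a set of learned output queries, keeping only the query outputs as the snippet's compact representation. All names and sizes here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CombinerSketch(nn.Module):
    """Assumed Transformer-based Combiner: appends m learned output queries
    to a snippet's concatenated video and audio features and keeps only the
    query outputs as the snippet's compact representation."""

    def __init__(self, d_model=512, n_heads=8, n_layers=2, m_out=32):
        super().__init__()
        self.out_queries = nn.Parameter(torch.randn(m_out, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, Tv, d); audio_feats: (B, Ta, d) for one snippet.
        q = self.out_queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        x = torch.cat([video_feats, audio_feats, q], dim=1)
        return self.encoder(x)[:, -q.size(1):, :]  # (B, m_out, d)

def combine_snippets(combiner, video, audio, tv, ta):
    """Partition full feature streams into consecutive snippets and combine
    each one in temporal order."""
    reps = [combiner(video[:, i * tv:(i + 1) * tv], audio[:, i * ta:(i + 1) * ta])
            for i in range(video.size(1) // tv)]
    return torch.cat(reps, dim=1)  # (B, n_snippets * m_out, d)
```

The key design choice is that sequence length after the Combiner scales with the number of snippets times `m_out`, not with the raw frame count, which is what keeps the downstream autoregressive model tractable.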

Combiner Mechanism

Mirasol3B incorporates two versions of the Combiner: a standard Transformer-based version and a memory-efficient Token Turing Machine (TTM) version. The TTM is particularly advantageous for memory management, reducing the computational load and runtime considerably. Both versions ensure that the Combiner maintains causality, processing video and audio features sequentially.
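Below is a highly simplified sketch of the TTM variant's read-process-write loop, assuming a fixed-size token memory. The recency-based write used here stands in for TTM's learned write operation, so this is an approximation of the idea rather than the paper's mechanism.

```python
import torch
import torch.nn as nn

class TTMCombinerSketch(nn.Module):
    """Simplified Token-Turing-Machine-style Combiner: a fixed-size token
    memory is read together with the current snippet, processed, and written
    back, keeping per-snippet compute constant as the video grows."""

    def __init__(self, d_model=512, n_heads=8, mem_size=64, m_out=32):
        super().__init__()
        self.mem_size = mem_size
        self.out_queries = nn.Parameter(torch.randn(m_out, d_model))
        self.read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.process = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, snippet_tokens, memory):
        # Read: output queries attend over the memory plus the current snippet.
        ctx = torch.cat([memory, snippet_tokens], dim=1)
        q = self.out_queries.unsqueeze(0).expand(snippet_tokens.size(0), -1, -1)
        read, _ = self.read(q, ctx, ctx)
        out = self.process(read)  # this snippet's compact representation
        # Write: naive recency-based update keeps the memory a fixed size
        # (the real TTM learns what to write back).
        memory = torch.cat([memory, out], dim=1)[:, -self.mem_size:]
        return out, memory
```

Because each snippet only attends to a bounded memory rather than the full history, runtime stays flat in video length, which is the memory-efficiency advantage the paper attributes to the TTM version.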

Non-Time-Aligned Contextual Autoregressive Modeling

Mirasol3B uses a separate autoregressive model for non-time-aligned inputs like textual descriptions or questions, conditioned on the embeddings of the video/audio representations. Cross-attention mechanisms are employed to integrate information from the audio-video autoregressive model with the text model, enhancing the contextual understanding of the text given the multimedia inputs.
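For illustration, a greedy decoding loop for the text component might look like the sketch below; `text_model`, `embed`, and `lm_head` are hypothetical stand-ins for the text decoder, token embedding table, and output projection, not the paper's API.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def generate(text_model, embed, lm_head, av_latents, bos_id=1, eos_id=2, max_len=32):
    """Greedy autoregressive decoding: each step cross-attends to the fixed
    audio/video latents inside `text_model` (e.g. an nn.TransformerDecoder)."""
    device = av_latents.device
    tokens = torch.full((av_latents.size(0), 1), bos_id, dtype=torch.long, device=device)
    for _ in range(max_len):
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(device)
        hidden = text_model(embed(tokens), av_latents, tgt_mask=mask)
        next_id = lm_head(hidden[:, -1]).argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
        if (next_id == eos_id).all():  # stop once every sequence emits EOS
            break
    return tokens
```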

Experimental Evaluation and Results

Mirasol3B was extensively evaluated on several benchmarks and demonstrated state-of-the-art performance across various tasks:

  • Video QA: On the MSRVTT-QA dataset, Mirasol3B outperformed existing models, achieving 50.42% accuracy.
  • Long Video QA: In the ActivityNet-QA and NExT-QA benchmarks, the model proved its efficacy in handling long video sequences, obtaining 51.13% and 72.0% accuracy, respectively.
  • Audio-Video Tasks: For tasks requiring integrated audio-video understanding, such as those in the Kinetics-Sound and VGG-Sound datasets, Mirasol3B showcased superior performance, with notable improvements over previous state-of-the-art methods.

Implications and Future Directions

The practical and theoretical implications of Mirasol3B are considerable. The model's ability to handle long video sequences without a proportional increase in computational load sets a new standard for efficiency and scalability in multimodal learning. Furthermore, the architecture's separation of time-aligned and contextual modalities offers a blueprint for future models that must manage high-volume media inputs effectively.

Conclusion

Mirasol3B introduces a balanced and efficient approach to multimodal learning, effectively addressing the challenges of combining heterogeneous modalities like video, audio, and text. Its innovative architecture and significant improvements over existing models make it a valuable contribution to the field. The model's success also opens the door to future advancements focused on refining the Combiner mechanism and exploring new autoregressive strategies for even more complex multimodal tasks.
