
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion (2402.05889v3)

Published 8 Feb 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Despite impressive advancements in recent multimodal reasoning approaches, they are still limited in flexibility and efficiency, as these models typically process only a few fixed modality inputs and require updates to numerous parameters. This paper tackles these critical challenges and proposes CREMA, a generalizable, highly efficient, and modular modality-fusion framework that can incorporate any new modality to enhance video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio, thermal heatmap, and touch map) from given videos without extra human annotation by leveraging sensors or existing pre-trained models. Next, we introduce a query transformer with multiple parameter-efficient modules associated with each accessible modality. It projects diverse modality features to the LLM token embedding space, allowing the model to integrate different data types for response generation. Furthermore, we propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy. It helps compress information across various assisting modalities, maintaining computational efficiency in the LLM while improving performance. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including conventional VideoQA and Video-Audio/3D/Touch/Thermal QA, and achieve better/equivalent performance against strong multimodal LLMs, including OneLLM, BLIP-2, and SeViLA while reducing over 90% trainable parameters. We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains, the design of the fusion module, and example visualizations.

Efficient Multimodal Video Reasoning with CREMA

In multimodal learning, integrating diverse sensory inputs such as audio, visual, and text data lets AI models build a more human-like understanding of the world. A key challenge in this domain is processing and fusing these varied data types efficiently for tasks like video reasoning. CREMA addresses this challenge with a modular framework that adapts and fuses multimodal inputs without the large-scale parameter updates such models typically require.

Approach and Contributions

CREMA introduces a framework that leverages modular adaptation to inject new modalities into video reasoning tasks efficiently. Its architecture is built on a backbone of visual-LLMs, extending their capabilities to accommodate additional modalities such as depth maps, optical flow, audio, and 3D point clouds with minimal trainable parameters.
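To make the parameter-efficiency idea concrete, here is a minimal PyTorch sketch, not the authors' code: a stand-in backbone is frozen and only small per-modality modules remain trainable. The adapter shape, hidden sizes, and modality names are illustrative assumptions.

```python
# Minimal sketch: freeze a backbone, train only lightweight per-modality adapters.
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    for p in module.parameters():
        p.requires_grad = False
    return module

# Stand-in for a frozen visual-LLM backbone (illustrative, not the paper's model).
backbone = freeze(nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=12))

# One small trainable module per modality; only these parameters get updated.
adapters = nn.ModuleDict({
    m: nn.Sequential(nn.Linear(768, 64), nn.GELU(), nn.Linear(64, 768))
    for m in ["video", "flow", "depth", "audio", "pointcloud"]
})

trainable = sum(p.numel() for p in adapters.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable params: {trainable / total:.1%} of total")
```

The point of the sketch is the ratio printed at the end: because the backbone stays frozen, adding a new modality only adds a small adapter's worth of trainable parameters.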

One of the key innovations within CREMA is its use of Multimodal Q-Former (MMQA) modules. These modules are designed for parameter-efficient projection of diverse modality features into a common embedding space, enabling seamless integration with the pre-trained LLM backbone. Notably, CREMA leverages sensors and existing pre-trained models to augment the video input with additional informative modalities, requiring no extra human annotation. Reusing existing models in this way is a practical, resourceful means of enriching the input space.
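The sketch below illustrates the general Q-Former-style projection such a module performs, assuming learnable query tokens that cross-attend over features from a frozen modality encoder and are then mapped into the LLM's token-embedding space. The class name, query count, and dimensions are assumptions for illustration, not the paper's implementation.

```python
# Q-Former-style projector sketch: learnable queries attend over one modality's
# features and are projected into the LLM token-embedding space.
import torch
import torch.nn as nn

class ModalityQFormer(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 768,
                 num_queries: int = 32, llm_dim: int = 4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim) * 0.02)
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)   # modality-specific input projection
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(hidden_dim, llm_dim)        # map into LLM token space

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, feat_dim) from a frozen modality encoder
        kv = self.feat_proj(feats)
        q = self.queries.expand(feats.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)               # queries attend over modality features
        return self.to_llm(fused)                           # (batch, num_queries, llm_dim)

# Usage: one such module per modality, e.g. optical-flow features of width 1024
flow_tokens = ModalityQFormer(feat_dim=1024)(torch.randn(2, 196, 1024))
print(flow_tokens.shape)                                    # torch.Size([2, 32, 4096])
```

Each modality keeps its own lightweight projector, while the output token format is identical across modalities, which is what lets the frozen LLM consume them uniformly.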

Additionally, the framework addresses computational efficiency through its modality fusion module, dubbed CREMA-Espresso. This module compresses and combines the query tokens from the assisting modalities, so compute in the LLM stays roughly constant even as the number of input modalities grows. This property is critical for scalability and practical deployment in real-world scenarios.
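A rough sketch of this kind of fusion is shown below, assuming the per-modality query tokens from the previous sketch are gated and averaged into a single fixed-size token set so the LLM's input does not grow with the number of modalities. The gating design is a plausible stand-in rather than the paper's exact module.

```python
# Lightweight fusion sketch: gate and average query tokens across modalities
# into one fixed-size token set for the LLM.
import torch
import torch.nn as nn

class LightweightFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # per-token relevance gate

    def forward(self, modality_tokens: list[torch.Tensor]) -> torch.Tensor:
        # each element: (batch, num_queries, dim) from one modality's projector
        stacked = torch.stack(modality_tokens, dim=1)       # (batch, M, num_queries, dim)
        gates = self.gate(stacked)                          # (batch, M, num_queries, 1)
        fused = (gates * stacked).sum(dim=1) / gates.sum(dim=1).clamp_min(1e-6)
        return fused                                        # (batch, num_queries, dim): fixed size

# Three assisting modalities, each contributing 32 query tokens of width 4096
tokens = [torch.randn(2, 32, 4096) for _ in range(3)]
print(LightweightFusion(4096)(tokens).shape)                # torch.Size([2, 32, 4096])
```

The design choice being illustrated is that the LLM always sees the same number of tokens regardless of how many assisting modalities are attached, which is where the computational savings come from.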

Evaluation and Insights

CREMA was evaluated on seven video-language reasoning tasks, spanning conventional VideoQA as well as Video-Audio, 3D, Touch, and Thermal QA. It matches or exceeds strong multimodal LLMs (MLLMs) such as OneLLM, BLIP-2, and SeViLA while reducing trainable parameters by over 90%. On tasks that demand understanding beyond video and text alone, these results point to the effectiveness of its fusion strategy and modular design, which integrate additional modalities without substantial computational overhead.

The research further provides an in-depth analysis of how different modalities influence reasoning tasks, offering valuable insights into the design of multimodal fusion strategies and the practical implications of adopting such models in applications. For instance, the addition of depth and audio information in video reasoning tasks not only improves performance metrics but also highlights the model's capability to leverage nuanced data types for enriched contextual understanding.

Future Directions

Looking ahead, CREMA's ability to incorporate new modalities with little overhead opens exciting avenues for future exploration. Since the framework already handles inputs such as touch maps and thermal heatmaps, it sets the stage for models that integrate even more diverse sensor data, further extending AI's multi-sensory understanding.

Moreover, the modular design of CREMA encourages further exploration into more sophisticated fusion techniques that could enhance the model's ability to discern relevant information from a wider array of inputs. The exploration of dynamically adjustable fusion mechanisms, tailored to the specific requirements of the input data or the task at hand, represents a promising direction for future research.

Conclusion

In sum, CREMA represents a significant step forward in multimodal learning, offering a highly efficient and flexible framework for video reasoning tasks. Its modular design not only ensures computational efficiency but also allows for the seamless integration of new modalities, catering to the ever-evolving landscape of AI applications. The insights and methodologies presented lay a foundation for future advances in AI systems that process and understand the multimodal world.

Authors (3)
  1. Shoubin Yu (15 papers)
  2. Jaehong Yoon (43 papers)
  3. Mohit Bansal (304 papers)