MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal (2402.11297v1)

Published 17 Feb 2024 in cs.CL

Abstract: Our contribution introduces a groundbreaking multimodal LLM designed to comprehend multi-images, multi-audio, and multi-images-multi-audio within a single multiturn session. Leveraging state-of-the-art models, we utilize the SigLIP encoder for visual inputs and the Whisper Encoder for audio inputs. Notably, this multimodal LLM is bilingual, proficient in understanding both English and Malay simultaneously. We proudly unveil two versions of this model: TinyLlama with 1.1B parameters, and Mistral with 7B parameters. With its ability to navigate diverse modalities and languages, our model represents a significant advancement for the Malaysian context and beyond. All models released at https://huggingface.co/collections/mesolitica/multimodal-malaysian-LLM-65c6f893e03f78fa9e5c8859

Multimodal LLMs for Multi-Image, Multi-Audio, and Multi-Turn Understanding

Introduction to MMMModal

The MMMModal initiative represents a significant stride in the development of multimodal LLMs. The research tackles the challenge of processing and understanding multiple modalities (images, audio, and text) within a single multi-turn dialogue session. Central to this advancement are the two released versions: TinyLlama, with 1.1 billion parameters, and Mistral, a larger variant with 7 billion parameters. Both models are bilingual, designed to move fluently between English and Malay, setting a precedent for multimodal LLMs in the Malaysian context and beyond.
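The summary does not include implementation code, but the abstract names the building blocks: a SigLIP encoder for visual inputs, a Whisper encoder for audio inputs, and a TinyLlama or Mistral decoder. The sketch below, using Hugging Face transformers, shows one plausible way to wire these components together through linear projection layers. The specific checkpoint names, hidden sizes, and projection design are illustrative assumptions, not the authors' exact architecture.

```python
# Illustrative sketch only; checkpoint names and the projection design are
# assumptions, not the authors' exact implementation.
import torch.nn as nn
from transformers import SiglipVisionModel, WhisperModel, AutoModelForCausalLM

class MultimodalLM(nn.Module):
    def __init__(self,
                 vision_name="google/siglip-base-patch16-224",   # assumed checkpoint
                 audio_name="openai/whisper-base",               # assumed checkpoint
                 llm_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"): # assumed checkpoint
        super().__init__()
        self.vision = SiglipVisionModel.from_pretrained(vision_name)
        self.audio = WhisperModel.from_pretrained(audio_name).encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        d_llm = self.llm.config.hidden_size
        # Linear projections map encoder features into the LLM embedding space.
        self.vision_proj = nn.Linear(self.vision.config.hidden_size, d_llm)
        self.audio_proj = nn.Linear(self.audio.config.d_model, d_llm)

    def encode_image(self, pixel_values):
        feats = self.vision(pixel_values=pixel_values).last_hidden_state
        return self.vision_proj(feats)   # (batch, num_patches, d_llm)

    def encode_audio(self, input_features):
        feats = self.audio(input_features=input_features).last_hidden_state
        return self.audio_proj(feats)    # (batch, num_frames, d_llm)

    def forward(self, inputs_embeds, attention_mask=None, labels=None):
        # Projected image/audio tokens are interleaved with text embeddings
        # upstream, then passed to the decoder as one sequence.
        return self.llm(inputs_embeds=inputs_embeds,
                        attention_mask=attention_mask,
                        labels=labels)
```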

Core Contributions

The primary contribution of this paper is the development of the MMMModal framework, which integrates multiple modalities (multi-images, multi-audio, and their combination) into a coherent LLM architecture. The architecture handles not just single-modality inputs but complex, real-world scenarios where inputs span visual, auditory, and textual channels. Key to its success is the use of synthetic data generation methods that provide robust datasets for multimodal and bilingual (English and Malay) comprehension.
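The next section outlines the authors' specific data approaches. As a general illustration of the synthetic-generation pattern described here, the snippet below sketches one common recipe: prompting a text LLM with captions or pseudolabeled transcriptions to produce question-answer pairs. The prompt template and helper function are hypothetical, not the authors' exact pipeline.

```python
# Hypothetical sketch of synthetic instruction-data generation from captions
# or pseudolabeled transcriptions; not the authors' exact procedure.
import json

PROMPT_TEMPLATE = (
    "You are given a {modality} description of Malaysian content:\n"
    "{context}\n\n"
    "Write one question a user might ask about it, in {language}, "
    "and a faithful answer. Return JSON with keys 'question' and 'answer'."
)

def make_synthetic_example(generate, context, modality="audio transcription",
                           language="Malay"):
    """`generate` is any text-completion callable (e.g. an LLM API client)."""
    prompt = PROMPT_TEMPLATE.format(modality=modality, context=context,
                                    language=language)
    qa = json.loads(generate(prompt))
    # Keep the original context so the trainer can pair the text instruction
    # with the corresponding raw image/audio input.
    return {"context": context, "question": qa["question"],
            "answer": qa["answer"], "language": language}
```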

Synthetic Data Generation Approaches

The researchers designed methodologies for creating comprehensive synthetic datasets that reflect the diverse digital footprint of Malaysian and Singaporean contexts. These methods include:

  • Utilizing pseudolabeled transcriptions of YouTube content to create audio-based instructional data.
  • Curating a visual context dataset encapsulating various aspects of Malaysian culture, from cuisine to transportation.
  • Generating multi-images and multi-audio datasets that facilitate the understanding of relationships between different media types.
  • Adopting a two-step training procedure: pretraining for feature alignment, followed by fine-tuning on the multimodal data, which improves the model's ability to bridge visual/audio inputs and textual output (a minimal sketch of this schedule follows the list).
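As referenced above, such two-stage procedures are commonly implemented by first training only the projection layers while the encoders and LLM stay frozen (feature alignment), then unfreezing the LLM for multimodal fine-tuning. The sketch below reuses the hypothetical MultimodalLM class from the earlier snippet and shows the parameter-freezing pattern under those assumptions; the exact schedule, which modules stay frozen in stage 2, and the hyperparameters are not specified in this summary.

```python
# Sketch of a two-stage schedule, assuming the MultimodalLM class above.
# Stage boundaries and optimizer settings are illustrative assumptions.
import torch

def set_stage(model, stage):
    if stage == 1:
        # Stage 1: feature alignment. Train only the projection layers;
        # keep both encoders and the LLM frozen.
        for p in model.parameters():
            p.requires_grad = False
        for p in model.vision_proj.parameters():
            p.requires_grad = True
        for p in model.audio_proj.parameters():
            p.requires_grad = True
    elif stage == 2:
        # Stage 2: multimodal fine-tuning. Encoders stay frozen (assumption);
        # the LLM and projections are trained on multimodal instruction data.
        for p in model.vision.parameters():
            p.requires_grad = False
        for p in model.audio.parameters():
            p.requires_grad = False
        for p in model.llm.parameters():
            p.requires_grad = True
        for p in model.vision_proj.parameters():
            p.requires_grad = True
        for p in model.audio_proj.parameters():
            p.requires_grad = True

# Example: build an optimizer over whatever is currently trainable.
# model = MultimodalLM()
# set_stage(model, 1)
# optim = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```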

Practical Implications and Theoretical Significance

MMMModal’s architecture has practical utility beyond academic interest. In multilingual and culturally diverse settings, the model's ability to interpret complex multimodal inputs suggests applications in AI-driven customer service, educational tools, and content moderation platforms, among others. From a theoretical standpoint, MMMModal serves as a substantial case study of the efficacy of synthetic data and feature-alignment techniques in training LLMs to comprehend multimodal inputs.

Future Directions in AI

Looking ahead, this research paves the way for further exploration of video inputs and the integration of additional linguistic diversity into LLMs. By refining synthetic dataset generation and embracing a broader spectrum of modalities, future work could produce models with an even deeper understanding of human interaction. This research is a foundational step toward AI systems capable of comprehensive, context-aware interpretation across the full spectrum of human communication modalities.

Acknowledgements and Conclusions

The success of MMMModal underscores a collaborative effort across various sectors, including significant contributions from the Malaysia-AI volunteer community and support from NVIDIA Inception. The MMMModal project not only showcases a leap in multimodal AI capabilities but also emphasizes the collective advancement possible through diverse contributions in the field of AI research. As we look toward the horizon of AI development, the MMMModal project exemplifies the critical importance of multimodal understanding in creating more intuitive, responsive, and culturally aware artificial intelligence systems.

Authors (4)
  1. Husein Zolkepli (4 papers)
  2. Aisyah Razak (5 papers)
  3. Kamarul Adha (5 papers)
  4. Ariff Nazhan (5 papers)