Introduction
The field of MultiModal LLMs (MM-LLMs) has seen significant expansion, leveraging pre-trained unimodal models to mitigate the computational cost of training from scratch. These models excel not only in natural language understanding and generation but also in processing and generating MultiModal (MM) content, moving the field closer to artificial general intelligence.
Architectural Composition and Training Pipeline
MM-LLMs are composed of five architectural components: Modality Encoder, Input Projector, LLM Backbone, Output Projector, and Modality Generator. Together, these components allow a single model to consume and produce a diverse range of modalities. The training pipeline is split into MM Pre-Training (PT) and MM Instruction-Tuning (IT), and centers on adapting a pre-trained text-only LLM to support MM input and output. Given the exorbitant cost of training MM-LLMs end to end, a notable shift in the field is the focus on training strategies that optimize efficiency, typically by updating only a small fraction of parameters; a sketch of the five-component composition follows.
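To make the five-component composition concrete, below is a minimal sketch in PyTorch. All class names, dimensions, and layer choices are illustrative assumptions rather than the design of any particular model; each small layer merely stands in for a full pre-trained module. The freezing pattern reflects the common PT recipe in which only the lightweight projectors are trained.

```python
# Illustrative sketch of the five MM-LLM components (hypothetical names/dims).
import torch
import torch.nn as nn

class MMLLM(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, gen_dim=768):
        super().__init__()
        # 1. Modality Encoder: maps raw input (flattened 16x16 RGB patches
        #    here) to features; stands in for a frozen pre-trained encoder.
        self.modality_encoder = nn.Linear(3 * 16 * 16, enc_dim)
        # 2. Input Projector: aligns encoder features with the LLM token
        #    space; typically the main trainable piece during MM PT.
        self.input_projector = nn.Linear(enc_dim, llm_dim)
        # 3. LLM Backbone: a frozen pre-trained language model; one
        #    Transformer layer stands in for it here.
        self.llm_backbone = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=8, batch_first=True)
        # 4. Output Projector: maps LLM hidden states to the generator's
        #    conditioning space (only needed when generating MM output).
        self.output_projector = nn.Linear(llm_dim, gen_dim)
        # 5. Modality Generator: stands in for a frozen decoder (e.g., a
        #    diffusion model conditioned on the projected features).
        self.modality_generator = nn.Linear(gen_dim, 3 * 16 * 16)

        # Mirror the common PT recipe: freeze everything but the projectors.
        for module in (self.modality_encoder, self.llm_backbone,
                       self.modality_generator):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, image_patches, text_embeds):
        feats = self.modality_encoder(image_patches)      # (B, N, enc_dim)
        mm_tokens = self.input_projector(feats)           # (B, N, llm_dim)
        # Concatenate projected MM tokens with text token embeddings.
        hidden = self.llm_backbone(torch.cat([mm_tokens, text_embeds], dim=1))
        cond = self.output_projector(hidden[:, -1:, :])   # conditioning signal
        return self.modality_generator(cond)              # generated modality

# Toy usage: batch of 2, four 16x16 RGB "patches", eight text embeddings.
model = MMLLM()
out = model(torch.randn(2, 4, 3 * 16 * 16), torch.randn(2, 8, 4096))
print(out.shape)  # torch.Size([2, 1, 768])
```

In a real system the encoder would be a pre-trained vision tower, the backbone a full LLM, and the generator a diffusion decoder; because the two projectors account for only a small fraction of total parameters, training them alone is what keeps the pipeline affordable.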
State-of-the-Art Models
A broad spectrum of MM-LLMs, each with unique design features, has been introduced to address various MM tasks. Models such as Flamingo, BLIP-2, and MiniGPT-4 emphasize MM understanding, generating text conditioned on visual and language inputs. Conversely, models such as MiniGPT-5 handle multiple modalities on both the input and output sides, producing images alongside text. Continued progress has yielded architectures such as NExT-GPT and CoDi-2, which aim to build end-to-end any-to-any MM systems rather than cascades of separately trained processes.
Benchmarks and Emerging Directions
The performance of MM-LLMs is assessed on numerous mainstream benchmarks, providing insight into model effectiveness and guiding future improvements. Promising directions include extending MM-LLMs with more modalities and stronger LLM backbones, improving dataset quality, and progressing toward any-to-any modality conversion. More comprehensive, practical, and challenging benchmarks are also needed to evaluate MM-LLMs thoroughly. Further avenues, such as deploying lightweight models, integrating embodied intelligence, and advancing continual IT, sketch a roadmap for future research.
By modeling the intricate interplay between modalities and harnessing the power of pre-trained LLMs, MM-LLMs continue to expand the capabilities of AI systems, drawing them closer to human-like intelligence within practical computational limits. This survey serves as a compass for researchers navigating the MM-LLMs landscape, marking pathways to terrain that still awaits exploration.