Omni-SMoLA: A Scalable Approach for Enhancing Vision-and-LLMs with Soft Mixture of Low-rank Experts
One of the challenges facing large multimodal models (LMMs), which process and generate content spanning different forms of data such as images and text, is maintaining performance while being adapted to a wide range of tasks. Fine-tuning these models on too many tasks typically leads to decreased effectiveness. This is where Mixture of Experts (MoE) architectures come into play, particularly for instruction tuning, where a model is adapted to respond to specific instructions or tasks.
However, applying MoE architectures to large-scale models in the 50-to-100-billion-parameter range carries a significant computational cost. The sheer number of parameters needed to replicate and store multiple expert copies limits how many experts can practically be used.
The paper presents Omni-SMoLA, an architecture that uses a Soft MoE approach to mix many multimodal low-rank experts (hence the name SMoLA), without introducing a significant number of new parameters compared to conventional MoE models. The key idea is to add lightweight experts to an existing base model so that they learn specialized knowledge, whether modality-specific or multimodal. These experts handle different tasks by focusing on particular types of tokens, such as text or visual tokens, enabling the model to adapt to varied requirements without a significant increase in parameter count. A sketch of this kind of layer follows below.
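To make the idea concrete, here is a minimal PyTorch sketch of a frozen linear projection augmented with a soft mixture of LoRA-style low-rank experts. This is not the paper's actual implementation; the class name SoftMoLoRALinear, the router design, and the rank and expert counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMoLoRALinear(nn.Module):
    """Sketch: a frozen linear layer plus a soft mixture of low-rank experts.
    Each expert is a LoRA-style pair of factors; a lightweight router
    softly mixes the expert outputs per token."""

    def __init__(self, d_in, d_out, num_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)        # pretrained projection, kept frozen
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Low-rank expert factors: (E, d_in, r) and (E, r, d_out)
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        # Per-token router over experts
        self.router = nn.Linear(d_in, num_experts)

    def forward(self, x):                          # x: (batch, seq, d_in)
        y = self.base(x)                           # frozen base projection
        weights = F.softmax(self.router(x), dim=-1)        # (B, S, E) soft mixing
        # Expert outputs: x @ A_e @ B_e for every expert e
        expert_out = torch.einsum("bsd,edr,ero->bseo", x, self.A, self.B)
        return y + torch.einsum("bse,bseo->bso", weights, expert_out)

# Usage: layer = SoftMoLoRALinear(512, 512); out = layer(torch.randn(2, 10, 512))
```

In the paper's setting, separate sets of such experts can be dedicated to text tokens, visual tokens, and multimodal tokens, while the large base model stays frozen.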
Omni-SMoLA demonstrates substantial improvements over standard fine-tuning in experiments on vision-and-language tasks such as image captioning and visual question answering. When applied to the PaLI-3 and PaLI-X foundation models, Omni-SMoLA achieved state-of-the-art (SoTA) performance across a broad range of generative tasks, often matching or surpassing individual specialized LMM baselines and setting new SoTA results on specific tasks.
A further advantage of the Omni-SMoLA design is its parameter efficiency and low inference overhead. Despite the additional low-rank experts, inference is only slightly slower than with the base models, underlining the efficiency of the design. The architecture is also adaptable: more experts can be added as demands evolve without an extensive parameter overhaul, a notable departure from traditional scaling methods.
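A back-of-the-envelope calculation shows why many low-rank experts are cheap to add. All numbers below are illustrative assumptions, not the actual PaLI-3 or PaLI-X configuration.

```python
# Rough parameter-overhead sketch; every number here is an assumption.
d_in, d_out = 4096, 4096          # assumed size of one adapted projection
num_layers = 48                   # assumed number of adapted layers
rank, num_experts = 8, 16         # assumed low-rank expert configuration
model_params = 55e9               # assumed total size of the base LMM

extra_per_layer = num_experts * rank * (d_in + d_out)   # A and B factors
total_extra = extra_per_layer * num_layers

print(f"extra params per layer : {extra_per_layer:,}")               # 1,048,576
print(f"extra params total     : {total_extra:,}")                   # ~50M
print(f"fraction of base model : {total_extra / model_params:.3%}")  # ~0.09%
```

Under these assumptions, even sixteen experts per layer add well under a tenth of a percent of the base model's parameters, which is consistent with the paper's claim of negligible parameter and latency overhead.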
Omni-SMoLA was evaluated in multiple settings, varying the number of experts, the configuration of per-modality token experts, and the base model used for initialization. The findings show that Omni-SMoLA outperforms the conventional full-model fine-tuned baselines for both PaLI-3 and PaLI-X, setting new SoTA results on multiple benchmarks under both generalist and specialist settings.
The architecture's ability to maintain high performance for a diverse range of tasks without a significant penalty to efficiency or scalability addresses a critical issue in the development and deployment of versatile and potent LMMs. In summary, Omni-SMoLA provides a solution to adapt large models to specialized tasks efficiently while enhancing their capacity to handle a wide array of applications.