
A Survey of Resource-efficient LLM and Multimodal Foundation Models (2401.08092v2)

Published 16 Jan 2024 in cs.LG, cs.AI, and cs.DC

Abstract: Large foundation models, including LLMs, vision transformers (ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine learning lifecycle, from training to deployment. However, the substantial advancements in versatility and performance these models offer come at a significant cost in terms of hardware resources. To support the growth of these large models in a scalable and environmentally sustainable way, there has been a considerable focus on developing resource-efficient strategies. This survey delves into the critical importance of such research, examining both algorithmic and systemic aspects. It offers a comprehensive analysis and valuable insights gleaned from existing literature, encompassing a broad array of topics from cutting-edge model architectures and training/serving algorithms to practical system designs and implementations. The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field.

Overview of Resource-Efficient Models

LLMs and multimodal foundation models have proven transformative across many domains of machine learning, delivering exceptional performance on tasks ranging from natural language processing to computer vision. Their versatility, however, comes with significant resource requirements, motivating research into resource-efficient strategies.

Algorithmic and Systemic Analysis

The survey examines research on resource efficiency for LLMs from both algorithmic and systemic perspectives. On the algorithmic side, it reviews model architectures and training/serving algorithms; on the systemic side, it covers practical system designs and implementations within computing infrastructure. Analyses are detailed for different types of models, including text, image, and multimodal variants.

The Architecture of Foundation Models

Language foundation models, for instance, have seen numerous architectural improvements, whether through optimized attention mechanisms or dynamic neural networks. These changes aim to improve processing efficiency without compromising the models' ability to learn from data. Similar advancements are observed for vision foundation models, where the emphasis is on efficient transformer pipelines and encoder-decoder structures.
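
As an illustration of one common direction in attention optimization (a generic sliding-window pattern, not a specific method from the survey), the sketch below restricts each query to a local window of past keys in PyTorch. The function name, tensor shapes, and window size are hypothetical, and this version still materializes the full score matrix, so it conveys the sparsity pattern rather than the memory savings of specialized kernels.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    """Illustrative sliding-window attention: each query attends only to
    itself and the previous `window - 1` keys, cutting the effective cost
    of full self-attention from O(n^2) toward O(n * window)."""
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5     # (..., n, n) score matrix
    # Mask future positions and keys outside the local window.
    idx = torch.arange(n)
    dist = idx.unsqueeze(1) - idx.unsqueeze(0)      # query_pos - key_pos
    mask = (dist < 0) | (dist >= window)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example usage with random tensors: (batch, heads, seq_len, head_dim).
q = k = v = torch.randn(1, 8, 128, 64)
out = sliding_window_attention(q, k, v, window=32)
print(out.shape)   # torch.Size([1, 8, 128, 64])
```
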

Training and Serving Considerations

Lastly, the survey considers the entire life cycle of large foundation models, from training to serving. Strategies for distributed training, model compression, and knowledge distillation are discussed, highlighting the challenges of scaling up these models and potential solutions to mitigate resource demands. Serving systems for foundation models, which facilitate their practical usage, are also assessed for their efficiency in handling various deployment scenarios, including cloud and edge computing environments.
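
As a concrete illustration of knowledge distillation in its generic form (not a particular recipe covered by the survey), the following PyTorch sketch blends a temperature-softened KL term against a frozen teacher with the standard cross-entropy on ground-truth labels; the function name, temperature, and weighting coefficient are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Weighted sum of (a) KL divergence between the student's and teacher's
    temperature-softened distributions and (b) cross-entropy on hard labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The temperature^2 factor keeps gradient magnitudes comparable to CE.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Example usage with random logits for a 10-class problem.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```
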

In conclusion, current research efforts are consistently pushing the boundaries of resource-efficiency in foundation models. As the field continues to evolve, future breakthroughs are expected to further enhance the effectiveness of these models while reducing their impact on computational resources.

Authors (18)
  1. Mengwei Xu (62 papers)
  2. Wangsong Yin (4 papers)
  3. Dongqi Cai (19 papers)
  4. Rongjie Yi (7 papers)
  5. Daliang Xu (9 papers)
  6. Qipeng Wang (15 papers)
  7. Bingyang Wu (7 papers)
  8. Yihao Zhao (10 papers)
  9. Chen Yang (193 papers)
  10. Shihe Wang (10 papers)
  11. Qiyang Zhang (16 papers)
  12. Zhenyan Lu (8 papers)
  13. Li Zhang (690 papers)
  14. Shangguang Wang (58 papers)
  15. Yuanchun Li (37 papers)
  16. Yunxin Liu (58 papers)
  17. Xin Jin (285 papers)
  18. Xuanzhe Liu (59 papers)
Citations (51)