EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models (2402.00518v1)

Published 1 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: This work introduces EE-Tuning, a lightweight and economical solution to training/tuning early-exit LLMs. In contrast to the common approach of full-parameter pre-training, EE-Tuning augments any pre-trained (and possibly fine-tuned) standard LLM with additional early-exit layers that are tuned in a parameter-efficient manner, which requires significantly less computational resources and training data. Our implementation of EE-Tuning achieves outstanding training efficiency via extensive performance optimizations, as well as scalability due to its full compatibility with 3D parallelism. Results of systematic experiments validate the efficacy of EE-Tuning, confirming that effective early-exit LLM inference can be achieved with a limited training budget. In hope of making early-exit LLMs accessible to the community, we release the source code of our implementation of EE-Tuning at https://github.com/pan-x-c/EE-LLM.

An Overview of EE-Tuning: A Parameter-Efficient Method for Tuning Early-Exit LLMs

The paper explores EE-Tuning, a method devised to efficiently augment LLMs with early-exit capabilities. Unlike traditional methodologies that require full-parameter pre-training, EE-Tuning adds early-exit layers to any pre-trained LLM and trains them in a parameter-efficient manner, substantially reducing the computational resources and training data required.

Methodology and Architecture

EE-Tuning is structured as a two-stage procedure. First, a pre-trained LLM is loaded and augmented with early-exit layers placed at specified points in the network. The initialization of these early-exit layers is pivotal: it not only accelerates convergence but also leverages the knowledge embedded in the pre-trained LLM. The paper explores different early-exit architectures, namely Embedding, Norm, MLP, and Layer, each varying in complexity and number of trainable parameters.
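
To make the architecture concrete, the following PyTorch sketch shows what an early-exit head attached to an intermediate hidden state could look like. It is a minimal illustration under assumed names (EarlyExitHead, use_norm, use_mlp); the actual modules in the released EE-LLM implementation differ.

```python
import torch
import torch.nn as nn

class EarlyExitHead(nn.Module):
    """Hypothetical early-exit head mapping an intermediate hidden state to vocabulary logits.

    The optional norm/MLP components loosely mirror the Norm and MLP exit
    architectures discussed in the paper; names and structure are illustrative,
    not the exact modules in the released EE-LLM code.
    """

    def __init__(self, hidden_size: int, vocab_size: int,
                 use_norm: bool = True, use_mlp: bool = False):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size) if use_norm else nn.Identity()
        self.mlp = (
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            if use_mlp else nn.Identity()
        )
        # Output projection to the vocabulary; initializing it from the
        # pre-trained model's LM head is one way to inherit knowledge from
        # the backbone and speed up convergence.
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.mlp(self.norm(hidden_states)))
```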

The second stage of EE-Tuning trains only the parameters of these early-exit layers, while the original LLM's parameters remain frozen. The experiments establish that this training process is efficient and yields models that perform well with reduced computation and memory usage. The training leverages standard automatic differentiation and a small batch size to ensure parameter efficiency and generalization.
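
A minimal sketch of this tuning stage is given below, assuming a backbone that returns per-layer hidden states and a dictionary exit_heads mapping layer indices to exit heads; these names, the optimizer choice, and the loss formulation are illustrative assumptions rather than the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def tune_early_exits(backbone, exit_heads, dataloader, lr=1e-4, device="cuda"):
    # Freeze the original LLM so only the early-exit layers are updated.
    for p in backbone.parameters():
        p.requires_grad_(False)
    params = [p for head in exit_heads.values() for p in head.parameters()]
    optimizer = torch.optim.AdamW(params, lr=lr)

    for input_ids in dataloader:
        input_ids = input_ids.to(device)
        with torch.no_grad():
            # Assumed to return a list of hidden states, one per layer.
            hidden_states = backbone(input_ids)
        loss = 0.0
        for layer_idx, head in exit_heads.items():
            logits = head(hidden_states[layer_idx])
            # Standard next-token prediction loss at each exit.
            loss = loss + F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                input_ids[:, 1:].reshape(-1),
            )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```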

Experimental Findings

Extensive experiments were conducted on LLMs with up to 70 billion parameters, using Llama 2-Chat models as the foundation. The empirical results show that EE-Tuning significantly conserves computation: models are converted within 20 to 120 GPU hours depending on model size, and certain configurations run even on a single GPU.

Downstream evaluations indicate that early-exit models achieve approximately 1.2× to 1.6× speedup on various tasks without degrading output quality. Early-exit models can maintain or even exceed the performance benchmarks set by conventional models, primarily by preventing overthinking and retaining alignment with human preferences during inference.
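
The source of the speedup can be pictured with a simple confidence-based exit rule: if an intermediate exit is sufficiently confident about the next token, the remaining layers are skipped. The sketch below is a simplified illustration of this idea with placeholder names, not the actual EE-LLM inference pipeline, which additionally handles details such as KV caching that are omitted here.

```python
import torch

def next_token_with_early_exit(layers, exit_heads, final_head, hidden, threshold=0.9):
    # Greedy decoding for a single sequence (batch size 1). `layers`,
    # `exit_heads` (layer index -> head), `final_head`, and `threshold`
    # are illustrative placeholders.
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        if idx in exit_heads:
            probs = torch.softmax(exit_heads[idx](hidden)[:, -1], dim=-1)
            confidence, token = probs.max(dim=-1)
            if confidence.item() >= threshold:
                return token  # Confident enough: skip the remaining layers.
    # No exit fired; fall back to the full model's output head.
    return final_head(hidden)[:, -1].argmax(dim=-1)
```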

Implications and Future Directions

The practical implications of this research lie in its potential to democratize access to and application of early-exit models across a broader set of researchers and developers. Because EE-Tuning is fully compatible with 3D parallelism, it holds promise for scalability, catering to applications that necessitate large-scale deployment.

The potential for future work is considerable. The research opens avenues for integrating training objectives beyond autoregressive language modeling into early-exit training. Furthermore, adapting EE-Tuning within the framework of knowledge distillation could enhance its applicability, and refining dynamic token-weighting methods could yield further improvements in model performance.

The discussion weighs the balance between maintaining high model performance and achieving efficiency. Although limitations are acknowledged, such as constraints on model adaptability during EE-Tuning, the methodology still provides a feasible solution under typical resource constraints. Consequently, EE-Tuning emerges as a viable method for efficiently adding early-exit mechanisms to existing LLMs, with significant implications for both computational linguistics and the wide array of domains employing LLMs.

Authors (5)
  1. Xuchen Pan (12 papers)
  2. Yanxi Chen (21 papers)
  3. Yaliang Li (117 papers)
  4. Bolin Ding (112 papers)
  5. Jingren Zhou (198 papers)
Citations (5)