Introduction
Accelerating the inference process of LLMs is a persistent challenge due to their significant computational demands. Traditionally, generation quality has been tied to the model's final layer, leaving intermediate layers underutilized. Varshney et al. target this inefficiency by exploring the potential of intermediate layers in LLMs for text generation tasks. Their approach, instruction tuning with Losses from the InTermediate layErs (LITE), equips intermediate layers with good generation ability without sacrificing the final layer's performance.
Accelerating Inference through Intermediate Layer Utilization
The authors identify a key limitation of LLMs instruction-tuned with loss computed only at the final layer: the last layer is well optimized for high-quality text generation, but the intermediate layers are not. This single-layer dependency rules out effective early exiting, in which the forward pass is stopped at an intermediate layer to save computation, because exiting early would typically degrade output quality. To rectify this, the authors propose instruction tuning with LITE, which enables intermediate layers to produce quality text outputs.
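To make the idea concrete, here is a minimal sketch of fixed early exiting in a decoder-only model, assuming the intermediate hidden state can be projected through the shared LM head; the attribute names (`embed_tokens`, `layers`, `final_norm`, `lm_head`) are illustrative assumptions, not the paper's implementation.

```python
import torch

def fixed_early_exit_logits(model, input_ids, exit_layer):
    """Illustrative fixed early exit: run only the first `exit_layer`
    transformer blocks and project that hidden state through the shared
    LM head. Attention masks and position handling are omitted for brevity."""
    hidden = model.embed_tokens(input_ids)        # (batch, seq, dim)
    for block in model.layers[:exit_layer]:       # deeper layers are skipped
        hidden = block(hidden)
    hidden = model.final_norm(hidden)
    return model.lm_head(hidden)                  # (batch, seq, vocab) logits
```

Without LITE, decoding from such intermediate logits typically yields poor text; the point of LITE is to make these exits usable.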
Instruction Tuning with LITE
LITE adds a weighted loss contribution from intermediate layers, fostering better alignment and generation ability within those layers without affecting the final layer's output. To validate this approach, the authors instruction-tune LLaMA-2 models on the Alpaca dataset and evaluate holistically across four human-instruction test sets. The experiments demonstrate that while intermediate layers do not inherently possess high-quality generation capabilities, they acquire such abilities with LITE.
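A minimal sketch of such an objective is shown below, assuming the intermediate losses reuse the same LM head as the final layer; the layer indices, weight, and function names are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def lite_loss(layer_hidden_states, lm_head, final_norm, labels,
              intermediate_layers=(8, 12, 16), intermediate_weight=1.0):
    """Sketch of a LITE-style objective: the usual next-token loss at the
    final layer plus weighted next-token losses at selected intermediate
    layers, all sharing the same LM head."""
    def next_token_ce(hidden):
        logits = lm_head(final_norm(hidden))
        # Shift so position t predicts token t+1; -100 marks padded labels.
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            labels[:, 1:].reshape(-1),
            ignore_index=-100,
        )

    loss = next_token_ce(layer_hidden_states[-1])   # standard final-layer loss
    for idx in intermediate_layers:                  # weighted intermediate losses
        loss = loss + intermediate_weight * next_token_ce(layer_hidden_states[idx])
    return loss
```

Because the extra terms only reweight the training objective, the architecture and the final layer's inference path remain unchanged.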
Dynamic Confidence-Based Early Exiting
Building on LITE, the authors introduce dynamic confidence-based early exiting. It uses the probability signals of intermediate layers' token predictions to assess agreement with the final layer's output, deciding on the fly whether to emit the next token early. Results indicate substantial gains in inference efficiency without a quality trade-off: 37.86% for the 7B model and 46.35% for the 13B model, while outputs retain semantic similarity and coherence even when exiting early.
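The per-token decision can be sketched as follows, assuming a shared LM head and a simple confidence threshold on the top token probability; the candidate exit layers, threshold value, attribute names, and batch-size-1 assumption are all illustrative, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def confident_early_exit_step(model, hidden, exit_layers=(8, 12, 16), threshold=0.9):
    """Sketch of confidence-based early exiting for one decoding step:
    after each candidate exit layer, check the top token probability from
    the shared LM head and stop early once it exceeds `threshold`.
    `hidden` is the already-embedded prefix; masks are omitted for brevity."""
    for i, block in enumerate(model.layers):
        hidden = block(hidden)
        if i in exit_layers:
            logits = model.lm_head(model.final_norm(hidden[:, -1]))
            probs = F.softmax(logits, dim=-1)
            top_prob, top_token = probs.max(dim=-1)
            if top_prob.item() >= threshold:      # confident: exit early
                return top_token, i + 1           # token and layers actually used
    # No confident intermediate prediction: fall back to the final layer.
    logits = model.lm_head(model.final_norm(hidden[:, -1]))
    return logits.argmax(dim=-1), len(model.layers)
```

The saving comes from the layers skipped whenever an intermediate prediction is confident enough, which is only productive because LITE has made those intermediate predictions trustworthy.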
The paper contributes to the growing body of work aiming to optimize the utilization of LLMs. By enhancing the representational quality of intermediate layers and allowing dynamic exits, their approach marks a significant step toward efficient inference in resource-intensive LLMs. This methodology not only achieves computational efficiency but does so with minimal impact on generation quality, making it a promising avenue for facilitating the broader adoption of LLMs.