Leveraging the true depth of LLMs (2502.02790v2)

Published 5 Feb 2025 in cs.LG and cs.CL

Abstract: LLMs demonstrate remarkable capabilities at the cost of high compute requirements. Recent studies have demonstrated that intermediate layers in LLMs can be removed or reordered without substantial accuracy loss; however, this insight has not yet been exploited to improve inference efficiency. Leveraging observed layer independence, we propose a novel method that groups consecutive layers into pairs evaluated in parallel, effectively restructuring the computational graph to enhance parallelism. Without requiring retraining or fine-tuning, this approach achieves an inference throughput improvement of 1.05x-1.20x on standard benchmarks, retaining 95%-99% of the original model accuracy. Empirical results demonstrate the practicality of this method in significantly reducing inference cost for large-scale LLM deployment. Additionally, we demonstrate that modest performance degradation can be substantially mitigated through lightweight fine-tuning, further enhancing the method's applicability.

The paper "Leveraging the true depth of LLMs" introduces a novel approach to enhance the inference speed of LLMs by exploiting the redundancy in their depth. The key idea is to parallelize the execution of consecutive transformer layers, thereby reducing the effective depth of the model without significantly impacting performance. This method, termed Layer Parallelism, involves grouping some layers into pairs that can be evaluated in parallel, leading to an improved rate of tokens generated per second.

The authors make the following contributions:

  • They explore different intervention strategies on pre-trained LLM layers and discover that certain transformations, particularly contiguous parallelization, preserve model performance.
  • They define a parallelization transform on the computational graph of two sequential Transformer layers and demonstrate that this operation can be stacked across several sequential pairs of layers without substantial performance degradation.
  • They exploit the parallelization of the computational graph to achieve approximately 1.20x faster model execution using multiple GPUs, while maintaining In-Context Learning (ICL) capabilities.
  • They demonstrate that fine-tuning a parallelized model can recover some of the lost performance while preserving the speed-up.

The paper begins by addressing the computational challenges associated with large LLMs, emphasizing the importance of efficient inference for reducing operational costs and latency. It reviews existing literature on pruning, quantization, and parallelism as methods for enhancing computational efficiency. The paper also discusses research on layer-level optimization strategies, such as the Staircase Transformer and Staggering Transformer.

The authors investigate the effective depth of LLMs by applying transformations such as shuffling, merging, and pruning to pre-trained Transformer layers. They find that shuffling large stretches of blocks has a surprisingly low impact on perplexity, suggesting that many layers operate at the same level of abstraction. This observation leads to the concept of effective depth, defined as the shortest depth required to efficiently leverage existing latent representations without significant performance loss.
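A minimal sketch of such an intervention, assuming a Hugging Face Llama-style model whose decoder blocks live in `model.model.layers` (attribute names vary by architecture); the layer range is illustrative:

```python
import random
import torch
from transformers import AutoModelForCausalLM

# Load a pre-trained model without any retraining or fine-tuning.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

# Shuffle a contiguous stretch of intermediate decoder layers in place,
# then compare the perplexity of the modified model against the original.
start, end = 8, 24
stretch = list(model.model.layers[start:end])
random.shuffle(stretch)
for i, layer in enumerate(stretch):
    model.model.layers[start + i] = layer
```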

The parallel execution strategy involves modifying the computational graph to allow divergent paths for consecutive transformer layers $\ell_k$ and $\ell_{k+1}$. The standard sequential output for these layers, given an input $x$, is represented by equation (SEQ):

$$
\begin{aligned}
y ={}& x + \text{A}_k(x) + \text{F}_k\big(x + \text{A}_k(x)\big) \\
&+ \text{A}_{k+1}\big(x + \text{A}_k(x) + \text{F}_k(x + \text{A}_k(x))\big) \\
&+ \text{F}_{k+1}\Big(x + \text{A}_k(x) + \text{F}_k(x + \text{A}_k(x)) \\
&\qquad\quad + \text{A}_{k+1}\big(x + \text{A}_k(x) + \text{F}_k(x + \text{A}_k(x))\big)\Big)
\end{aligned}
\tag{SEQ}
$$

where:

  • $x$ is the input
  • $\text{A}_k(\cdot)$ is the attention sub-block
  • $\text{F}_k(\cdot)$ is the feed-forward sub-block

The parallel approximation is represented by equation (PAR):

$$
\hat{y} = x + \text{A}_k(x) + \text{F}_k\big(x + \text{A}_k(x)\big) + \text{A}_{k+1}(x) + \text{F}_{k+1}\big(x + \text{A}_{k+1}(x)\big)
\tag{PAR}
$$

This approximation enables parallel execution of blocks $\ell_k$ and $\ell_{k+1}$ through divergent computational paths.
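A small numerical sketch of the two equations, with plain linear maps standing in as hypothetical sub-blocks (not the real attention and feed-forward internals):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D = 16
x = torch.randn(1, D)

# Toy stand-ins for the attention and feed-forward sub-blocks of layers k, k+1.
A_k, F_k = nn.Linear(D, D), nn.Linear(D, D)
A_k1, F_k1 = nn.Linear(D, D), nn.Linear(D, D)

# (SEQ): layer k+1 reads the full output of layer k.
z = x + A_k(x) + F_k(x + A_k(x))
y_seq = z + A_k1(z) + F_k1(z + A_k1(z))

# (PAR): both layers read the same input x, so they can run concurrently.
y_par = x + A_k(x) + F_k(x + A_k(x)) + A_k1(x) + F_k1(x + A_k1(x))

print((y_seq - y_par).abs().max())  # nonzero: PAR approximates SEQ
```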

The authors address challenges in parallelizing transformer blocks, such as the need to leverage efficient GPU kernels and the saturation of GPU resources. They extend the tensor parallelism scheme from Megatron to incorporate Layer Parallelism. In Layer Parallel Multi-Head Attention (MHA), the shapes of the query, key, and value weight matrices ($W_Q, W_K, W_V \in \mathbb{R}^{(n_h \cdot h_d) \times D}$) and of the output projection ($W_O \in \mathbb{R}^{D \times (n_h \cdot h_d)}$) are adjusted to enable parallel computation.
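A sketch of the weight-stacking trick for the query projection, under the shapes stated above (shown for $W_Q$ only; the same applies to $W_K$, $W_V$, and, transposed, to $W_O$); the sizes are illustrative Llama-2-7B-like values:

```python
import torch

D, n_h, h_d = 4096, 32, 128                # illustrative model dimensions

W_Q_k  = torch.randn(n_h * h_d, D)         # layer k query projection
W_Q_k1 = torch.randn(n_h * h_d, D)         # layer k+1 query projection

# Concatenating along the head dimension gives one (2 * n_h * h_d) x D matrix,
# so a single matmul (one GPU kernel launch) produces the queries of both
# layers; the result is then split back into per-layer heads.
W_Q_fused = torch.cat([W_Q_k, W_Q_k1], dim=0)

x = torch.randn(8, D)                      # 8 token embeddings
q_k, q_k1 = (x @ W_Q_fused.T).chunk(2, dim=-1)
```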

In Layer Parallel Feed Forward Network (FFN), the first layer's output dimensionality is doubled, and separate output projections are performed for each layer. The authors also address the challenges of handling Layer Normalization, applying separate normalization for MHA pre-normalization and interpolating weights for FFN pre-normalization.
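A similar sketch for the feed-forward path and the pre-normalization weights; for simplicity, each FFN is assumed to be a two-matrix MLP $W_\text{out}\,\sigma(W_\text{in}x)$ rather than Llama's gated variant, and the interpolation is shown as a plain average:

```python
import torch
import torch.nn.functional as F

D, D_ff = 4096, 11008

W_in_k,  W_out_k  = torch.randn(D_ff, D), torch.randn(D, D_ff)
W_in_k1, W_out_k1 = torch.randn(D_ff, D), torch.randn(D, D_ff)

# The first FFN matmul has its output dimensionality doubled by stacking the
# two input projections; the output projections remain separate per layer.
W_in_fused = torch.cat([W_in_k, W_in_k1], dim=0)      # (2 * D_ff, D)

x = torch.randn(8, D)
h_k, h_k1 = (x @ W_in_fused.T).chunk(2, dim=-1)
delta = F.silu(h_k) @ W_out_k.T + F.silu(h_k1) @ W_out_k1.T

# FFN pre-normalization: a single weight vector interpolating the two layers'
# original normalization weights.
ln_w_k, ln_w_k1 = torch.ones(D), torch.ones(D)
ln_w_interp = 0.5 * (ln_w_k + ln_w_k1)
```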

The authors evaluate Layer Parallelism across three dimensions: inference speed improvements, impact on In-Context Learning (ICL) performance, and the potential to recover model accuracy through fine-tuning. They use Llama 2 7B and Llama 3.2 3B models on a node with two A100 SXM4 80 GB GPUs. The ICL 5-shot accuracies are measured using the lm-eval package across tasks such as MMLU, PiQA, ARC Easy, ARC Challenge, Winogrande, OpenBookQA, and HellaSwag. Perplexity is evaluated against a subset of the test set of RedPajama.
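A hedged sketch of how such a 5-shot evaluation might be launched through lm-eval's Python entry point (the exact API and task names can differ across harness versions):

```python
import lm_eval

# Assumes the lm-evaluation-harness v0.4-style simple_evaluate entry point.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16",
    tasks=["mmlu", "piqa", "arc_easy", "arc_challenge",
           "winogrande", "openbookqa", "hellaswag"],
    num_fewshot=5,
)
print(results["results"])
```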

The results indicate that, for both Llama 2 7B and Llama 3.2 3B, there is an ending index of the parallelized layer sequence at which perplexity is minimized (28 and 25, respectively). For Llama 2 7B, Layer Parallelism on sequences greater than 14 layers results in a steep loss of ICL capabilities. Similarly, parallelizing more than 10 layers of Llama 3.2 3B shows a rapid decrease in performance. The effective depths of the two models at the parallel configurations before these drops are 25 and 23, a reduction of 21% and 18% of their original depths, respectively.

The speed gain is directly proportional to the reduction of the effective depth of the model. For an effective depth of 25 ($\Delta = 14$) in Llama 2 7B, an average speed-up of 1.29x is observed at the largest sequence length in the 1-token generation task. For an effective depth of 23 ($\Delta = 10$) in Llama 3.2 3B, a speed-up of 1.22x is reported.
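Since Llama 2 7B has 32 decoder layers and Llama 3.2 3B has 28, and pairing $\Delta$ layers removes $\Delta/2$ sequential steps, the expected speed-up under the assumption of uniform per-layer cost is roughly the ratio of original to effective depth, which lines up with the reported figures:

$$
\text{speed-up} \approx \frac{L}{L - \Delta/2}, \qquad
\frac{32}{32 - 7} = 1.28 \ \text{(Llama 2 7B)}, \qquad
\frac{28}{28 - 5} \approx 1.22 \ \text{(Llama 3.2 3B)}.
$$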

Fine-tuning the parallelized layers on random samples from RedPajama's training set improves MMLU accuracy from 83.6% to 94.4% of the baseline performance, demonstrating that much of the model's original capability can be recovered while maintaining the speed benefits of Layer Parallelism.

The paper acknowledges limitations, including variations in effectiveness across model scales and a performance degradation compared to the baseline. Determining the optimal configuration of parallel layer pairs remains an open challenge.

In conclusion, the paper presents Layer Parallelism, a method that exploits independence patterns between transformer layers to optimize LLM inference. By restructuring the computational graph to enable parallel execution of consecutive layer pairs, the authors achieve substantial speed improvements without retraining the model. The method reduces the effective depth of Llama 2 7B by 21% while largely maintaining performance, yielding up to a 1.29x improvement in inference speed; fine-tuning recovers a further 10.8 percentage points of relative ICL accuracy on MMLU (from 83.6% to 94.4% of the baseline).

Authors (5)
  1. Ramón Calvo González (1 paper)
  2. Daniele Paliotta (8 papers)
  3. Matteo Pagliardini (15 papers)
  4. Martin Jaggi (155 papers)
  5. François Fleuret (78 papers)