Papers
Topics
Authors
Recent
Search
2000 character limit reached

Locality-Aware Redundancy Pruning for LLM Depth Compression

Published 27 May 2026 in cs.LG and cs.AI | (2605.27786v1)

Abstract: LLMs are known to contain representational redundancy across network depth, making depth pruning an effective approach for improving inference efficiency. Existing one-shot pruning methods rely on local layer importance or fixed redundancy assumptions across architectures. We propose Locality-Aware Redundancy Pruning (LoRP), a training-free one-shot depth pruning framework guided by representation locality. We show that inter-layer redundancy can be either localized or globally distributed depending on the LLM architecture. To characterize this phenomenon, we introduce Representation Locality Score (RLS), derived from global inter-layer hidden-state similarity. Using a small calibration set, LoRP computes pairwise layer similarity, clusters layers by representational similarity, and allocates pruning according to residual intra-cluster redundancy. Experiments across diverse LLM families show improvements in both perplexity and downstream task accuracy.

Summary

  • The paper introduces the Representation Locality Score (RLS) to quantify inter-layer redundancy, guiding adaptive, training-free pruning strategies.
  • The LoRP framework leverages spectral clustering and a two-stage pruning process to achieve lower perplexity and improved downstream reasoning accuracy.
  • Empirical evaluations reveal that architecture-dependent, locality-aware pruning outperforms fixed heuristic methods, especially for models with distributed redundancy.

Locality-Aware Redundancy Pruning for LLM Depth Compression

Motivation and Representation Redundancy in Transformer Architectures

LLMs with deep Transformer stacks exhibit substantial representational redundancy across network depth, leading to opportunities for compression via layer-wise pruning. Previous training-free pruning approaches have generally assumed consistent redundancy structures across architectures or relied on heuristics based on local layer importance or contiguous depth regions, which can obscure heterogeneous redundancy patterns present in different LLMs. This paper introduces a formal analysis, using inter-layer hidden-state cosine similarity matrices, to demonstrate that some architectures have strongly localized redundancy (layer similarity concentrated near diagonal), while others present globally distributed redundancy (high similarity off-diagonal). This observation motivates the development of an architecture-dependent redundancy quantification, which is essential for effective depth pruning in Transformer-based LLMs. Figure 1

Figure 1: Pairwise inter-layer hidden-state cosine similarity matrices SS for five pre-trained LLMs; higher SijS_{ij} values indicate strong representational overlap between corresponding layers.

Representation Locality Score (RLS) and Depth Pruning Framework

The Representation Locality Score (RLS) is introduced as a global metric summarizing inter-layer similarity, quantifying the decay of representational overlap across depth. Higher RLS values indicate more localized redundancy, while lower values reflect distributed similarity. The paper leverages RLS as a guiding principle for training-free, one-shot depth pruning. The Locality-Aware Redundancy Pruning (LoRP) framework proceeds as follows: hidden states from all blocks are extracted on a calibration corpus, a pairwise similarity matrix is computed, spectral clustering is applied to partition layers into representational clusters based on the affinity matrix, and a two-stage pruning procedure is performed. Stage 1 removes one layer per cluster (coverage-aware), while Stage 2 prunes further according to intra-cluster redundancy. Figure 2

Figure 2: Overview of LoRP—hidden-state collection, similarity matrix computation, RLS summarization, spectral clustering, and two-stage pruning allocation.

Empirical Characterization of Redundancy Structures

Empirical analysis across five open-source LLMs reveals pronounced variance in redundancy patterns. LLaMA, OLMo, and Mistral exhibit steep similarity decay (localized redundancy), with higher RLS. In contrast, Qwen models manifest broadly distributed similarity (lower RLS), necessitating distributed pruning rather than contiguous layer removal. Figure 3

Figure 3: Mean inter-layer cosine similarity as a function of normalized layer distance; slower decay implies distributed redundancy.

Experimental Evaluation: Perplexity and Reasoning Performance

The effectiveness of LoRP is validated through perplexity and downstream reasoning benchmarks, utilizing nine zero-shot tasks on (ARC, HellaSwag, WinoGrande, BoolQ, OBQA, RTE, COPA, RACE). Across aggressive pruning budgets, LoRP consistently achieves lower perplexity and higher downstream accuracy compared to ShortGPT, LLM-Streamline, LaCo, and other baselines. Notably, LoRP’s architecture-adaptive pruning is especially robust for Qwen models, where other methods suffer significant degradation due to inappropriate contiguity assumptions. These results contravene the standard belief that fixed pruning heuristics suffice, highlighting the importance of locality-aware, architecture-dependent allocations. Figure 4

Figure 4: Visualization of pruned layer indices selected by multiple methods; LoRP yields architecture-dependent, distributed patterns for Qwen and localized patterns for LLaMA, Mistral.

Cluster Granularity and Ablation Study

Cluster granularity is critical: models with globally distributed redundancy benefit from larger KK in spectral clustering; those with localized redundancy are less sensitive to this parameter. Ablation shows that LoRP remains competitive or superior even as KK is varied, further supporting the need for redundancy-adaptive clustering. Figure 5

Figure 5: Effect of the number of representational clusters KK on downstream accuracy for different LLM architectures.

Implications and Future Directions

LoRP demonstrates that practical depth pruning must be guided by architecture-specific representation geometry, with RLS serving as a principled global metric. The findings demand a reassessment of current pruning strategies, advocating for explicit locality modeling in compression pipelines. The method’s training-free, one-shot design makes it readily deployable without recovery steps, although integration with post-pruning tuning could further increase compression ratios. Extending LoRP to broader architectures, attention-level redundancy analysis, and fully automated cluster selection are promising directions.

Conclusion

This work establishes that representational redundancy across depth is architecture-dependent and not universally contiguous. The proposed LoRP framework, guided by the Representation Locality Score, allocates pruning decisions in a manner reflective of each model’s redundancy structure, achieving improved perplexity and downstream accuracy over prior methods. The empirical results compel the community to abandon universal pruning heuristics in favor of locality-aware, adaptive compression strategies for Transformer LLMs.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 5 likes about this paper.