
EvoPress: Accurate Dynamic Model Compression via Evolutionary Search (2410.14649v2)

Published 18 Oct 2024 in cs.LG

Abstract: The high computational costs of LLMs have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on estimating the importance of a given layer, implicitly assuming that layers contribute independently to the overall compression error. We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: pruning a model further may even significantly recover performance. To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress.

Summary

  • The paper introduces EvoPress, an evolutionary search framework for dynamic LLM compression that refutes the conventional error monotonicity assumption underpinning previous methods.
  • EvoPress achieves state-of-the-art results across various compression techniques, demonstrating efficiency and rapid convergence on standard hardware like an RTX 3090 GPU.
  • This work provides a more efficient and nuanced methodology for reducing LLM size while maintaining accuracy, which is crucial for deploying AI models in resource-constrained environments.

In contemporary AI research, the growing size of LLMs such as those in the Llama, Mistral, and Phi families poses significant computational challenges. This paper presents EvoPress, a novel framework for dynamic model compression that discards a key assumption underpinning previous methods: error monotonicity. Conventional methods operate under the premise that the overall compression error tracks the sum of layer-wise errors, a premise the authors empirically refute.
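To make the refuted premise concrete, it can be stated as follows (the notation $e_i$, $E$, $c$ is ours, not the paper's):

$$\sum_{i=1}^{L} e_i(c) \le \sum_{i=1}^{L} e_i(c') \;\Longrightarrow\; E(c) \le E(c'),$$

where $c$ and $c'$ are per-layer compression profiles over $L$ layers, $e_i(c)$ is the local error introduced at layer $i$, and $E(c)$ is the end-to-end model error. The paper's counterexamples show that this implication can fail, so ranking profiles by summed layer errors is unreliable.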

The researchers introduce EvoPress, an evolutionary search framework with theoretical guarantees that addresses the complexity of non-uniform LLM compression. EvoPress efficiently explores per-layer compression alternatives to maximize accuracy under a fixed global compression constraint, outperforming benchmark methods in experiments on layer dropping, one-shot sparsification, and quantization.
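Under this framing, the search can be sketched as a constrained optimization (again in our own notation, which may differ from the paper's exact formulation):

$$\min_{c_1, \dots, c_L} \; E(c_1, \dots, c_L) \quad \text{s.t.} \quad \sum_{i=1}^{L} \mathrm{cost}(c_i) \le B,$$

where each $c_i$ is a compression level for layer $i$ drawn from a precomputed database (e.g., a sparsity level or a bitwidth), and $B$ is the global budget. Because $E$ is not monotone in the per-layer errors, EvoPress treats it as a black box and optimizes it directly with evolutionary search rather than through independent layer-importance scores.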

Key Contributions and Results

  • Non-Monotonic Error Refutation: The work first challenges the error monotonicity assumption, which equates a lower sum of per-layer errors with lower global error. The authors present empirical counterexamples showing that compression profiles with higher per-layer error sums can nonetheless yield better end-to-end performance.
  • EvoPress Implementation: The evolutionary framework resolves non-uniform compression tasks by dynamically adapting per-layer compression levels to minimize accuracy loss. Using an iterative candidate search, the algorithm assembles candidate models from a precomputed database of per-layer compression options and scores fitness by how far a candidate's outputs deviate from the base model's outputs (see the sketch after this list).
  • Experimental Validation: Experiments showcase EvoPress across multiple compression strategies, achieving state-of-the-art results. For instance, EvoPress delivers significant improvements for Llama-family models in perplexity and zero-shot evaluations by optimizing layer-wise quantization bitwidths and sparsity profiles under user-defined compression thresholds.
  • Efficiency and Scalability: The method converges rapidly and runs efficiently on widely available hardware (such as a single RTX 3090 GPU). The authors report a substantial reduction in search time relative to prior methods, achieved with far fewer model evaluations.
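The bullets above can be made concrete with a minimal, illustrative sketch of budget-preserving evolutionary search over a per-layer level assignment. Everything here (the dummy fitness function, the mutation operator, the constants) is a simplified stand-in for exposition, not the paper's actual implementation; in EvoPress, fitness is the divergence between a compressed candidate's outputs and the base model's outputs on calibration data.

import random

NUM_LAYERS = 32      # hypothetical model depth
NUM_LEVELS = 5       # hypothetical compression levels per layer (0 = most aggressive)

def fitness(profile):
    # Dummy stand-in for the real fitness, which in EvoPress is the
    # deviation of the compressed model's outputs from the base model's
    # outputs on a small calibration set (lower is better).
    return sum((lvl - 2) ** 2 * (i % 3) for i, lvl in enumerate(profile))

def mutate(profile):
    # Swap the levels of two randomly chosen layers. Since each level's
    # cost is the same in every layer here, the swap preserves total cost,
    # so every offspring stays within the global compression budget.
    child = list(profile)
    i, j = random.sample(range(NUM_LAYERS), 2)
    child[i], child[j] = child[j], child[i]
    return child

# Start from a uniform profile that meets the budget, then iterate:
# generate a small offspring pool and keep the best candidate if it improves.
profile = [2] * NUM_LAYERS
for _ in range(500):
    best_child = min((mutate(profile) for _ in range(8)), key=fitness)
    if fitness(best_child) < fitness(profile):
        profile = best_child

print("final per-layer levels:", profile)

A swap mutation is just one simple way to keep the budget invariant under mutation; the mutation and multi-step selection used in the paper are more involved, but the overall loop structure is the same.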

Implications and Future Directions

The implications of this work are substantial in both theory and practice. By refuting the error monotonicity assumption and demonstrating EvoPress's efficacy, the authors revisit a foundational premise of LLM compression. Practically, the work offers a more nuanced and efficient methodology for reducing model size while maintaining accuracy, a critical capability for deploying AI models in resource-constrained environments.

Noteworthy is the paper's discussion of EvoPress's agnosticism to model architecture and compression type, which extends its applicability beyond the tested models and methods. The observation that different compression techniques exhibit different error sensitivities underscores the value of automated adaptability, a principle embedded in EvoPress's design.

This paper opens avenues for future research, such as integrating additional compression methods into EvoPress's search procedure or adapting the framework to newer compression paradigms. Extending EvoPress to entire inference pipelines, potentially under adaptive inference-speed constraints, could further consolidate its practicality for widespread use.

In conclusion, EvoPress marks an innovative stride toward dynamic compression strategies. By combining validated efficiency with theoretical rigor, the authors pave the way for more compact, resource-efficient, and capable AI systems, and advance the state of model compression and optimization.
