
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization (2311.01544v3)

Published 2 Nov 2023 in cs.CL and cs.LG

Abstract: LLMs have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components' impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually -- and that FDTM can identify those -- while standard metrics result in deteriorated outcomes.

Authors (7)
  1. Björn Deiseroth (16 papers)
  2. Max Meuer (1 paper)
  3. Nikolas Gritsch (5 papers)
  4. Constantin Eichenberg (8 papers)
  5. Patrick Schramowski (48 papers)
  6. Matthias Aßenmacher (20 papers)
  7. Kristian Kersting (205 papers)
Citations (4)

Summary

This paper excerpt describes a method for sparsifying LLM components, guided by a metric called FDT_75, which quantifies the degradation caused by sparsification. The goal is to reach a target sparsity level for the entire model by determining appropriate sparsity levels for individual components.

The core of the method is an iterative algorithm that, in each step, refines the sparsity level of every component. The algorithm uses the FDT_75 metric to gauge how much degradation is introduced when a specific component is sparsified. For a given component c, it pre-calculates the FDT_75 value at four levels of added sparsity: 0%, t/2, t + t/2, and 100%, where t is the overall target sparsity for the current iteration. These pre-calculated values form a set of (sparsity percentage, FDT_75) points for each component.
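The pre-calculation step can be sketched as follows (a minimal illustration, not code from the paper; `evaluate_fdt75` is a hypothetical stand-in for evaluating FDT_75 on the model with one component sparsified to a given level):

```python
def precompute_points(components, t, evaluate_fdt75):
    """For each component, record (sparsity, FDT_75) at the four probe levels.

    t is the overall target sparsity of the current iteration; the probe
    levels 0, t/2, t + t/2, and 1.0 follow the description above.
    """
    levels = [0.0, t / 2, t + t / 2, 1.0]
    points = {}
    for c in components:
        # One FDT_75 evaluation per (component, sparsity level) pair.
        points[c] = [(s, evaluate_fdt75(c, s)) for s in levels]
    return points
```

These per-component curves are then reused by the threshold search described next, so each component's degradation behavior only has to be measured four times per iteration.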

The algorithm then starts from the maximum possible FDT_75 value (assumed to be 100 in the excerpt, representing no degradation) and iteratively lowers this threshold, denoted f. For each value of f, it uses linear interpolation on the pre-calculated points of each component to estimate the maximum sparsity percentage that can be applied to that component while keeping its FDT_75 value at or above f.
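The interpolation step amounts to inverting each component's piecewise-linear (sparsity, FDT_75) curve. A hypothetical helper (not code from the paper) that assumes FDT_75 decreases monotonically as sparsity grows might look like:

```python
def max_sparsity_for_threshold(points, f):
    """Largest sparsity whose linearly interpolated FDT_75 is still >= f.

    points: (sparsity, fdt75) pairs for one component, e.g. the four
    pre-calculated probe levels described above.
    """
    pts = sorted(points)                  # ascending sparsity
    if f >= pts[0][1]:
        return pts[0][0]                  # only the lowest level meets f
    if f <= pts[-1][1]:
        return pts[-1][0]                 # even full sparsity stays above f
    for (s0, f0), (s1, f1) in zip(pts, pts[1:]):
        if f1 <= f <= f0:                 # f falls on this segment
            return s1 if f0 == f1 else s0 + (f0 - f) / (f0 - f1) * (s1 - s0)
```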

After determining these candidate component sparsities for a given f, the algorithm computes the average sparsity across all components, weighted by their respective parameter counts. If this weighted average meets or exceeds the iteration's target sparsity t, the current set of individual component sparsities is taken as the solution for this iteration and output. If the weighted average is still below t, the algorithm decreases f by a small amount (e.g., 1) and repeats. This continues until the target sparsity is reached or the threshold f reaches 0.
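Putting the pieces together, the threshold-lowering loop might be sketched as below. All names are illustrative and the curve inversion is inlined; the paper's actual implementation may differ:

```python
def allocate_sparsities(points, weights, t, f_step=1.0):
    """Lower the FDT_75 threshold f from 100 until the weighted average
    sparsity reaches the iteration target t.

    points:  {component: [(sparsity, fdt75), ...]} probe curves
    weights: {component: parameter count}
    """
    def max_sparsity(pts, f):
        # Largest sparsity whose interpolated FDT_75 still meets f,
        # assuming FDT_75 decreases as sparsity grows.
        pts = sorted(pts)
        if f >= pts[0][1]:
            return pts[0][0]
        if f <= pts[-1][1]:
            return pts[-1][0]
        for (s0, f0), (s1, f1) in zip(pts, pts[1:]):
            if f1 <= f <= f0:
                return s1 if f0 == f1 else s0 + (f0 - f) / (f0 - f1) * (s1 - s0)

    total_weight = sum(weights.values())
    f = 100.0
    while f >= 0.0:
        sparsity = {c: max_sparsity(pts, f) for c, pts in points.items()}
        avg = sum(weights[c] * s for c, s in sparsity.items()) / total_weight
        if avg >= t:
            break                     # target met: output this allocation
        f -= f_step                   # allow a bit more degradation
    return sparsity
```

Because all components share a single threshold f, components with flat degradation curves naturally absorb more sparsity than fragile ones, which is the prioritization the text describes.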

The output of one iteration is a map specifying the desired sparsity for each individual component, which collectively achieves the overall target sparsity t while attempting to minimize degradation as measured by FDT_75. This component-wise sparsity information can then be used to apply pruning, or potentially to inform quantization strategies, optimizing model size and efficiency while controlling the performance impact.

In essence, the algorithm provides a structured way to navigate the trade-off between model sparsity (and thus size and efficiency) and performance degradation, using the FDT_75 metric to guide component-level decisions. Components whose sparsification causes little degradation according to FDT_75 receive higher sparsity, while the model as a whole maintains the target overall sparsity.