
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization (2311.01544v3)

Published 2 Nov 2023 in cs.CL and cs.LG

Abstract: LLMs have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components' impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually -- and that FDTM can identify those -- while standard metrics result in deteriorated outcomes.

Authors (7)
  1. Björn Deiseroth (16 papers)
  2. Max Meuer (1 paper)
  3. Nikolas Gritsch (5 papers)
  4. Constantin Eichenberg (8 papers)
  5. Patrick Schramowski (48 papers)
  6. Matthias Aßenmacher (20 papers)
  7. Kristian Kersting (205 papers)
Citations (4)

Summary

This paper excerpt describes a method for sparsifying LLM components, guided by a metric called FDT_75, which quantifies the degradation caused by sparsification. The goal is to reach a target sparsity level for the entire model by determining appropriate sparsity levels for individual components.

The core of the method is an iterative algorithm that, in each step, refines the sparsity level of every component. The algorithm uses the FDT_75 metric to gauge how much degradation is introduced when a specific component is sparsified. For a given component c, it pre-calculates the FDT_75 value at four levels of added sparsity: 0%, t/2, t + t/2, and 100%, where t is the overall target sparsity for the current iteration. These pre-calculated values form a set of (sparsity percentage, FDT_75) points for each component.
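The pre-calculation step can be sketched as follows (a minimal illustration, not code from the paper; `evaluate_fdt75` is a hypothetical stand-in for evaluating FDT_75 on the model with one component sparsified to a given level):

```python
def precompute_points(components, t, evaluate_fdt75):
    """For each component, record (sparsity, FDT_75) at the four probe levels.

    t is the overall target sparsity of the current iteration; the probe
    levels 0, t/2, t + t/2, and 1.0 follow the description above.
    """
    levels = [0.0, t / 2, t + t / 2, 1.0]
    points = {}
    for c in components:
        # One FDT_75 evaluation per (component, sparsity level) pair.
        points[c] = [(s, evaluate_fdt75(c, s)) for s in levels]
    return points
```

These per-component curves are then reused by the threshold search described next, so each component's degradation behavior only has to be measured four times per iteration.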

The algorithm then starts from the maximum possible FDT_75 value (assumed to be 100 in the excerpt, representing no degradation) and iteratively lowers this threshold, denoted f. For each value of f, it uses linear interpolation on the pre-calculated points of each component to estimate the maximum sparsity percentage that can be applied to that component while keeping its FDT_75 value at or above f.
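The interpolation step amounts to inverting each component's piecewise-linear (sparsity, FDT_75) curve. A hypothetical helper (not code from the paper) that assumes FDT_75 decreases monotonically as sparsity grows might look like:

```python
def max_sparsity_for_threshold(points, f):
    """Largest sparsity whose linearly interpolated FDT_75 is still >= f.

    points: (sparsity, fdt75) pairs for one component, e.g. the four
    pre-calculated probe levels described above.
    """
    pts = sorted(points)                  # ascending sparsity
    if f >= pts[0][1]:
        return pts[0][0]                  # only the lowest level meets f
    if f <= pts[-1][1]:
        return pts[-1][0]                 # even full sparsity stays above f
    for (s0, f0), (s1, f1) in zip(pts, pts[1:]):
        if f1 <= f <= f0:                 # f falls on this segment
            return s1 if f0 == f1 else s0 + (f0 - f) / (f0 - f1) * (s1 - s0)
```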

After determining these candidate component sparsities for a given f, the algorithm computes the average sparsity across all components, weighted by their respective parameter counts. If this weighted average meets or exceeds the iteration's target sparsity t, the current set of individual component sparsities is taken as the solution for this iteration and output. If the weighted average is still below t, the algorithm decreases f by a small amount (e.g., 1) and repeats. This continues until the target sparsity is reached or the threshold f reaches 0.
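Putting the pieces together, the threshold-lowering loop might be sketched as below. All names are illustrative and the curve inversion is inlined; the paper's actual implementation may differ:

```python
def allocate_sparsities(points, weights, t, f_step=1.0):
    """Lower the FDT_75 threshold f from 100 until the weighted average
    sparsity reaches the iteration target t.

    points:  {component: [(sparsity, fdt75), ...]} probe curves
    weights: {component: parameter count}
    """
    def max_sparsity(pts, f):
        # Largest sparsity whose interpolated FDT_75 still meets f,
        # assuming FDT_75 decreases as sparsity grows.
        pts = sorted(pts)
        if f >= pts[0][1]:
            return pts[0][0]
        if f <= pts[-1][1]:
            return pts[-1][0]
        for (s0, f0), (s1, f1) in zip(pts, pts[1:]):
            if f1 <= f <= f0:
                return s1 if f0 == f1 else s0 + (f0 - f) / (f0 - f1) * (s1 - s0)

    total_weight = sum(weights.values())
    f = 100.0
    while f >= 0.0:
        sparsity = {c: max_sparsity(pts, f) for c, pts in points.items()}
        avg = sum(weights[c] * s for c, s in sparsity.items()) / total_weight
        if avg >= t:
            break                     # target met: output this allocation
        f -= f_step                   # allow a bit more degradation
    return sparsity
```

Because all components share a single threshold f, components with flat degradation curves naturally absorb more sparsity than fragile ones, which is the prioritization the text describes.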

The output of one iteration is a map specifying the desired sparsity for each individual component, which collectively achieves the overall target sparsity t while attempting to minimize degradation as measured by FDT_75. This component-wise sparsity information can then be used to apply pruning, or potentially to inform quantization strategies, optimizing model size and efficiency while controlling the performance impact.

In essence, the algorithm provides a structured way to navigate the trade-off between model sparsity (and thus size and efficiency) and performance degradation, using the FDT_75 metric to guide component-level decisions. Components whose sparsification causes little degradation according to FDT_75 receive higher sparsity, while the model as a whole maintains the target overall sparsity.