This paper excerpt describes a method for sparsifying the components of an LLM, guided by a metric called FDT75, which quantifies the degradation caused by sparsification. The goal is to reach a target sparsity level for the entire model by choosing an appropriate sparsity level for each individual component.
The core of the method is an iterative algorithm that, in each step, refines the sparsity levels for each component. The algorithm uses the FDT75 metric to understand how much degradation is introduced when a specific component is sparsified. For a given component c, the algorithm pre-calculates the FDT75 value for four levels of added sparsity to that component: 0%, t/2, t+t/2, and 100%, where t is the overall target sparsity for the current iteration. These pre-calculated values form a set of points (sparsity percentage, FDT75) for each component.
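The sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `measure_fdt75` is a hypothetical callable standing in for whatever evaluation produces the FDT75 value, and clamping levels above 100% is an assumption the excerpt does not state.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Point = Tuple[float, float]  # (added sparsity in percent, FDT75 value)


def sample_levels(t: float) -> List[float]:
    """Return the added-sparsity levels named in the text: 0, t/2, t + t/2, 100."""
    levels = [0.0, t / 2, t + t / 2, 100.0]
    # Assumption: a level above 100% is clamped to 100, and duplicates are
    # dropped, so fewer than four levels can remain when t is large.
    return sorted({min(level, 100.0) for level in levels})


def build_points(
    components: Iterable[str],
    t: float,
    measure_fdt75: Callable[[str, float], float],  # hypothetical evaluator
) -> Dict[str, List[Point]]:
    """Pre-calculate (sparsity, FDT75) points for every component."""
    return {
        c: [(s, measure_fdt75(c, s)) for s in sample_levels(t)]
        for c in components
    }
```

For a target of t = 40, the sampled levels would be 0%, 20%, 60%, and 100%.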
The algorithm starts with the FDT75 threshold, denoted f, at its maximum value of 100, which the excerpt appears to treat as no degradation, and lowers it step by step. For each value of f, it linearly interpolates between the pre-calculated points of each component to estimate the maximum sparsity percentage that the component can take while its FDT75 value stays at or above f. Lowering f therefore tolerates more degradation and admits more sparsity, which is what allows the search to move toward the target.
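The interpolation for a single component might look like the sketch below. It assumes that FDT75 is a score where a higher value means less degradation and that it does not increase as sparsity grows; the excerpt does not state either property explicitly, and the function name is illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (added sparsity in percent, FDT75 value)


def max_sparsity_at(points: List[Point], f: float) -> float:
    """Estimate the largest sparsity whose interpolated FDT75 is still >= f.

    Assumption: FDT75 is non-increasing in sparsity (more sparsity never
    improves the score).
    """
    pts = sorted(points)
    best = pts[0][0]  # fall back to the lowest sampled sparsity
    for (s0, d0), (s1, d1) in zip(pts, pts[1:]):
        if d1 >= f:
            # The whole segment stays at or above the threshold.
            best = s1
        elif d0 >= f:
            # The curve crosses f inside this segment (d0 >= f > d1),
            # so solve the linear segment for the crossing point.
            best = s0 + (d0 - f) / (d0 - d1) * (s1 - s0)
            break
        else:
            break
    return best
```

With the points (0, 100), (20, 90), (60, 50), (100, 0) and f = 70, the crossing lies halfway along the segment from 20% to 60%, so the estimate is 40%.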
After determining these potential component sparsities for a given f, the algorithm calculates the weighted average sparsity across all components (weighted by their respective weight counts). If this weighted average meets or exceeds the target sparsity t for the iteration, the current set of individual component sparsities is considered the solution for this iteration, and the algorithm outputs these values. If the weighted average sparsity is still below the target t, the algorithm decreases f by a small amount (e.g., 1) and repeats the process. This iterative decrease in f continues until the target sparsity is reached or the FDT75 threshold f reaches 0.
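The threshold search described above can be sketched as a short loop. This is a hedged illustration rather than the authors' implementation: `estimate` is a hypothetical callable returning a component's maximum sparsity for a threshold f (for example, an interpolation over the pre-calculated points), and returning `None` when the target is never reached is an assumption, since the excerpt does not say what happens in that case.

```python
from typing import Callable, Dict, Optional, Tuple


def allocate_sparsity(
    weight_counts: Dict[str, int],
    estimate: Callable[[str, float], float],  # hypothetical per-component estimator
    target: float,
    f_start: float = 100.0,
    step: float = 1.0,
) -> Optional[Tuple[float, Dict[str, float]]]:
    """Lower f until the weighted average sparsity reaches the target."""
    total = sum(weight_counts.values())
    f = f_start
    while f >= 0:  # assumption: f = 0 itself is still evaluated
        sparsities = {c: estimate(c, f) for c in weight_counts}
        # Average weighted by each component's number of weights.
        average = sum(weight_counts[c] * sparsities[c] for c in weight_counts) / total
        if average >= target:
            return f, sparsities
        f -= step
    return None  # assumption: signal failure when the target is unreachable


# Usage with made-up closed-form curves in place of interpolation:
# "attn" loses one FDT75 point per percent of sparsity, "mlp" half a point.
def toy_estimate(component: str, f: float) -> float:
    slack = 100.0 - f
    return slack if component == "attn" else min(100.0, 2.0 * slack)


result = allocate_sparsity({"attn": 1000, "mlp": 3000}, toy_estimate, target=50.0)
```

In this toy setup the more robust component ("mlp") ends up with the higher sparsity, which matches the intent of letting the threshold, rather than a uniform rate, decide each component's share.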
The output of one iteration is a map specifying the desired sparsity for each individual component, which collectively achieves the overall target sparsity t while attempting to minimize degradation as measured by FDT75. This component-wise sparsity information can then be used to apply pruning or potentially inform quantization strategies, optimizing model size and efficiency while controlling performance impact.
In essence, the algorithm provides a structured way to navigate the trade-off between model sparsity (and thus size/efficiency) and performance degradation, using a specific metric (FDT75) to guide component-level decisions. It prioritizes components whose sparsification causes less degradation according to the FDT75 metric, allowing for higher sparsity in those parts of the model while maintaining a target overall sparsity.