- The paper demonstrates that compressed LLMs can show minimal changes in accuracy while exhibiting significant behavioral flips in their responses.
- It quantifies these flips at roughly 5% to 13.6% of answers, a discrepancy that accuracy metrics fail to capture.
- The study advocates adopting KL-Divergence and flips as comprehensive evaluation metrics to better reflect real-world model performance.
Analyzing Compression Metrics for LLMs: Beyond Accuracy
The efficiency of LLMs has garnered significant attention, driving the development of various model compression techniques. These techniques, including quantization, key-value cache compression, pruning, and sparsification, aim to reduce the computational cost and latency of LLMs. Despite these advances, the prevalent practice for evaluating compressed models is to rely predominantly on accuracy metrics, which, as this paper argues, can be insufficient.
Revisiting the Premise: Beyond Solely Accuracy
The authors cogently argue that while aggregate accuracy may suggest negligible differences between baseline and compressed models, this metric often masks underlying behavioral changes induced by compression. They introduce the notion of "flips": answers that change from correct to incorrect or vice versa. Because flips in opposite directions cancel in aggregate accuracy, a model in which, say, 5% of answers flip each way reports an unchanged score even though 10% of its responses differ. Through an extensive evaluation of multiple compression techniques across various datasets and models, the authors demonstrate that accuracy alone fails to capture significant deviations in model behavior as experienced by end users.
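To make the flips metric concrete, the following is a minimal sketch (not the paper's code) of how it might be computed from per-example correctness labels for a baseline and a compressed model; the function name, data layout, and toy numbers are illustrative assumptions.

```python
from typing import Sequence

def flip_rate(baseline_correct: Sequence[bool],
              compressed_correct: Sequence[bool]) -> dict:
    """Compare per-example correctness of a baseline and a compressed model.

    Returns each model's accuracy plus the fraction of examples whose
    outcome flips (correct -> incorrect or incorrect -> correct).
    """
    assert len(baseline_correct) == len(compressed_correct)
    n = len(baseline_correct)

    correct_to_incorrect = sum(b and not c for b, c in zip(baseline_correct, compressed_correct))
    incorrect_to_correct = sum(c and not b for b, c in zip(baseline_correct, compressed_correct))

    return {
        "baseline_accuracy": sum(baseline_correct) / n,
        "compressed_accuracy": sum(compressed_correct) / n,
        # Flips in opposite directions cancel in aggregate accuracy,
        # which is exactly why accuracy alone can hide them.
        "flip_rate": (correct_to_incorrect + incorrect_to_correct) / n,
    }

# Toy example: both models score 80%, yet 20% of the answers flipped.
baseline   = [True, True, True, True,  False, True, True, True, True, False]
compressed = [True, True, True, False, True,  True, True, True, True, False]
print(flip_rate(baseline, compressed))
```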
Detailed Evaluation Across Metrics
The paper evaluates models on both qualitative and quantitative fronts. Key observations include:
- Accuracy Consistency: Across benchmarks and compression schemes, the change in accuracy between baseline and compressed models was minimal (≤ 2%), including on tasks such as MMLU, PIQA, and ARC.
- Prevalence of Flips: Despite similar accuracy, a substantial fraction of answers flipped (≥ 5%), with some schemes reaching flip rates as high as 13.6%, a divergence in model behavior that accuracy metrics fail to capture.
Empirical Insights: Quantitative and Qualitative Divergence
Figure 1 in the paper succinctly illustrates negligible accuracy differences across six quantization schemes on seven benchmark tasks, juxtaposed with substantial flips. This discrepancy underlines the inadequacy of accuracy as the sole metric for evaluating compressed models. Through experiments with layer dropping and WANDA pruning, the paper further shows that flips grow as the level of compression increases, reinforcing that model divergence can widen even while accuracy remains stable.
Distance Metrics: Proposing KL-Divergence and Flips
The authors advocate for the inclusion of distance metrics, specifically KL-Divergence and flips, to provide a more holistic evaluation of compressed models:
- KL-Divergence: Measures the divergence between the output probability distributions of the baseline and compressed models, capturing underlying changes that accuracy does not.
- Flips: A straightforward count of answers that change between correct and incorrect, offering a tangible, user-facing measure of behavioral change.
Their analysis shows that KL-Divergence and flips are well-correlated, suggesting that flips can serve as a reliable proxy for more complex distance metrics.
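As an illustration of the KL-based comparison, the sketch below averages the KL divergence between the baseline and compressed models' next-token distributions over a batch of logits. This is a hypothetical harness, not the paper's implementation: the tensor shapes, variable names, and random toy inputs are assumptions.

```python
import torch
import torch.nn.functional as F

def mean_kl_divergence(baseline_logits: torch.Tensor,
                       compressed_logits: torch.Tensor) -> float:
    """Average KL(P_baseline || Q_compressed) over next-token distributions.

    Both tensors are assumed to have shape (num_positions, vocab_size),
    e.g. logits gathered at the scored answer positions of a benchmark.
    """
    baseline_probs = F.softmax(baseline_logits, dim=-1)              # P
    compressed_log_probs = F.log_softmax(compressed_logits, dim=-1)  # log Q
    # F.kl_div expects log-probabilities of Q as `input` and probabilities
    # of P as `target`; 'batchmean' averages over the positions.
    return F.kl_div(compressed_log_probs, baseline_probs,
                    reduction="batchmean").item()

# Toy usage: random logits stand in for real model outputs.
torch.manual_seed(0)
base = torch.randn(4, 32000)               # hypothetical vocab size
comp = base + 0.1 * torch.randn(4, 32000)  # mildly perturbed "compressed" model
print(mean_kl_divergence(base, comp))
```

Under these assumptions, a low average KL alongside a low flip rate would indicate that the compressed model tracks the baseline closely, which is consistent with the correlation the authors report.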
Implications of Findings
Practical Implications
For practical deployments, especially applications involving free-form text generation, accuracy can be an inadequate proxy for a compression scheme's real impact. The paper uses MT-Bench for qualitative evaluation and finds that higher flip rates correlate with worse performance on multi-turn dialogue tasks, further underscoring the need for more comprehensive metrics. For instance, the Llama2-70b chat model, when compressed using 8-bit and 4-bit quantization, exhibited notable degradation in response quality despite minimal accuracy differences.
Theoretical Implications
Theoretically, the paper provides a compelling case for rethinking how we evaluate LLMs post-compression. It underscores the necessity for metrics that accurately reflect model behavior changes, advocating for the wider adoption and standardization of distance metrics like KL-Divergence and flips in AI research and practice.
Future Research Directions
This exploration opens multiple avenues for future research:
- Development of New Metrics: Investigations could aim at refining existing distance metrics or developing novel ones that better capture user-centric model behaviors.
- Extended Evaluations: Broadening the evaluation to include more diverse models and real-world applications can provide richer insights into the broader applicability and reliability of proposed metrics.
- Compression Techniques: As model compression techniques evolve, continuous assessment and recalibration of evaluation metrics will be necessary to align them with advancements.
Conclusion
This paper makes a significant contribution to the discourse on LLM evaluation by highlighting the limitations of accuracy and proposing a shift to more comprehensive distance metrics. As the AI community progresses towards deploying more efficient LLMs, ensuring that these models retain their effectiveness in real-world applications necessitates the adoption of robust evaluation frameworks. Embracing metrics like KL-Divergence and flips can bridge the gap between model efficiency and user experience, fostering the development of LLMs that are both computationally efficient and reliable.