
Accuracy is Not All You Need (2407.09141v1)

Published 12 Jul 2024 in cs.LG

Abstract: When LLMs are compressed using techniques such as quantization, the predominant way to demonstrate the validity of such techniques is by measuring the model's accuracy on various benchmarks. If the accuracies of the baseline model and the compressed model are close, it is assumed that there was negligible degradation in quality. However, even when the accuracy of baseline and compressed model are similar, we observe the phenomenon of flips, wherein answers change from correct to incorrect and vice versa in proportion. We conduct a detailed study of metrics across multiple compression techniques, models and datasets, demonstrating that the behavior of compressed models as visible to end-users is often significantly different from the baseline model, even when accuracy is similar. We further evaluate compressed models qualitatively and quantitatively using MT-Bench and show that compressed models are significantly worse than baseline models in this free-form generative task. Thus, we argue that compression techniques should also be evaluated using distance metrics. We propose two such metrics, KL-Divergence and flips, and show that they are well correlated.

Citations (1)

Summary

  • The paper demonstrates that compressed LLMs exhibit minimal accuracy changes while revealing significant behavioral flips in model responses.
  • It quantifies that flips can range from 5% to 13.6%, indicating discrepancies that accuracy metrics fail to capture.
  • The study advocates adopting KL-Divergence and flips as comprehensive evaluation metrics to better reflect real-world model performance.

Analyzing Compression Metrics for LLMs: Beyond Accuracy

In the modern era of artificial intelligence, the efficiency of LLMs has garnered significant attention, leading to the development of various model compression techniques. These techniques, including quantization, key-value cache compression, pruning, and sparsification, aim to reduce the computational cost and latency of LLMs. Despite these advancements, the prevalent practice for evaluating the efficacy of compressed models has been to rely predominantly on accuracy metrics, which, as this paper argues, may be insufficient.

Revisiting the Premise: Beyond Solely Accuracy

The authors cogently argue that while the aggregate accuracy might suggest negligible differences between baseline and compressed models, this metric often masks underlying behavioral changes induced by compression. They introduce a phenomenon termed "flips," where answers change from correct to incorrect and vice versa. Through an extensive evaluation of multiple compression techniques across various datasets and models, they demonstrate that accuracy alone fails to capture significant deviations in model behavior as seen by end-users.

Detailed Evaluation Across Metrics

The paper evaluates models on both qualitative and quantitative fronts. Key observations include:

  • Accuracy Consistency: Across various benchmarks and compression schemes, the change in accuracy metric between baseline and compressed models was minimal (≤ 2%). This includes notable tasks such as MMLU, PIQA, ARC, and others.
  • Prevalence of Flips: Despite similar accuracy, a substantial number of flips (≥ 5%) were observed, with some schemes exhibiting flips as high as 13.6%. This highlights a severe divergence in model behavior that accuracy metrics fail to capture.
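The flips metric described above can be sketched in a few lines; the function names and toy data here are illustrative, not taken from the paper. The key point is that two models can have identical accuracy while disagreeing on a large fraction of individual questions:

```python
# Sketch of the "flips" metric: the fraction of questions whose
# correctness changes between the baseline and compressed model,
# in either direction (correct -> incorrect or incorrect -> correct).
# Inputs are hypothetical per-question correctness flags.

def accuracy(correct):
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)

def flip_rate(baseline_correct, compressed_correct):
    """Fraction of questions whose correctness flipped after compression."""
    assert len(baseline_correct) == len(compressed_correct)
    flips = sum(b != c for b, c in zip(baseline_correct, compressed_correct))
    return flips / len(baseline_correct)

# Toy example: both models score 60% accuracy, yet 40% of the
# individual answers changed -- exactly the discrepancy the paper flags.
base = [True, True, True, False, False]
comp = [True, True, False, True, False]
print(accuracy(base), accuracy(comp))  # 0.6 0.6
print(flip_rate(base, comp))           # 0.4
```

Because correct-to-incorrect and incorrect-to-correct changes occur in similar proportion, they cancel out in aggregate accuracy while the flip rate exposes them.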

Empirical Insights: Quantitative and Qualitative Divergence

Figure 1 in the paper succinctly illustrates negligible accuracy differences across six quantization schemes on seven benchmark tasks, juxtaposed with substantial flips. This discrepancy underlines the inadequacy of accuracy as the sole metric for evaluating compressed models. Through experiments with layer dropping and WANDA pruning, the paper shows that flips increase in direct correlation with the compression level, reinforcing the notion that model divergence increases even if accuracy remains stable.

Distance Metrics: Proposing KL-Divergence and Flips

The authors advocate for the inclusion of distance metrics, specifically KL-Divergence and flips, to provide a more holistic evaluation of compressed models:

  • KL-Divergence: Measures the divergence in probability distributions between baseline and compressed models, serving as an indicator of underlying changes not captured by accuracy.
  • Flips: A straightforward count of answers that change between correct and incorrect, providing a tangible measure of behavioral change from the user's perspective.

Their analysis shows that KL-Divergence and flips are well-correlated, suggesting that flips can serve as a reliable proxy for more complex distance metrics.
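A minimal sketch of the KL-Divergence side of this comparison is shown below, assuming access to the per-position vocabulary logits of both models; the helper names and averaging scheme are illustrative, not the paper's exact implementation:

```python
# Sketch: average token-level KL(P || Q) between a baseline model (P)
# and a compressed model (Q), given logits over the vocabulary at each
# token position. All names here are illustrative assumptions.
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) at one token position; 0 iff the distributions match."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(baseline_logits, compressed_logits):
    """Average KL over all token positions in a sequence."""
    kls = [kl_divergence(p, q)
           for p, q in zip(baseline_logits, compressed_logits)]
    return sum(kls) / len(kls)
```

Identical logits give a divergence of zero, and any distributional drift introduced by compression yields a strictly positive value, even when the argmax (and hence the benchmark answer) is unchanged, which is why KL-Divergence can detect degradation that accuracy misses.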

Implications of Findings

Practical Implications

For practical deployments, especially in applications requiring free-form text generation, accuracy might be an inadequate proxy for assessing model impact. The paper uses MT-Bench for qualitative evaluation and finds that higher flips correlate with worse performance in multi-turn dialogue tasks, further emphasizing the need for comprehensive metrics. For instance, the Llama2-70b chat model, when compressed using 8-bit and 4-bit quantization, exhibited notable degradation in response quality despite having minimal accuracy differences.

Theoretical Implications

Theoretically, the paper provides a compelling case for rethinking how we evaluate LLMs post-compression. It underscores the necessity for metrics that accurately reflect model behavior changes, advocating for the wider adoption and standardization of distance metrics like KL-Divergence and flips in AI research and practice.

Future Research Directions

This exploration opens multiple avenues for future research:

  • Development of New Metrics: Investigations could aim at refining existing distance metrics or developing novel ones that better capture user-centric model behaviors.
  • Extended Evaluations: Broadening the evaluation to include more diverse models and real-world applications can provide richer insights into the broader applicability and reliability of proposed metrics.
  • Compression Techniques: As model compression techniques evolve, continuous assessment and recalibration of evaluation metrics will be necessary to align them with advancements.

Conclusion

This paper makes a significant contribution to the discourse on LLM evaluation by highlighting the limitations of accuracy and proposing a shift to more comprehensive distance metrics. As the AI community progresses towards deploying more efficient LLMs, ensuring that these models retain their effectiveness in real-world applications necessitates the adoption of robust evaluation frameworks. Embracing metrics like KL-Divergence and flips can bridge the gap between model efficiency and user experience, fostering the development of LLMs that are both computationally efficient and reliable.
