Faithful Model Evaluation for Model-Based Metrics (2312.17254v1)

Published 19 Dec 2023 in cs.CL

Abstract: Statistical significance testing is used in NLP to determine whether the results of a study or experiment are likely to be due to chance or whether they reflect a genuine relationship. A key step in significance testing is the estimation of the confidence interval, which is a function of the sample variance. Calculating the sample variance is straightforward when evaluating against ground truth. In many cases, however, a metric model is used for evaluation; for example, to compare the toxicity of two LLMs, a toxicity classifier is used to score their outputs. Existing work usually does not account for the change in variance caused by metric model errors, which can lead to incorrect conclusions. In this work, we establish the mathematical foundation of significance testing for model-based metrics. With experiments on public benchmark datasets and a production system, we show that accounting for metric model errors when calculating sample variances for model-based metrics changes the conclusions in certain experiments.
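The abstract's central claim, that a confidence interval computed from the sample variance of metric-model scores ignores the metric model's own errors, can be made concrete with a small sketch. The snippet below is a minimal illustration, not the paper's derivation: the toy binomial data, the assumed classifier error rate, and the simple additive variance-inflation term are all assumptions made here for exposition, showing only where such a correction would enter a standard paired test.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: compare the toxicity of two LLMs on the same prompts,
# scoring outputs with a metric model (a toxicity classifier) rather than
# ground-truth labels. scores_a / scores_b are per-sample classifier verdicts.
rng = np.random.default_rng(0)
n = 1000
scores_a = rng.binomial(1, 0.12, size=n).astype(float)  # LLM A flagged toxic
scores_b = rng.binomial(1, 0.10, size=n).astype(float)  # LLM B flagged toxic

# Standard paired z-test: treats classifier outputs as if they were ground truth.
diff = scores_a - scores_b
mean_diff = diff.mean()
naive_var = diff.var(ddof=1) / n            # variance of the mean difference
z_naive = mean_diff / np.sqrt(naive_var)
p_naive = 2 * stats.norm.sf(abs(z_naive))

# Illustrative correction: inflate the variance with a term reflecting the
# metric model's error, e.g. estimated on a held-out human-labeled set.
# (The paper's exact correction differs; this only shows where it enters.)
err_rate = 0.05                              # assumed classifier error rate
metric_err_var = err_rate * (1 - err_rate) / n
adjusted_var = naive_var + 2 * metric_err_var  # errors hit both systems' scores
z_adj = mean_diff / np.sqrt(adjusted_var)
p_adjusted = 2 * stats.norm.sf(abs(z_adj))

print(f"naive p-value:    {p_naive:.4f}")
print(f"adjusted p-value: {p_adjusted:.4f}")  # larger: significance is harder to claim
```

As the abstract notes, the practical consequence is that a difference that looks significant under the naive variance can fall below the significance threshold once metric-model error is folded in.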

