Efficient Lifelong Model Evaluation in an Era of Rapid Progress (2402.19472v2)
Abstract: Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. These benchmarks introduce a major challenge: the high cost of evaluating a growing number of models on very large sample sets. To address this, we introduce Sort & Search (S&S), an efficient model-evaluation framework that reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples. To test our approach at scale, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing 1.69M and 1.98M test samples for classification. Extensive empirical evaluations across over 31,000 models demonstrate that S&S achieves highly efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (about a 1000x reduction) on a single A100 GPU, with low approximation error and a memory cost under 100MB. Our work also highlights issues with current accuracy prediction metrics, suggesting a need to move towards sample-level evaluation metrics. We hope to guide future research by showing that our method's bottleneck lies primarily in generalizing Sort beyond a single rank order, not in improving Search.
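The abstract only names the two components. As a loose illustration of the sort-then-search idea, not the authors' implementation, here is a minimal numpy sketch assuming a binary model-by-sample correctness matrix from past evaluations. The function names (`sort_samples`, `search_threshold`), the difficulty proxy, and the step-function fit are all assumptions made for this sketch.

```python
# Minimal sketch of the sort-then-search idea described in the abstract.
# NOT the paper's implementation: the difficulty proxy and the
# step-function fit below are illustrative assumptions only.
import numpy as np

def sort_samples(correctness: np.ndarray) -> np.ndarray:
    """Rank samples from easiest to hardest.

    `correctness` is an (n_models, n_samples) binary matrix of past
    evaluations; here a sample's ease is taken to be the fraction of
    previously evaluated models that got it right.
    """
    ease = correctness.mean(axis=0)
    return np.argsort(-ease)  # sample indices, easiest first

def search_threshold(results_sorted: np.ndarray) -> int:
    """Fit a step function to a new model's 0/1 results.

    `results_sorted` holds the model's outcomes arranged easiest-to-
    hardest. We pick the cut k that minimizes disagreement with the
    pattern "correct on the first k samples, wrong on the rest",
    using two prefix sums instead of an O(n^2) scan.
    """
    wrong_in_prefix = np.concatenate(([0], np.cumsum(1 - results_sorted)))
    right_in_suffix = np.concatenate((np.cumsum(results_sorted[::-1])[::-1], [0]))
    mismatches = wrong_in_prefix + right_in_suffix  # cost of each cut k = 0..n
    return int(np.argmin(mismatches))

# Toy usage: 5 previously evaluated models on 8 samples, then one new model.
rng = np.random.default_rng(0)
past = rng.integers(0, 2, size=(5, 8))
order = sort_samples(past)
new_results = rng.integers(0, 2, size=8)[order]  # new model's results, sorted order
k = search_threshold(new_results)
print(f"estimated accuracy: {k / len(new_results):.2f}")
```

In a real lifelong setting, the new model would be evaluated on only a small sub-selected portion of the sorted samples before fitting the threshold, which is where the compute savings claimed in the abstract would come from.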