
Efficient Lifelong Model Evaluation in an Era of Rapid Progress (2402.19472v2)

Published 29 Feb 2024 in cs.LG and cs.CV

Abstract: Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. These benchmarks introduce a major challenge: the high cost of evaluating a growing number of models across very large sample sets. To address this challenge, we introduce an efficient framework for model evaluation, Sort & Search (S&S), which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples. To test our approach at scale, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing 1.69M and 1.98M test samples for classification. Extensive empirical evaluations across over 31,000 models demonstrate that S&S achieves highly-efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (about 1000x reduction) on a single A100 GPU, with low approximation error and memory cost of <100MB. Our work also highlights issues with current accuracy prediction metrics, suggesting a need to move towards sample-level evaluation metrics. We hope to guide future research by showing our method's bottleneck lies primarily in generalizing Sort beyond a single rank order and not in improving Search.


Summary

  • The paper presents a dynamic lifelong benchmark framework that combats overfitting by continuously expanding diverse test datasets.
  • It introduces the Sort & Search (S&S) method, which reduces computation from 180 GPU days to 5 GPU hours, achieving remarkable efficiency gains.
  • Empirical validation on ~31,000 models shows high accuracy correlations, underscoring the method’s potential for scalable, reliable evaluation.

Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress

The paper "Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress" by Prabhu et al. addresses significant challenges in the field of machine learning evaluations, where static benchmarks such as ImageNet and CIFAR-10 have traditionally dominated. As models are repeatedly tested on these static datasets, there exists an inherent risk of overfitting to the peculiarities of the dataset rather than genuinely learning to generalize. This research proposes using "Lifelong Benchmarks," a dynamic alternative designed to persistently expand and mitigate the overfitting issue by compiling ever-growing pools of test samples, effectively setting a new paradigm for model evaluation.

Methodological Contributions

Lifelong Benchmark Construction: The authors introduce the Lifelong-CIFAR10 and Lifelong-ImageNet benchmarks. Rather than being static, these benchmarks are designed to evolve by pooling a diverse array of samples drawn from many source domains so as to maintain broad distributional coverage: Lifelong-CIFAR10 comprises 1.69 million test samples, while Lifelong-ImageNet contains 1.98 million. This construction ensures that the benchmarks represent a wide cross-section of visual domains and are thus more resistant to overfitting.
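
As an illustration of the pooling idea (a minimal sketch under assumed inputs, not the authors' actual construction pipeline; the source names and deduplication rule are placeholders), a lifelong test pool can be built by concatenating constituent test sets and removing exact duplicates:

```python
def build_lifelong_pool(datasets):
    """Concatenate (sample_id, label) pairs from several test sets and
    drop exact duplicates so each sample appears in the pool once."""
    seen, pool = set(), []
    for name, samples in datasets.items():        # samples: list of (sample_id, label)
        for sample_id, label in samples:
            if sample_id in seen:
                continue                          # skip duplicates across sources
            seen.add(sample_id)
            pool.append({"source": name, "id": sample_id, "label": label})
    return pool

# Toy example with placeholder source datasets.
toy_sources = {
    "cifar10-test": [("a1", 3), ("b2", 7)],
    "cifar10.1":    [("b2", 7), ("c3", 0)],       # "b2" is deduplicated
    "cinic10":      [("d4", 5)],
}
print(len(build_lifelong_pool(toy_sources)))      # -> 4 unique samples
```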

Efficient Model Evaluation Framework: To address the evaluation cost that grows with continually expanding benchmark sizes, the authors propose an efficient evaluation methodology termed "Sort & Search" (S&S). Drawing inspiration from Computerized Adaptive Testing (CAT), S&S optimizes benchmarking by employing dynamic programming to rank test samples and to estimate model performance using significantly fewer of them. This method reduces computational overhead from 180 GPU days to merely 5 GPU hours on a single A100 GPU, an approximately 1000x efficiency gain.
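
To make the mechanics concrete, the following Python sketch illustrates the core idea under simplifying assumptions: samples are sorted by aggregate difficulty over previously evaluated models, and a new model's accuracy is estimated from a small probe set by fitting a single threshold in that ordering. All names are illustrative, and a brute-force threshold fit stands in for the paper's dynamic-programming step.

```python
import numpy as np

def sort_samples(correct_matrix):
    """Rank samples from easiest to hardest by how many past models got them right."""
    difficulty = correct_matrix.sum(axis=0)       # higher sum = easier sample
    return np.argsort(-difficulty)                # easy-first permutation of sample indices

def estimate_accuracy(new_model_correct, order, budget=100):
    """Probe the new model on `budget` evenly spaced samples in sorted order, then
    pick the threshold k that best matches a step pattern (correct on the k easiest
    probed samples, wrong on the rest). The estimated accuracy is k / budget."""
    n = len(order)
    probe_positions = np.linspace(0, n - 1, budget).astype(int)
    probes = new_model_correct[order[probe_positions]]       # observed 0/1 outcomes
    best_k, best_cost = 0, np.inf
    for k in range(budget + 1):                   # brute-force threshold fit
        pred = np.concatenate([np.ones(k), np.zeros(budget - k)])
        cost = np.abs(pred - probes).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k / budget

# Toy data: each model answers a sample correctly iff its skill exceeds the
# sample's difficulty, so a single global ranking of samples exists by design.
rng = np.random.default_rng(0)
sample_difficulty = rng.random(10_000)
past_models = np.stack([(sample_difficulty < s).astype(int) for s in rng.random(50)])
new_model = (sample_difficulty < 0.65).astype(int)           # true accuracy ~ 0.65
order = sort_samples(past_models)
print(round(estimate_accuracy(new_model, order), 2))         # ~ 0.65
```

The estimate is only as good as the assumption of a single shared ordering; models whose error patterns deviate from that ordering incur exactly the irreducible error the paper attributes to Sort rather than Search.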

Experimental Results and Analysis

The proposed benchmarks and framework were empirically validated on approximately 31,000 models, delivering promising results in balancing evaluation accuracy and cost. The Sort & Search algorithm exhibited low approximation error, demonstrating its utility for highly efficient approximate accuracy estimation. Error decomposition analysis in the paper showed that much of the remaining error is irreducible under the current formulation, stemming not from sampling inefficiencies but from Sort's reliance on a single global ranking of samples across all models. Despite the large reduction in compute cost, the predicted model performances were highly consistent with ground truth, as evidenced by high correlation coefficients between estimated and actual accuracies.
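
As a rough illustration of the agreement checks described above (with placeholder numbers, not the paper's results), estimated and ground-truth accuracies can be compared via mean absolute error and Pearson correlation:

```python
import numpy as np

def agreement(estimated, actual):
    """Mean absolute error and Pearson correlation between two accuracy vectors."""
    estimated, actual = np.asarray(estimated), np.asarray(actual)
    mae = np.abs(estimated - actual).mean()
    pearson_r = np.corrcoef(estimated, actual)[0, 1]
    return mae, pearson_r

# Placeholder accuracies for a handful of models.
actual    = np.array([0.72, 0.81, 0.55, 0.90, 0.63])
estimated = np.array([0.70, 0.83, 0.57, 0.88, 0.61])
mae, r = agreement(estimated, actual)
print(f"MAE = {mae:.3f}, Pearson r = {r:.3f}")
```

The paper's caution about accuracy-prediction metrics applies here: aggregate agreement can look strong even when per-sample predictions are poor, which motivates its call for sample-level evaluation metrics.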

Implications and Future Directions

The implications of this research are profound, both practically and theoretically. Practically, it provides a scalable solution to the "benchmark exhaustion" problem, enabling continuous evaluation without an overwhelming resource burden. Theoretically, it suggests a shift in benchmark design philosophy towards expansive, diverse datasets that better reflect real-world use cases.

Looking forward, several promising research avenues emerge from this work. (1) Extending the current one-step evaluation process to a multi-step one may capture a wider range of model behaviors. (2) Given that the epistemic errors induced by a single systematic ranking were largely irreducible, future research could explore non-linear sample ranking structures to better accommodate diverse model characteristics. Finally, (3) automating hard-sample discovery and labelling could further enhance the robustness of lifelong benchmarks, ensuring they keep pace with evolving models and their complex failure modes.

In conclusion, this paper sets a clear trajectory away from traditional, static benchmarks towards more agile, representative, and efficient systems of evaluation, capturing the rapid evolution of models and the diversity inherent in the data with which these models interact. As AI continues to pervade various domains, such methodological innovations are critical for ensuring integrity, reliability, and generalizability in machine learning systems.
