AI Competitions and Benchmarks: Dataset Development (2404.09703v1)

Published 15 Apr 2024 in cs.LG and stat.ML

Abstract: Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even in today's digital era, where substantial data is generated daily, it is uncommon for it to be readily usable; most often, it necessitates meticulous manual data preparation. The haste in developing new models can frequently result in various shortcomings, potentially posing risks when deployed in real-world scenarios (e.g., social discrimination, critical failures), leading to the failure or substantial escalation of costs in AI-based projects. This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience, in the development of datasets for machine learning. Initially, we develop the tasks involved in dataset development and offer insights into their effective management (including requirements, design, implementation, evaluation, distribution, and maintenance). Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation. Finally, we address practical considerations regarding dataset distribution and maintenance.
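
As a rough illustration of the kind of data quality evaluation the abstract groups under the implementation step, here is a minimal Python sketch (not taken from the chapter; the toy records and field names are hypothetical) that checks a small tabular dataset for missing values, exact duplicate records, and class imbalance.

```python
# Minimal sketch of three common dataset quality checks: missing values,
# exact duplicates, and class balance. Toy data; field names are hypothetical.
from collections import Counter

rows = [
    {"age": 34, "income": 52000, "label": "approved"},
    {"age": None, "income": 48000, "label": "approved"},
    {"age": 29, "income": 61000, "label": "denied"},
    {"age": 34, "income": 52000, "label": "approved"},  # exact duplicate of the first row
]

# Count missing values per field.
missing = Counter(k for r in rows for k, v in r.items() if v is None)
print("missing values per field:", dict(missing))

# Count exact duplicate records.
seen, duplicates = set(), 0
for r in rows:
    key = tuple(sorted(r.items(), key=lambda kv: kv[0]))
    duplicates += key in seen
    seen.add(key)
print("duplicate rows:", duplicates)

# Inspect the distribution of the target label for class imbalance.
labels = Counter(r["label"] for r in rows)
print("label distribution:", dict(labels))
```

In practice such checks would run over the full dataset (e.g., with pandas or a validation framework) and feed into the quality evaluation and documentation steps the chapter describes; the sketch only shows the shape of the computation.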

Authors (10)
  1. Romain Egele
  2. Julio C. S. Jacques Junior
  3. Jan N. van Rijn
  4. Isabelle Guyon
  5. Albert Clapés
  6. Prasanna Balaprakash
  7. Sergio Escalera
  8. Thomas Moeslund
  9. Jun Wan
  10. Xavier Baró