Data Readiness for AI: A 360-Degree Survey (2404.05779v2)

Published 8 Apr 2024 in cs.LG and cs.AI

Abstract: AI applications critically depend on data. Poor-quality data produces inaccurate and ineffective AI models, which may lead to incorrect or unsafe use. Evaluating data readiness is therefore a crucial step in improving the quality and appropriateness of data used for AI. Considerable R&D effort has been spent on improving data quality; however, standardized metrics for evaluating data readiness for use in AI training are still evolving. In this study, we perform a comprehensive survey of metrics used to verify data readiness for AI training. The survey examines more than 140 papers from the ACM Digital Library, IEEE Xplore, journals published by Nature, Springer, and ScienceDirect, and online articles by prominent AI experts. Based on this survey, we propose a taxonomy of data readiness for AI (DRAI) metrics for structured and unstructured datasets. We anticipate that this taxonomy will lead to new standards for DRAI metrics that enhance the quality, accuracy, and fairness of AI training and inference.
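To make the idea of data-readiness metrics concrete, here is a minimal sketch (not taken from the paper) of three checks commonly applied to structured datasets before training: completeness, duplicate rate, and class-imbalance ratio. The function names, toy data, and thresholds are all hypothetical illustrations.

```python
# Illustrative data-readiness checks for a structured dataset.
# These are simplified stand-ins for the kinds of metrics the survey covers;
# real DRAI tooling would handle typed columns, outliers, bias, privacy, etc.
from collections import Counter

def completeness(rows):
    """Fraction of cells that are non-missing (missing encoded as None)."""
    total = sum(len(r) for r in rows)
    missing = sum(1 for r in rows for v in r if v is None)
    return 1 - missing / total

def duplicate_rate(rows):
    """Fraction of rows that exactly duplicate an earlier row."""
    seen, dups = set(), 0
    for r in rows:
        key = tuple(r)
        if key in seen:
            dups += 1
        seen.add(key)
    return dups / len(rows)

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy dataset: 4 rows, 2 columns, one missing cell, one duplicate row.
data = [[1.0, 2.0], [1.0, 2.0], [3.0, None], [4.0, 5.0]]
labels = ["a", "a", "a", "b"]

print(completeness(data))       # 0.875  (7 of 8 cells present)
print(duplicate_rate(data))     # 0.25   (1 of 4 rows is a duplicate)
print(imbalance_ratio(labels))  # 3.0    (3 "a" vs 1 "b")
```

A readiness report would compare such scores against task-specific thresholds (e.g. flagging an imbalance ratio above some cutoff) rather than treating any single number as a pass/fail verdict.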
