
HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques (2203.15753v4)

Published 29 Mar 2022 in cs.LG, cs.HC, and stat.ML

Abstract: Despite the tremendous advances in ML, training with imbalanced data still poses challenges in many real-world applications. Among a series of diverse techniques to solve this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness. This issue refers to the significance of managing unsafe or potentially noisy instances that are more likely to be misclassified and serve as the root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting types of instances based on local characteristics that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling/oversampling a specific class, we allow users to find and sample easy- and difficult-to-classify training instances from all classes. Users can explore subsets of data from different perspectives to decide all those parameters, while HardVis keeps track of their steps and evaluates the model's predictive performance on a separate test set. The end result is a well-balanced data set that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we also assess the usefulness of our system based on feedback received from ML experts.
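To make the sampling idea concrete, the sketch below shows the simplest form of oversampling: replicating minority-class instances at random until the classes are balanced. This is an illustrative baseline only, not the HardVis approach, which instead selects specific instance types (safe, borderline, rare, outlier) to undersample or oversample based on instance hardness; the function name and toy data are assumptions for the example.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Naively balance a data set by resampling each minority class
    with replacement until every class matches the largest one.
    (Illustrative only; HardVis applies far more selective,
    hardness-guided undersampling/oversampling.)"""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [], []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        # Draw the missing instances with replacement from this class.
        extra = rng.choice(idx, size=n_max - n, replace=True)
        keep = np.concatenate([idx, extra])
        X_parts.append(X[keep])
        y_parts.append(y[keep])
    return np.vstack(X_parts), np.concatenate(y_parts)

# Toy imbalanced set: 8 majority vs. 2 minority points.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X, y)
```

Uniform duplication like this is exactly what the paper argues against: it ignores which minority instances are safe versus noisy, which is why synthetic methods such as SMOTE and hardness-aware selection exist.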
