
DeforestVis: Behavior Analysis of Machine Learning Models with Surrogate Decision Stumps (2304.00133v5)

Published 31 Mar 2023 in cs.LG and cs.HC

Abstract: As the complexity of ML models increases and their application in different (and critical) domains grows, there is a strong demand for more interpretable and trustworthy ML. A direct, model-agnostic way to interpret such models is to train surrogate models, such as rule sets and decision trees, that sufficiently approximate the original ones while being simpler and easier to explain. Yet, rule sets can become very lengthy, with many if-else statements, and decision tree depth grows rapidly when accurately emulating complex ML models. In such cases, both approaches can fail to meet their core goal: providing users with model interpretability. To tackle this, we propose DeforestVis, a visual analytics tool that offers summarization of the behaviour of complex ML models by providing surrogate decision stumps (one-level decision trees) generated with the Adaptive Boosting (AdaBoost) technique. DeforestVis helps users to explore the complexity versus fidelity trade-off by incrementally generating more stumps, creating attribute-based explanations with weighted stumps to justify decision making, and analysing the impact of rule overriding on training instance allocation between one or more stumps. An independent test set allows users to monitor the effectiveness of manual rule changes and form hypotheses based on case-by-case analyses. We show the applicability and usefulness of DeforestVis with two use cases and expert interviews with data analysts and model developers.
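As a minimal sketch of the surrogate idea described in the abstract (not the authors' DeforestVis implementation), the snippet below trains a complex model, then fits an AdaBoost ensemble of one-level decision stumps on that model's predictions and reports fidelity, i.e., how often the surrogate agrees with the original model on a held-out test set, as more stumps are added. The dataset, the choice of random forest as the "complex" model, and the agreement-based fidelity metric are illustrative assumptions.

```python
# Illustrative sketch only: surrogate decision stumps via AdaBoost that
# approximate a complex model's behaviour (not the DeforestVis code).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "complex" model whose behaviour we want to summarize.
complex_model = RandomForestClassifier(n_estimators=300, random_state=0)
complex_model.fit(X_train, y_train)

# Train the surrogate on the complex model's predictions rather than the
# true labels, so it emulates the model's behaviour instead of the data.
pseudo_labels = complex_model.predict(X_train)

# Incrementally add stumps to expose the complexity-versus-fidelity trade-off.
# Note: the `estimator=` keyword requires scikit-learn >= 1.2
# (older versions use `base_estimator=`).
for n_stumps in (1, 5, 10, 25, 50, 100):
    surrogate = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # one-level decision stumps
        n_estimators=n_stumps,
        random_state=0,
    ).fit(X_train, pseudo_labels)
    fidelity = accuracy_score(complex_model.predict(X_test),
                              surrogate.predict(X_test))
    print(f"{n_stumps:3d} stumps -> test fidelity {fidelity:.3f}")
```

Each stump splits on a single attribute threshold, so the weighted collection of stumps yields the kind of attribute-based, per-rule explanation the paper builds its visual analysis around.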
