OpenHEXAI: An Open-Source Framework for Human-Centered Evaluation of Explainable Machine Learning (2403.05565v1)

Published 20 Feb 2024 in cs.HC and cs.AI

Abstract: Recently, there has been a surge of explainable AI (XAI) methods driven by the need to understand machine learning model behavior in high-stakes scenarios. However, properly evaluating the effectiveness of XAI methods inevitably requires the involvement of human subjects, and conducting human-centered benchmarks is challenging in a number of ways: designing and implementing user studies is complex; the numerous design choices in the user study design space lead to reproducibility problems; and running user studies can be challenging and even daunting for machine learning researchers. To address these challenges, this paper presents OpenHEXAI, an open-source framework for human-centered evaluation of XAI methods. OpenHEXAI features (1) a collection of diverse benchmark datasets, pre-trained models, and post hoc explanation methods; (2) an easy-to-use web application for user studies; (3) comprehensive evaluation metrics for the effectiveness of post hoc explanation methods in the context of human-AI decision making tasks; (4) best practice recommendations for experiment documentation; and (5) convenient tools for power analysis and cost estimation. OpenHEXAI is the first large-scale infrastructural effort to facilitate human-centered benchmarks of XAI methods. It simplifies the design and implementation of user studies for XAI methods, allowing researchers and practitioners to focus on the scientific questions. It also enhances reproducibility through standardized study designs. Based on OpenHEXAI, we further conduct a systematic benchmark of four state-of-the-art post hoc explanation methods, comparing their impacts on human-AI decision making tasks in terms of accuracy and fairness, as well as users' trust in and understanding of the machine learning model.
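
As a rough illustration of item (5), the sketch below estimates the sample size and recruiting cost for a two-condition user study. This is a minimal sketch, not OpenHEXAI's actual API: the statsmodels power calculation, the effect size, and the per-participant payment rate are assumptions introduced here purely for illustration.

```python
# Hypothetical power analysis and cost estimation for a two-condition
# XAI user study. Uses statsmodels' standard two-sample t-test power
# calculation; all numeric values below are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

effect_size = 0.4   # assumed standardized difference in decision accuracy
alpha = 0.05        # significance level
power = 0.8         # desired statistical power

# Participants needed per condition for a two-sided, two-sample comparison.
analysis = TTestIndPower()
n_per_condition = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative="two-sided",
)

# Cost estimate: two conditions at an assumed crowdsourcing payment rate.
pay_per_participant = 3.50  # USD, hypothetical rate
n_total = 2 * int(round(n_per_condition))
print(f"Participants needed: {n_total}")
print(f"Estimated recruiting cost: ${n_total * pay_per_participant:.2f}")
```

With these assumed numbers, the calculation suggests roughly 100 participants per condition; larger expected effect sizes or lower desired power would shrink both the sample size and the cost estimate.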

Authors (8)
  1. Jiaqi Ma (82 papers)
  2. Vivian Lai (28 papers)
  3. Yiming Zhang (128 papers)
  4. Chacha Chen (17 papers)
  5. Paul Hamilton (17 papers)
  6. Davor Ljubenkov (1 paper)
  7. Himabindu Lakkaraju (88 papers)
  8. Chenhao Tan (89 papers)