Quality Assurance for Artificial Intelligence: A Study of Industrial Concerns, Challenges and Best Practices (2402.16391v1)
Abstract: Quality Assurance (QA) aims to prevent mistakes and defects in manufactured products and to avoid problems when delivering products or services to customers. QA for AI systems, however, poses particular challenges, given their data-driven and non-deterministic nature as well as their more complex architectures and algorithms. While there is growing empirical evidence about machine learning practices in industrial contexts, little is known about the challenges and best practices of quality assurance for AI systems (QA4AI). In this paper, we report on a mixed-method study of QA4AI in industry practice, spanning multiple countries and companies. Through interviews with 15 industry practitioners and a validation survey with 50 practitioner responses, we studied the concerns, challenges, and best practices in ensuring the QA4AI properties reported in the literature, such as correctness, fairness, and interpretability. Our findings suggest that correctness is the most important property, followed by model relevance, efficiency, and deployability. In contrast, transferability (applying knowledge learned in one task to another task), security, and fairness receive comparatively little attention from practitioners. We identify challenges and solutions for each QA4AI property. For example, interviewees highlighted the trade-off among latency, cost, and accuracy as a key challenge for efficiency (latency and cost being facets of the efficiency concern), and proposed solutions such as model compression. We identified 21 QA4AI practices across the stages of AI development, 10 of which were well recognized and another 8 marginally agreed upon by the survey respondents.
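To illustrate the kind of model-compression solution interviewees proposed for the latency/cost/accuracy trade-off, the sketch below shows post-training weight quantization in its simplest form: mapping float32 weights to int8 with a per-tensor scale, trading a small accuracy loss for a 4x reduction in storage and memory traffic. This is a minimal, hypothetical sketch for exposition (symmetric quantization over a random weight matrix), not the method of any system studied in the paper; production toolchains such as TensorRT apply far more sophisticated calibration.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: weights ~ scale * q."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 codes."""
    return q.astype(np.float32) * scale

# Hypothetical "model weights": a random 256x256 float32 matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
ratio = w.nbytes / q.nbytes                      # 4 bytes -> 1 byte per weight
max_err = float(np.abs(w - dequantize(q, scale)).max())
print(f"compression: {ratio:.0f}x, max abs error: {max_err:.4f}")
```

The worst-case rounding error is bounded by `scale / 2`, which makes the accuracy cost of the compression easy to reason about per tensor; this is the essence of the efficiency trade-off the interviewees described.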