
What About the Data? A Mapping Study on Data Engineering for AI Systems (2402.05156v1)

Published 7 Feb 2024 in cs.DL, cs.AI, and cs.DB

Abstract: AI systems cannot exist without data. Now that AI models (data science and AI) have matured and are readily available to apply in practice, most organizations struggle with the data infrastructure needed to do so. There is a growing need for data engineers who know how to prepare data for AI systems or who can set up enterprise-wide data architectures for analytical projects. But until now, the data engineering part of AI engineering has received little attention compared with the modeling part. In this paper we aim to change this by performing a mapping study on data engineering for AI systems, i.e., AI data engineering. We found 25 relevant papers published between January 2019 and June 2023 that explain AI data engineering activities. We identify which life cycle phases are covered, which technical solutions or architectures are proposed, and which lessons learned are presented. We end with an overall discussion of the papers and their implications for practitioners and researchers. This paper creates an overview of the body of knowledge on data engineering for AI. This overview is useful for practitioners to identify solutions and best practices, as well as for researchers to identify gaps.

Authors (1)
  1. Petra Heck
