DataCI: A Platform for Data-Centric AI on Streaming Data (2306.15538v2)
Abstract: We introduce DataCI, a comprehensive open-source platform designed specifically for data-centric AI in dynamic streaming data settings. DataCI provides 1) an infrastructure with rich APIs for seamless streaming dataset management and for data-centric pipeline development and evaluation in streaming scenarios, 2) a carefully designed version control function to track pipeline lineage, and 3) an intuitive graphical interface for a better interactive user experience. Preliminary studies and demonstrations attest to the ease of use and effectiveness of DataCI, highlighting its potential to revolutionize the practice of data-centric AI in streaming data contexts.