
Dataset Factory: A Toolchain For Generative Computer Vision Datasets (2309.11608v1)

Published 20 Sep 2023 in cs.AI

Abstract: Generative AI workflows rely heavily on data-centric tasks, such as filtering samples by annotation fields, vector distances, or scores produced by custom classifiers. At the same time, computer vision datasets are quickly approaching petabyte volumes, rendering data wrangling difficult. In addition, the iterative nature of data preparation necessitates robust dataset sharing and versioning mechanisms, both of which are hard to implement ad hoc. To solve these challenges, we propose a "dataset factory" approach that separates the storage and processing of samples from metadata and enables data-centric operations at scale for machine learning teams and individual researchers.
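
The three kinds of filtering named in the abstract can be illustrated with a minimal sketch. This is not the paper's toolchain or API: the `SampleMeta` record, the `select` helper, the `s3://` paths, and all values are hypothetical, chosen only to show how a query can run over lightweight metadata while the image bytes stay in storage and are referenced by URI.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class SampleMeta:
    """Hypothetical metadata row; the image itself stays in object storage."""
    uri: str           # pointer to the sample (illustrative path, not a real bucket)
    label: str         # annotation field
    embedding: tuple   # vector used for distance queries
    score: float       # output of some custom classifier


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select(records, label, query, max_distance, min_score):
    """Filter on annotation, vector distance, and classifier score.

    Only metadata is read; the result is a list of URIs to fetch later.
    """
    return [
        r.uri
        for r in records
        if r.label == label
        and cosine_distance(r.embedding, query) <= max_distance
        and r.score >= min_score
    ]


# Illustrative metadata table (all values made up for the example).
RECORDS = [
    SampleMeta("s3://bucket/img/0001.jpg", "cat", (1.0, 0.0), 0.90),
    SampleMeta("s3://bucket/img/0002.jpg", "cat", (0.0, 1.0), 0.95),
    SampleMeta("s3://bucket/img/0003.jpg", "dog", (1.0, 0.0), 0.99),
    SampleMeta("s3://bucket/img/0004.jpg", "cat", (0.9, 0.1), 0.40),
]

selected = select(RECORDS, label="cat", query=(1.0, 0.0),
                  max_distance=0.2, min_score=0.5)
print(selected)
```

Because the query touches only the metadata rows, the selection can be recomputed or stored as a new dataset version without copying any image data, which is the kind of separation the abstract describes.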
