
Model-Based Data-Centric AI: Bridging the Divide Between Academic Ideals and Industrial Pragmatism (2403.01832v1)

Published 4 Mar 2024 in cs.AI and cs.CL

Abstract: This paper delves into the contrasting roles of data within academic and industrial spheres, highlighting the divergence between Data-Centric AI and Model-Agnostic AI approaches. We argue that while Data-Centric AI focuses on the primacy of high-quality data for model performance, Model-Agnostic AI prioritizes algorithmic flexibility, often at the expense of data quality considerations. This distinction reveals that academic standards for data quality frequently do not meet the rigorous demands of industrial applications, leading to potential pitfalls in deploying academic models in real-world settings. Through a comprehensive analysis, we address these disparities, presenting both the challenges they pose and strategies for bridging the gap. Furthermore, we propose a novel paradigm: Model-Based Data-Centric AI, which aims to reconcile these differences by integrating model considerations into data optimization processes. This approach underscores the necessity for evolving data requirements that are sensitive to the nuances of both academic research and industrial deployment. By exploring these discrepancies, we aim to foster a more nuanced understanding of data's role in AI development and encourage a convergence of academic and industrial standards to enhance AI's real-world applicability.
