
The State of Documentation Practices of Third-party Machine Learning Models and Datasets (2312.15058v1)

Published 22 Dec 2023 in cs.SE and cs.LG

Abstract: Model stores offer third-party ML models and datasets for easy project integration, minimizing coding efforts. One might hope to find detailed specifications of these models and datasets in the documentation, leveraging documentation standards such as model and dataset cards. In this study, we use statistical analysis and hybrid card sorting to assess the state of the practice of documenting model cards and dataset cards in one of the largest model stores in use today, Hugging Face (HF). Our findings show that only 21,902 models (39.62%) and 1,925 datasets (28.48%) have documentation. Furthermore, we observe inconsistency in ethics and transparency-related documentation for ML models and datasets.

Authors (5)
  1. Ernesto Lang Oreamuno (1 paper)
  2. Rohan Faiyaz Khan (1 paper)
  3. Abdul Ali Bangash (7 papers)
  4. Catherine Stinson (6 papers)
  5. Bram Adams (47 papers)