A Flexible Cell Classification for ML Projects in Jupyter Notebooks (2403.07562v1)
Abstract: Jupyter Notebook is an interactive development environment commonly used for rapid experimentation with ML solutions. Describing the ML activities performed in code cells improves the readability and understanding of notebooks. Because manual annotation of code cells is time-consuming and error-prone, tools have been developed that classify the cells of a notebook with respect to the ML activity performed in them. However, current tools are inflexible: they rely on manually created look-up tables that map function calls of commonly used ML libraries to ML activities, and these tables must be adjusted by hand to account for new or changed libraries. This paper presents a more flexible approach to cell classification based on a hybrid scheme that combines a rule-based classifier with a decision tree classifier. We discuss the design rationales and describe the developed classifiers in detail. We implemented the new flexible cell classification approach in a tool called JupyLabel and discuss its evaluation and the obtained precision, recall, and F1 scores. Additionally, we compared JupyLabel with HeaderGen, an existing cell classification tool, and show that the presented flexible cell classification approach significantly outperforms it.
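The abstract only names the two stages of the hybrid scheme. As a concrete illustration, the following is a minimal Python sketch of such a classifier: a rule-based first stage labels cells matched by hand-written keyword rules and defers everything else to a decision tree trained on bag-of-words features. The rule table, label names, and feature extraction here are illustrative assumptions for the sketch, not the actual rules or features used by JupyLabel.

```python
# Minimal sketch of a hybrid cell classifier: a rule-based stage with a
# decision-tree fallback. Labels and rules below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical keyword rules for cells whose activity is unambiguous.
# Substring matching is deliberately crude; it only illustrates the idea.
RULES = {
    "import": "setup",
    "read_csv": "data_loading",
    ".fit(": "training",
    ".predict(": "evaluation",
}

def rule_based_label(cell_source: str):
    """Return a label if a rule fires, else None (defer to the tree)."""
    for keyword, label in RULES.items():
        if keyword in cell_source:
            return label
    return None

class HybridCellClassifier:
    def __init__(self):
        # Bag-of-words features over identifier-like tokens.
        self.vectorizer = CountVectorizer(token_pattern=r"[A-Za-z_]+")
        self.tree = DecisionTreeClassifier(random_state=0)

    def fit(self, cells, labels):
        # The tree is trained on all labeled cells, including those the
        # rule stage would catch, so it can generalize beyond the rules.
        X = self.vectorizer.fit_transform(cells)
        self.tree.fit(X, labels)
        return self

    def predict(self, cells):
        predictions = []
        for cell in cells:
            label = rule_based_label(cell)  # rule stage first
            if label is None:               # fall back to the learned model
                X = self.vectorizer.transform([cell])
                label = self.tree.predict(X)[0]
            predictions.append(label)
        return predictions

# Toy usage with made-up cells and labels:
cells = [
    "import pandas as pd",
    "df = pd.read_csv('train.csv')",
    "model.fit(X_train, y_train)",
    "df.describe()",
]
labels = ["setup", "data_loading", "training", "exploration"]
clf = HybridCellClassifier().fit(cells, labels)
print(clf.predict(["df.head()"]))  # no rule fires, so the tree decides
```

In this division of labor, the rules act as a high-precision filter for common library calls, while the decision tree handles cells that no rule covers, which is one way to avoid maintaining an exhaustive look-up table by hand.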