When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors (2211.05920v2)

Published 10 Nov 2022 in cs.SE and cs.LG

Abstract: Labeling a module defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects, and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, co-training methods make predictions that are competitive with those using 100% of the data. That said, co-training must be used cautiously: the specific co-training method should be carefully selected according to the user's goals. Also, we warn that a commonly used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions while adding substantially to run time (11 hours vs. 1.8 hours). It is an open question, worthy of future work, whether these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used is available at https://github.com/ai-se/Semi-Supervised.
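
For readers new to the technique, the sketch below illustrates the core co-training loop the abstract alludes to. It is a minimal sketch, not the paper's implementation (that lives at the GitHub link above): the synthetic dataset, learner pair, round count, and one-example-per-round growth rate are all illustrative assumptions.

```python
# Minimal single-view co-training sketch (in the spirit of Blum & Mitchell, 1998).
# NOT the paper's code; all design choices here are arbitrary demonstration values.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Mirror the paper's setting: keep labels for only 2.5% of the data.
rng = np.random.default_rng(1)
order = rng.permutation(len(y))
n_seed = int(0.025 * len(y))
labeled = list(order[:n_seed])
unlabeled = list(order[n_seed:])
pseudo = y.copy()  # only pseudo[labeled] is ever read, so no true labels leak

# Two *different* learners trained on the SAME columns. The "multi-view"
# variant the paper warns against would instead give each learner a
# different subset of the columns.
clf_a, clf_b = LogisticRegression(max_iter=1000), GaussianNB()

for _ in range(20):  # co-training rounds
    clf_a.fit(X[labeled], pseudo[labeled])
    clf_b.fit(X[labeled], pseudo[labeled])
    for clf in (clf_a, clf_b):
        # Each learner pseudo-labels the unlabeled point it is most confident
        # about; sharing one labeled pool (rather than each learner teaching
        # only the other) is a common simplification.
        proba = clf.predict_proba(X[unlabeled])
        best = int(np.argmax(proba.max(axis=1)))
        picked = unlabeled.pop(best)
        pseudo[picked] = clf.classes_[int(np.argmax(proba[best]))]
        labeled.append(picked)

# The remaining unlabeled points' true labels were never used above, so they
# give a fair check of how well a few seed labels plus pseudo-labels generalize.
print("accuracy on never-labeled points:",
      accuracy_score(y[unlabeled], clf_a.predict(X[unlabeled])))
```

Splitting the feature columns between the two learners would turn this into the "multi-view" variant that, per the abstract, adds run time without improving predictions.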

Authors (3)
  1. Suvodeep Majumder (11 papers)
  2. Joymallya Chakraborty (12 papers)
  3. Tim Menzies (128 papers)