Predicting Louisiana Public High School Dropout through Imbalanced Learning Techniques (1910.13018v1)

Published 29 Oct 2019 in cs.LG and stat.ML

Abstract: This study is motivated by the magnitude of the problem of Louisiana high school dropout and its negative impacts on individual and public well-being. Our goal is to predict students who are at risk of high school dropout, by examining Louisiana administrative dataset. Due to the imbalanced nature of the dataset, imbalanced learning techniques including resampling, case weighting, and cost-sensitive learning have been applied to enhance the prediction performance on the rare class. Performance metrics used in this study are F-measure, recall and precision of the rare class. We compare the performance of several machine learning algorithms such as neural networks, decision trees and bagging trees in combination with the imbalanced learning approaches using an administrative dataset of size of 366k+ from Louisiana Department of Education. Experiments show that application of imbalanced learning methods produces good results on recall but decreases precision, whereas base classifiers without regard of imbalanced data handling gives better precision but poor recall. Overall application of imbalanced learning techniques is beneficial, yet more studies are desired to improve precision.

Authors

Marmar Orooji
Jianhua Chen


Summary

Enhanced Predictive Models for High School Dropout using Imbalanced Learning Techniques

Introduction to Dropout Prediction Challenge

Predicting high school dropout remains a crucial concern for educational institutions, particularly in regions like Louisiana, which has historically reported higher dropout rates. Preventive measures hinge on identifying at-risk students early enough to intervene. However, traditional machine learning models face a significant barrier here because dropout datasets are inherently imbalanced: students who continue their education vastly outnumber those who drop out. This paper takes on the challenge of improving prediction performance on this minority class by applying various imbalanced learning techniques.

Dataset and Preprocessing

The research used an extensive administrative dataset provided by the Louisiana Department of Education, comprising over 366k student records spanning 1999-2000 to 2011-2012. The dataset includes attributes such as enrollment details, disciplinary actions, and demographics, along with the binary target variable indicating dropout. The class distribution is heavily skewed, with only 4% of instances representing dropouts. Preprocessing integrated and aggregated the student records to produce a clean, comprehensive dataset for modeling.

Methodological Approach

The research explored multiple imbalanced learning strategies to enhance the minority class prediction. Notably:

Resampling Techniques: These methods balance the class distribution before training, including Random Down Sampling, Random Up Sampling, and the Synthetic Minority Over-sampling Technique (SMOTE), aiming to provide balanced input data for models.
Case Weighting and Cost-Sensitive Learning: Both approaches assign varying weights or costs to misclassifying different classes, emphasizing correct predictions for the minority class to foster better learning from imbalanced datasets.
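
The three resampling ideas listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' R implementation: the function names, the toy feature tuples, and the 96/4 class split (chosen to mirror the 4% dropout rate) are all hypothetical, and smote_like is a simplified version of SMOTE's interpolation step that assumes numeric features and at least two minority rows.

```python
import random

def random_downsample(majority, minority, seed=0):
    """Keep a random subset of the majority class, equal in size to the minority."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def random_upsample(majority, minority, seed=0):
    """Duplicate minority rows (sampling with replacement) to match the majority."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

def smote_like(minority, n_new, k=3, seed=0):
    """Create synthetic minority rows by interpolating toward a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base` (squared Euclidean distance).
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Hypothetical toy data: 96 non-dropout rows and 4 dropout rows.
majority = [(float(i), float(i % 7)) for i in range(96)]
minority = [(100.0, 1.0), (101.0, 2.0), (102.0, 1.5), (103.0, 0.5)]

down_major, down_minor = random_downsample(majority, minority)
up_major, up_minor = random_upsample(majority, minority)
synthetic = smote_like(minority, n_new=92)
```

Down sampling discards majority information, up sampling repeats the same few minority rows, and the SMOTE-style step adds new points that lie between existing minority rows. In all three cases only the training data would be resampled; evaluation data should keep its original class distribution.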

Three machine learning algorithms were tested with these imbalanced learning techniques: Neural Networks, Decision Trees, and Bagging Trees, utilizing the R programming environment for implementation.
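
The case-weighting idea described above can be illustrated with inverse-frequency class weights. The summary does not state which weights or costs the authors used, so the weighting rule below (a common "balanced" heuristic), the function names, and the toy labels are assumptions made for illustration only.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n / (k * n_c) so every class carries equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

def weighted_log_loss(labels, probs, weights):
    """Weighted mean negative log-likelihood; probs are P(label == 1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total += -weights[y] * (math.log(p) if y == 1 else math.log(1 - p))
    return total / sum(weights[y] for y in labels)

# Hypothetical labels with a 4% positive (dropout) rate.
labels = [1] * 4 + [0] * 96
weights = balanced_class_weights(labels)

# A model that outputs the base rate for every student.
base_rate_probs = [0.04] * len(labels)
plain_loss = weighted_log_loss(labels, base_rate_probs, {0: 1.0, 1: 1.0})
weighted_loss = weighted_log_loss(labels, base_rate_probs, weights)
```

With these weights a missed dropout contributes far more to the training loss than a misclassified non-dropout, so a model that simply predicts the base rate for everyone is penalized much more heavily than under the unweighted loss.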

Evaluation Metrics and Classifier Performance

The paper evaluated classifiers by the precision, recall, and F-measure of the rare (dropout) class, with an emphasis on recall, because in dropout prediction a false negative (a missed at-risk student) is costlier than a false positive. The experiments showed that:

Imbalanced learning methods significantly improved recall for the positive (dropout) class across all tested algorithms, though at the expense of precision.
Cost-Sensitive Learning and Case Weighting approaches were particularly effective, significantly outperforming standard classifiers on imbalanced data.
Among standard classifiers trained without consideration for data imbalance, ensemble methods like Bagging Trees showed promise in detecting the minority class.
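
The rare-class metrics named at the start of this section can be computed as follows. This is a small sketch on hypothetical predictions, not the paper's results; it also shows why plain accuracy is uninformative when only 4% of cases are positive.

```python
def rare_class_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, F-measure) for the positive (rare) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical ground truth: 4 dropouts followed by 96 non-dropouts.
y_true = [1] * 4 + [0] * 96

# Baseline that never predicts dropout.
always_negative = [0] * 100

# Hypothetical model: finds 3 of 4 dropouts but raises 9 false alarms.
flagging_model = [1, 1, 1, 0] + [1] * 9 + [0] * 87

baseline_acc = accuracy(y_true, always_negative)
baseline_prf = rare_class_metrics(y_true, always_negative)
model_acc = accuracy(y_true, flagging_model)
model_prf = rare_class_metrics(y_true, flagging_model)
```

In this toy case the never-predict-dropout baseline has higher accuracy than the flagging model while its recall on dropouts is zero, which is why the rare-class precision, recall, and F-measure are the relevant metrics.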

Practical Implications and Future Research

This research advances the field of dropout prediction by demonstrating the effectiveness of imbalanced learning techniques in handling skewed datasets, which is critical for educational institutions aiming to identify and support at-risk students. The paper acknowledges the trade-off between recall and precision but argues that the gain in recall is worth the loss in precision, given the high cost of failing to identify potential dropouts.
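
One standard way to make this cost argument concrete is a cost-based decision threshold. The sketch below is an illustration under assumptions, not a procedure taken from the paper: it assumes the classifier outputs calibrated dropout probabilities, that correct predictions cost nothing, and that the 1:9 cost ratio and the probabilities are hypothetical. Under those assumptions, flagging a student is cheaper in expectation whenever p >= C_FP / (C_FP + C_FN).

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which flagging a student has lower expected cost.

    Flagging costs (1 - p) * cost_fp in expectation; not flagging costs
    p * cost_fn. Flagging is cheaper when p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

def flag_students(probs, cost_fp, cost_fn):
    """Return indices of students whose dropout probability meets the threshold."""
    threshold = cost_threshold(cost_fp, cost_fn)
    return [i for i, p in enumerate(probs) if p >= threshold]

# Hypothetical predicted dropout probabilities for three students.
probs = [0.05, 0.2, 0.6]

equal_cost_flags = flag_students(probs, cost_fp=1, cost_fn=1)   # threshold 0.5
recall_first_flags = flag_students(probs, cost_fp=1, cost_fn=9)  # threshold 0.1
```

Raising the cost of a missed dropout lowers the threshold, so more students are flagged: recall goes up and precision tends to go down, which is the trade-off described above.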

Future studies are encouraged to explore hybrid models and innovative resampling techniques further to optimize both precision and recall. Additionally, incorporating longitudinal data analysis could offer insights into temporal patterns of dropout risks, enhancing predictive performance and facilitating more targeted interventions.

Conclusion

In summary, this paper underscores the potential of imbalanced learning techniques to improve how well high school dropout models identify the rare dropout class. By adopting these methods, educational stakeholders can better identify at-risk students, enabling more effective, timely interventions to reduce dropout rates and their associated negative societal impacts.
