An Analytical Perspective on Feature Selection Techniques in Educational Data Mining
The paper "A Study on Feature Selection Techniques in Educational Data Mining" by M. Ramaswami and R. Bhaskaran explores the application of feature selection mechanisms within the educational data mining (EDM) framework. Its focus is particularly on enhancing the predictive accuracy of student performance models through different feature selection methodologies.
Educational data mining is a specialized domain that applies data mining techniques to extract and analyze complex educational data. Its applications include predicting student characteristics and academic performance, and identifying factors that inhibit academic success. Such predictions help educational institutions tailor interventions for students based on their performance data.
Examination of Feature Selection Approaches
Feature selection is indispensable in EDM because the choice of predictor features significantly influences the predictive accuracy of the resulting models. The paper investigates six filter-based feature selection algorithms: Correlation-based (CB), Chi-Square (CH), Gain-Ratio (GR), Information-Gain (IG), Relief (RF), and Symmetrical Uncertainty (SU). These algorithms assess the relevance of predictor variables from inherent properties of the data. The paper also underscores the importance of an optimal feature-subset dimensionality that balances computational efficiency against predictive accuracy.
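To make the filter-based ranking concrete, the following minimal sketch ranks features with two scikit-learn criteria as stand-ins for the paper's CH and IG filters (mutual information approximates information gain). The function name, the arguments X, y, and feature_names, and the choice of library are illustrative assumptions, not details taken from the original study.

```python
# Minimal sketch of filter-based feature ranking, assuming a non-negative
# feature matrix X (e.g. one-hot encoded survey responses) and a binary
# target y. chi2 stands in for the Chi-Square (CH) filter; mutual
# information is used as a proxy for the Information-Gain (IG) filter.
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

def rank_features(X, y, feature_names, k=7):
    chi2_scores, _ = chi2(X, y)                             # CH relevance scores
    ig_scores = mutual_info_classif(X, y, random_state=0)   # IG-style relevance scores

    # Keep the top-k features under each criterion.
    chi2_top = [feature_names[i] for i in np.argsort(chi2_scores)[::-1][:k]]
    ig_top = [feature_names[i] for i in np.argsort(ig_scores)[::-1][:k]]
    return {"CH": chi2_top, "IG": ig_top}
```

Each filter scores features independently of any classifier, which is what makes these methods computationally cheap on high-dimensional survey data.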
The data set comprises responses from 1969 higher secondary students across Tamil Nadu, India, covering demographic, socio-economic, academic, and environmental characteristics. The analysis begins with 32 predictive variables and a binary response variable indicating the students' academic success.
Evaluation Metrics and Outcomes
The robustness of the feature selection methods is evaluated using ROC and F1-measure values, with NaiveBayes as the baseline classifier. The IG technique with seven features achieves a notable ROC value of 0.729, signifying its suitability for the high-dimensional student data set. The CH, IG, and SU techniques also exhibit strong F1-measure values, further corroborating their efficacy in feature selection.
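A hedged sketch of this evaluation step is shown below: a candidate feature subset is scored with a Naive Bayes baseline, reporting mean ROC area and F1-measure. The cross-validation setup, the GaussianNB variant, and the names X_subset and y are assumptions for illustration rather than the paper's exact protocol.

```python
# Sketch: evaluate a selected feature subset with a Naive Bayes baseline,
# reporting mean ROC area and F1 over 10-fold cross-validation (the fold
# count is an illustrative choice, not taken from the original study).
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_validate

def evaluate_subset(X_subset, y):
    scores = cross_validate(
        GaussianNB(), X_subset, y, cv=10,
        scoring={"roc_auc": "roc_auc", "f1": "f1"},
    )
    return scores["test_roc_auc"].mean(), scores["test_f1"].mean()
```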
Classifier Benchmarking and Discussion
Multiple classifiers, including NaiveBayes, Voted Perceptron, OneR, and PART, were employed to test the feature subsets derived from the different algorithms. The IG-7 subset (Information Gain with seven features) consistently demonstrated superior predictive performance across models, with predictive accuracies exceeding 89% for certain classifiers. This result highlights the value of reduced dimensionality: accuracy is retained without burdening models with superfluous features.
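The benchmarking loop can be sketched as follows. The scikit-learn estimators here are approximate stand-ins for the classifiers named in the paper (a depth-1 decision tree for OneR, an unconstrained tree for the rule learner PART); the original work used different implementations, so this is a sketch of the comparison procedure, not of the authors' setup.

```python
# Rough benchmarking sketch over a reduced feature subset such as IG-7,
# comparing mean cross-validated accuracy across several classifiers.
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def benchmark(X_ig7, y):
    models = {
        "NaiveBayes": GaussianNB(),
        "Perceptron": Perceptron(),                        # stand-in for Voted Perceptron
        "OneR-like": DecisionTreeClassifier(max_depth=1),  # single-attribute rule
        "PART-like": DecisionTreeClassifier(),             # rule/tree learner stand-in
    }
    return {name: cross_val_score(model, X_ig7, y, cv=10).mean()
            for name, model in models.items()}
```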
Implications and Potential for Future Research
The comparative analysis in this paper establishes a systematic approach to feature selection in EDM, offering insights into algorithm selection based on dataset characteristics. It underscores the need for balanced dimensionality: capturing the essential attributes while avoiding redundancy. These findings are instrumental for optimizing computational resources and refining classification outcomes in practical educational settings.
For future research, further exploration of advanced feature selection techniques could enhance understanding of feature interactions within high-dimensional educational data. Additionally, incorporating new classifier algorithms and constructing more intricate prediction models can further underscore the role of feature selection in learning analytics and student performance forecasting.
In conclusion, the paper contributes to the EDM field by providing a detailed critique of feature selection methods, which are integral to predictive model construction. These insights are valuable for academic researchers seeking effective methodologies for data preprocessing and classifier accuracy enhancement in educational contexts.