- The paper demonstrates a novel GA+LSTM model that improved dropout prediction accuracy to 97.65% on the ARQ dataset.
- The paper employs BERT-based semantic analysis to measure course similarities and identify prerequisite relationships in university curricula.
- The research offers practical applications for HEIs, including early intervention, improved curriculum design, and enhanced student advising.
This paper explores the use of AI and Semantic Technologies to improve student academic performance, focusing on two main areas: predicting student dropout and analyzing university curricula.
Problem & Motivation:
High student dropout rates are a significant concern for Higher Education Institutions (HEIs) and for students themselves. Curriculum design, including course prerequisites and sequencing, also plays a vital role in student success and retention. Prior work made only limited use of deep learning for dropout prediction, and existing curriculum-analysis methods struggled to measure semantic similarity between courses accurately.
Objectives:
The research aimed to:
- Predict student dropout using grades from previous semesters.
- Model semantic representations of courses and compute similarity between them.
- Identify prerequisite relationships (sequences) between similar courses.
Methodology:
Three main implementations were developed:
- Dropout Prediction:
  - Data: Academic records (grades, status, course information, etc.) of 5,582 students across three degree programs, Information Systems (CSI), Management (ADM), and Architecture (ARQ), at a Brazilian university (248,730 records, 2001-2009).
  - Preprocessing: Data cleaning, handling missing values (using Random Forest imputation), converting categorical data, normalization (z-score), and addressing class imbalance (using SMOTE for the CSI dataset).
  - Feature Selection: A Genetic Algorithm (GA) with Support Vector Machine (SVM) fitness evaluation was used to select the most relevant features from the initial 27.
  - Prediction Model: A Long Short-Term Memory (LSTM) network, followed by a 3-layer Fully Connected (FC) network, was trained on the selected features represented as time series data (32 time steps per student, representing 8 semesters). Mean Squared Error (MSE) was used as the loss function, with the Adam optimizer.
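The feature-selection step can be illustrated with a minimal genetic algorithm over binary feature masks. This is a sketch under stated assumptions, not the paper's implementation: the operators (tournament selection, single-point crossover, bit-flip mutation, elitism), all parameter values, and the names `ga_select` and `toy_fitness` are hypothetical. The summary says fitness was evaluated with an SVM; to stay runnable with the standard library alone, the sketch plugs in a toy fitness function where an SVM score on the masked features would go.

```python
import random


def ga_select(n_features, fitness, pop_size=20, generations=30,
              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximizes `fitness(mask)`.

    Returns (best_mask, best_score). Requires n_features >= 2 and pop_size >= 2.
    """
    rng = random.Random(seed)
    # Random initial population: one 0/1 flag per feature.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_mask, best_score = None, float("-inf")

    for _ in range(generations):
        scores = [fitness(mask) for mask in population]
        # Remember the best mask seen so far.
        for mask, score in zip(population, scores):
            if score > best_score:
                best_mask, best_score = mask[:], score

        def tournament():
            # Pick two distinct individuals and keep the fitter one.
            a, b = rng.sample(range(pop_size), 2)
            return population[a] if scores[a] >= scores[b] else population[b]

        next_population = [best_mask[:]]  # elitism: keep the best mask
        while len(next_population) < pop_size:
            parent1, parent2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # single-point crossover
                child = parent1[:cut] + parent2[cut:]
            else:
                child = parent1[:]
            # Bit-flip mutation: toggle each flag with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population

    # Score the final generation as well.
    for mask in population:
        score = fitness(mask)
        if score > best_score:
            best_mask, best_score = mask[:], score
    return best_mask, best_score


# Assumption: toy stand-in for the SVM-based fitness. It rewards three
# hypothetical "informative" features and penalizes mask size.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


mask, score = ga_select(27, toy_fitness)
selected = [i for i, flag in enumerate(mask) if flag]
print(selected, score)
```

To move closer to the described setup, `toy_fitness` would be replaced by a function that trains an SVM on the columns where the mask is 1 and returns a validation score. How that score is computed (metric, cross-validation scheme) is not specified in this summary.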
- Course Similarity Measurement:
  - Data: Course descriptions from the Australian National University (ANU) Computer Science ("COMP") program website.
  - Encoding: Bidirectional Encoder Representations from Transformers (BERT) was used to generate contextual sentence embeddings (vectors) for each sentence in the course descriptions.
  - Similarity Calculation: Cosine similarity was calculated between the sentence vectors. The overall similarity between two courses was determined by averaging the similarity scores across their sentences.
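The similarity calculation can be sketched in a few lines. The example below assumes that "averaging across sentences" means averaging over every pair of sentences from the two courses, which the summary does not state explicitly. The 3-dimensional vectors are made-up stand-ins for BERT sentence embeddings, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def course_similarity(sentences_a, sentences_b):
    """Average cosine similarity over all sentence pairs of two courses."""
    scores = [cosine(u, v) for u in sentences_a for v in sentences_b]
    return sum(scores) / len(scores)


# Toy sentence vectors (assumption: real embeddings would come from BERT).
course_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
course_b = [[1.0, 0.0, 0.0]]

similarity = course_similarity(course_a, course_b)
print(similarity)  # one identical pair (1.0) and one orthogonal pair (0.0) -> 0.5
```

Averaging over all pairs makes the score symmetric in the two courses. Other aggregation choices, such as taking the best match per sentence, would give different values.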
- Prerequisite Identification:
  - Data: The same ANU course descriptions.
  - Concept Extraction: The TextRazor API was used to extract key concepts (entities) from the course descriptions.
  - Prerequisite Measurement: Semantic Reference Distance (SemRefD), an extension of Reference Distance (RefD), was employed. SemRefD measures the prerequisite dependency between two concepts by querying the DBpedia knowledge graph, taking into account semantic properties and paths between the concepts. The sum of SemRefD scores over the concepts extracted from two courses indicates the overall prerequisite relationship between them (i.e., whether Course A is a prerequisite for Course B).
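The course-level aggregation in the last step can be sketched as follows. The sketch covers only the summation over concept pairs. The concept-level scores are a made-up lookup table standing in for SemRefD, which in the described pipeline is computed from DBpedia. The sign convention (positive means the first concept tends to come before the second), the antisymmetry of the toy table, and all names are assumptions for illustration.

```python
from itertools import product

# Hypothetical concept-level scores standing in for SemRefD(a, b).
# Assumed convention: a positive value means concept a tends to precede b.
TOY_SCORES = {
    ("Variable", "Data structure"): 0.6,
    ("Variable", "Software design"): 0.3,
    ("Recursion", "Data structure"): 0.4,
}


def concept_score(a, b):
    """Antisymmetric lookup: score(b, a) = -score(a, b); unknown pairs give 0."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    if (b, a) in TOY_SCORES:
        return -TOY_SCORES[(b, a)]
    return 0.0


def course_prerequisite_score(concepts_a, concepts_b, score=concept_score):
    """Sum the concept-level scores over all concept pairs of two courses."""
    return sum(score(a, b) for a, b in product(concepts_a, concepts_b))


# Hypothetical concept lists for an introductory and a later course.
intro = ["Variable", "Recursion"]
advanced = ["Data structure", "Software design"]

forward = course_prerequisite_score(intro, advanced)
backward = course_prerequisite_score(advanced, intro)
print(forward, backward)
```

Under these assumptions, a clearly positive `forward` score together with a negative `backward` score would suggest that the first course comes before the second, while scores near zero in both directions would suggest a parallel relationship.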
Key Results & Contributions:
- Dropout Prediction: The GA+LSTM model achieved high accuracy, reaching 97.65% on the ARQ dataset, a 2.45% improvement over previous work by Manrique et al. (arXiv:1903.10210). Performance varied slightly across the three datasets (ADM, ARQ, CSI), and feature selection identified a different optimal feature subset for each. Some instability (multiple descent) was observed in the training loss, potentially due to dataset characteristics or hyperparameter choices (such as a high dropout regularization rate in the network).
- Systematic Review: A comprehensive review identified how Semantic Web and NLP technologies are used in CS curriculum analysis, highlighting limitations in existing similarity measures and inspiring the prerequisite identification approach.
- Course Similarity: Heatmaps visualized similarity scores between ANU COMP courses. Foundational courses (e.g., COMP1110) showed higher average similarity within their level, while similarity decreased and differentiation increased at higher levels (2000, 3000, 4000), reflecting specialization.
- Prerequisite Identification: Applied to three related ANU courses (COMP1100, COMP1110, COMP2100), the SemRefD analysis confirmed strong prerequisite relationships (COMP1100 -> COMP2100, COMP1110 -> COMP2100) and a weaker, more parallel relationship between COMP1100 and COMP1110, aligning with typical curriculum structure.
Practical Implications & Future Work:
The developed techniques offer practical applications for HEIs:
- Early Intervention: The dropout prediction model can identify at-risk students early, allowing for timely support.
- Curriculum Analysis & Design: Similarity and prerequisite identification tools can help analyze existing curricula, ensure logical sequencing, identify overlaps or gaps, and inform redesign efforts.
- Student Advising: These tools can aid advisors in guiding students through course selection.
- Recommendation Systems: Combining semantic analysis and student performance data could power course recommendation systems.
Future work includes refining the LSTM model (e.g., using dynamic time steps), using more balanced datasets for dropout prediction, and further developing tools for curriculum analysis and student support.