Open University Learning Analytics Dataset (OULAD)

Updated 18 January 2026
  • The OULA Dataset is a multi-relational, anonymized educational resource capturing student demographics, registrations, assessments, and VLE interactions for empirical analyses.
  • It supports diverse predictive models including time series, deep learning, and network-based techniques to forecast student outcomes and engagement levels.
  • The dataset's comprehensive schema and preprocessing protocols establish a benchmark for research in MOOC modeling, early-warning systems, and educational data mining.

The Open University Learning Analytics Dataset (OULAD) is a large-scale, multi-relational, anonymized educational dataset from the UK Open University, systematically designed to support empirical research in learning analytics, student success prediction, MOOC modeling, and early-warning systems. The resource captures student demographics, registration records, assessment performance, and granular Virtual Learning Environment (VLE) interaction logs, providing a foundation for state-of-the-art machine learning, time series, and network-based analysis. The dataset's consistent, well-documented schema, substantial scale (32,593 students across seven major modules), and inclusion of longitudinal, behavioral, and contextual features have made it a benchmark for research in predictive analytics and educational data mining.

1. Data Provenance, Scope, and Structure

The OULAD, as released by Kuzilek et al. (2017) and documented in multiple subsequent studies, covers 32,593 unique students enrolled in 22 module presentations ("presentations" denote specific semester runs of one of seven modules, covering both STEM and Social Sciences) (Junejo et al., 2024, Howard, 14 Jan 2025, Ayady et al., 17 Jul 2025, Tertulino, 23 Aug 2025, Muresan et al., 11 Jan 2026). Each presentation typically spans 9 months. The relational schema comprises:

  • studentInfo.csv: demographic and background information, including age_band, gender, highest_education (ordered categorical), imd_band (Index of Multiple Deprivation quintile), disability status, and final outcome (final_result ∈ {Distinction, Pass, Fail, Withdrawn}).
  • studentRegistration.csv: registration and (if applicable) unregistration dates, studied_credits, number of previous attempts.
  • assessments.csv and studentAssessment.csv: metadata for each assessment (type, due date, weight) and per-student submissions (score, submission date, is_banked flag).
  • studentVLE.csv and vle.csv: per-student, per-resource logs of click counts, mapped to VLE activity types (e.g., quiz, forum, homepage).

Foreign keys (code_module, code_presentation, id_student) link the tables. The full structure enables reconstruction of each student’s academic trajectory and engagement with temporal and categorical precision.
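As a sketch of how the tables link on these foreign keys, the following uses tiny inline frames in place of the real CSVs (in practice each would come from pd.read_csv on the corresponding file; column names follow the standard OULAD release):

```python
import pandas as pd

# In practice: pd.read_csv("studentInfo.csv"), pd.read_csv("studentVle.csv").
# Tiny inline frames stand in here; column names follow the OULAD release.
student_info = pd.DataFrame({
    "code_module": ["AAA", "AAA"],
    "code_presentation": ["2013J", "2013J"],
    "id_student": [11391, 28400],
    "final_result": ["Pass", "Withdrawn"],
})
student_vle = pd.DataFrame({
    "code_module": ["AAA"] * 3,
    "code_presentation": ["2013J"] * 3,
    "id_student": [11391, 11391, 28400],
    "sum_click": [4, 2, 7],
})

# Link tables on the shared foreign keys and attach a first engagement
# feature: total clicks per student within a module presentation.
keys = ["code_module", "code_presentation", "id_student"]
clicks = (
    student_vle.groupby(keys)["sum_click"]
    .sum()
    .rename("total_clicks")
    .reset_index()
)
students = student_info.merge(clicks, on=keys, how="left").fillna({"total_clicks": 0})
```

The same three-key join pattern extends to the registration and assessment tables, which is how each student's full trajectory is reconstructed.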

2. Feature Engineering and Preprocessing Protocols

OULAD supports a wide spectrum of feature engineering strategies, with canonical workflows including:

  • Demographic and contextual features: Extracted directly from studentInfo, encoded as factors or, when appropriate, one-hot encoded.
  • Assessment and registration metrics: Engineered features include:

    • Studied credits (numeric), num_of_prev_attempts (integer).
    • Total registration days, computed as unregistration_date − registration_date; unregistration_date is imputed as 270 (course nominal length) for non-withdrawn cases.
    • Assessment summary statistics (means, weights by module/presentation), calculated as:

    $$\mathrm{weight}_{m,p} = \frac{1}{|\mathcal{A}_{m,p}|} \sum_{a \in \mathcal{A}_{m,p}} \mathrm{weight}_a$$

  • Clickstream aggregation: Daily total_clicks per student, per resource, per module/presentation. Where R is the set of VLE resource records for student s on date d:

    $$\mathrm{total\_clicks}_{s,m,p,d} = \sum_{r \in R_{s,m,p,d}} \mathrm{clicks}_r$$

Data are further aggregated to weekly or cumulative counts, or pivoted by activity_type for behavioral modeling (Ayady et al., 17 Jul 2025, Tertulino, 23 Aug 2025).

  • Handling missing values and class imbalance: Unregistration_date imputed as 270 for non-withdrawn; missing clickstream or assessment features are assigned zero; class imbalance addressed with explicit class weights (e.g., Distinction=1.5, Fail=1.5, Pass=1.0, Withdrawn=1.0) (Junejo et al., 2024), or with SMOTE on local data silos for federated-learning protocols (Tertulino, 23 Aug 2025).
  • Numerical scaling: Standard z-score normalization for all numeric features:

    $$x_i' = \frac{x_i - \mu}{\sigma}$$

  • Temporal filtering for early prediction: Datasets prepared for fractions $f \in \{5\%, 10\%, 20\%, \dots, 100\%\}$ of a module, facilitating early-warning experimentation.
  • Label encoding: For multiclass tasks, the integer mapping 0=Distinction, 1=Fail, 2=Pass, 3=Withdrawn is used (Junejo et al., 2024).
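The registration-day imputation, z-score scaling, and label mapping above can be sketched as follows (illustrative values; the 270-day nominal length and the integer label map come from the text):

```python
import numpy as np
import pandas as pd

COURSE_LENGTH = 270  # nominal module length (days), used for imputation

# Illustrative registration records: dates are days relative to module
# start; NaN unregistration means the student did not withdraw.
reg = pd.DataFrame({
    "date_registration": [-30, -10, -5],
    "date_unregistration": [np.nan, 60.0, np.nan],
})

# Impute unregistration as the nominal course length for non-withdrawn
# students, then derive total registration days.
reg["date_unregistration"] = reg["date_unregistration"].fillna(COURSE_LENGTH)
reg["total_reg_days"] = reg["date_unregistration"] - reg["date_registration"]

# z-score normalization of a numeric feature
x = reg["total_reg_days"]
reg["total_reg_days_z"] = (x - x.mean()) / x.std(ddof=0)

# Multiclass label mapping used for outcome prediction
label_map = {"Distinction": 0, "Fail": 1, "Pass": 2, "Withdrawn": 3}
```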

3. Application Scenarios and Research Methodologies

OULAD has been foundational in a range of prediction and analytics pipelines:

  • Multiclass outcome forecasting using DNNs, 1D-CNN, LSTM and baseline models (Random Forests, ANN-LSTM): Predicting outcomes across {Distinction, Pass, Fail, Withdrawn}, outperforming prior binary-focused studies and enabling student stratification at early course stages (Junejo et al., 2024).
  • Time series modeling: Weekly multivariate time series are constructed to model engagement as $X_i = \{x_{i,1}, x_{i,2}, \ldots\}$, with $x_{i,t} \in \mathbb{R}^d$ the vector of activity_type-wise clicks per week; classification using DTW-KNN, MLP, LSTM, and FCN architectures (Ayady et al., 17 Jul 2025).
  • Dynamic features and heterogeneous graph neural networks: Introduction of “Partial Grade” (cumulative weighted assignment score up to time t) as a dynamic feature tracked at 13 checkpoints through the semester, embedded in node features for registration nodes in a bipartite or metapath-based GNN (HAN/HGT) architecture (Muresan et al., 11 Jan 2026).
  • Federated learning and privacy-aware early warnings: Institution-level data silos (modules as clients) with federated Logistic Regression and DNN models, incorporating early performance (average assessment score in first 90 days) and digital engagement (aggregate and by-activity clicks), with local data balancing via SMOTE (Tertulino, 23 Aug 2025).
  • Accessible R-based preprocessing: The ouladFormat package exposes single-call functions to intake and join multi-table OULAD schemas into user-ready tibbles, supporting reproducibility, comparability, and rapid exploratory analysis (Howard, 14 Jan 2025).
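The weekly multivariate series used in the time-series pipelines can be sketched as a week-by-activity_type pivot (toy clickstream; column names follow the studentVle/vle join described above):

```python
import pandas as pd

# Toy clickstream after joining studentVle with vle.csv: `date` is the day
# relative to module start, activity_type comes from the vle table.
logs = pd.DataFrame({
    "id_student": [1, 1, 1, 2],
    "date": [0, 3, 8, 1],
    "activity_type": ["quiz", "forum", "quiz", "homepage"],
    "sum_click": [5, 2, 3, 4],
})

# Bin days into weeks, then pivot so each student yields a
# (week x activity_type) multivariate series x_{i,t}.
logs["week"] = logs["date"] // 7
series = logs.pivot_table(
    index=["id_student", "week"],
    columns="activity_type",
    values="sum_click",
    aggfunc="sum",
    fill_value=0,
)
```

Each student's rows in `series` form the per-week feature vectors fed to DTW-KNN, LSTM, or FCN classifiers.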

4. Class Distributions, Feature Statistics, and Data Properties

OULAD exhibits non-uniform outcome class distribution with a Pass-majority and minority Withdrawn class:

Class         Count    Percentage
Distinction    5,860       17.98%
Pass          19,388       59.49%
Fail           4,312       13.23%
Withdrawn      3,033        9.30%

Correlations with the final_result outcome reveal that total_reg_days is negatively associated ($r = -0.30$), with additional but weaker contributions from studied_credits, highest_education, imd_band, num_of_prev_attempts, and total_clicks (Junejo et al., 2024). After filtering for “at-risk” prediction focused on completers, the data reduce to 22,437 examples (at_risk = 1 in 18.7%, at_risk = 0 in 81.3%) (Tertulino, 23 Aug 2025). After data balancing (SMOTE), each class is at approximate parity.
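A minimal sketch of this class distribution and the explicit class-weighting scheme described earlier (counts from the table; weights from Junejo et al., 2024):

```python
from collections import Counter

# Outcome labels replicated from the distribution table above.
final_results = (
    ["Pass"] * 19388 + ["Distinction"] * 5860
    + ["Fail"] * 4312 + ["Withdrawn"] * 3033
)

counts = Counter(final_results)
n = len(final_results)
shares = {c: round(100 * v / n, 2) for c, v in counts.items()}

# Explicit class weights reported in the text (Junejo et al., 2024):
# the two minority outcome classes are up-weighted during training.
class_weights = {"Distinction": 1.5, "Fail": 1.5, "Pass": 1.0, "Withdrawn": 1.0}
sample_weights = [class_weights[c] for c in final_results]
```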

Key engineered feature statistics before and after balancing:

Feature                  Mean (pre)   Mean (post SMOTE)
average_early_score            58.2                58.2
early_assessments_cnt           2.8                 3.5
total_clicks                  210.4               240.6
clicks_on_quiz                 35.7                42.0
clicks_on_forum                12.1                15.3

5. Model Training, Splitting, and Evaluation Protocols

Typical experimental protocols employ:

  • Stratified train/validation/test splits, often with a 70/30 partition, further reserving 10% of training for validation (Keras’s validation_split).
  • Cohort-based splits (e.g., 2013B+2013J for training, 2014B+2014J for testing) for strict temporal generalization (Ayady et al., 17 Jul 2025).
  • Time-point or “early fraction” analysis: For early-warning, models are evaluated with only the first $f$% or first $t$ weeks of data available, illustrating true cold-start and early-intervention performance (Junejo et al., 2024, Muresan et al., 11 Jan 2026).
  • Metrics: Accuracy, precision, recall, F1-score, and ROC AUC for binary tasks.
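The stratified 70/30 split with a further 10% validation hold-out can be sketched with scikit-learn (synthetic stand-in data; assumes scikit-learn is available):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 students, 5 features, 4 outcome classes
# with roughly OULAD-like prevalence.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.choice([0, 1, 2, 3], size=1000, p=[0.18, 0.13, 0.60, 0.09])

# 70/30 stratified train/test split ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
# ... then hold out 10% of the training set for validation
# (the in-framework equivalent is Keras's validation_split).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.10, stratify=y_train, random_state=42
)
```

Stratification keeps the minority Withdrawn and Fail classes represented at their population rates in every partition.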

State-of-the-art models report multicategory F1-scores on the order of 0.62–0.68 (20 weeks in STEM courses), up to 0.94 F1 (binary, 40 weeks, dense STEM), and early performance of 0.68 F1 using graph deep learning with dynamic assessment features covering only 7% of term data (Ayady et al., 17 Jul 2025, Muresan et al., 11 Jan 2026). Model comparisons consistently show deep architectures (CNN, LSTM, FCN, HGT) outperforming classical baselines, particularly for early detection and with denser behavioral data.

6. Challenges, Limitations, and Best Practices

Challenges in using OULAD include:

  • Class imbalance, especially “Withdrawn” and “Fail,” requiring explicit balancing or class-weighting for robust model training (Junejo et al., 2024, Tertulino, 23 Aug 2025).
  • Clickstream sparsity, especially for low-engagement or Social Sciences modules, motivating aggregation, hybrid models, or imputation protocols (Ayady et al., 17 Jul 2025).
  • Missing values, addressed via default imputation (e.g., 0 for missing clicks, nominal 270 for unregistration_date).
  • Potential label leakage: Exclusion of exam scores from features to prevent trivial temporal leakage in performance modeling (Muresan et al., 11 Jan 2026).
  • Limited generalization: Data restricted to seven modules and 2013–2014; broader MOOC datasets (e.g., FutureLearn) are not included (Howard, 14 Jan 2025).
  • High per-record dimensionality and storage demands: Full clickstream tables consist of >10 million events, necessitating efficient storage and processing strategies.
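One way to keep memory bounded when aggregating the >10-million-row clickstream is chunked streaming; a sketch with pandas (a small in-memory CSV stands in for the real studentVle.csv):

```python
import io
import pandas as pd

# A small in-memory CSV stands in for the >10M-row studentVle.csv; in
# practice pass the file path to pd.read_csv with the same chunksize.
csv = io.StringIO(
    "id_student,date,sum_click\n"
    "1,0,5\n"
    "1,3,2\n"
    "2,1,4\n"
    "1,8,3\n"
    "2,2,1\n"
)

# Stream the clickstream in chunks, keeping only running per-student
# totals instead of the raw event log.
totals = {}
for chunk in pd.read_csv(csv, chunksize=2):
    partial = chunk.groupby("id_student")["sum_click"].sum()
    for sid, n_clicks in partial.items():
        totals[sid] = totals.get(sid, 0) + n_clicks
```

Only the aggregate dictionary grows with the number of students, not with the number of click events.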

Best practices include feature aggregation at multiple granularities, inclusion of both demographic/contextual and behavioral data, early-fraction validation, and cross-course transfer learning for robust model development (Ayady et al., 17 Jul 2025, Howard, 14 Jan 2025). The ouladFormat R package formalizes much of this preprocessing for reproducibility.

7. Research Impact and Methodological Advances

OULAD has catalyzed innovations in learning analytics, providing a standards-based testbed for:

  • Multicategory early-warning systems with DNN and CNN (Junejo et al., 2024).
  • Temporal and sequential modeling via LSTM/FCN and time series methods (Ayady et al., 17 Jul 2025).
  • Graph deep learning for relational academic analysis, leveraging dynamic feature construction and cross-instance aggregation (Muresan et al., 11 Jan 2026).
  • Federated and privacy-preserving learning analytics, simulating institutional data silos and local balancing strategies (Tertulino, 23 Aug 2025).
  • Accessible, reproducible data handling via programmatic tools (ouladFormat R package), lowering research barriers and facilitating fair cross-study comparison (Howard, 14 Jan 2025).

The dataset’s enduring influence stems from its comprehensive schema, public availability, and versatility in supporting new algorithms and reproducible benchmarks in education-focused data science.
