- The paper benchmarks six machine learning and deep learning models for short-term electricity price and demand forecasting in the Australian NEM.
- It demonstrates that tree-based ensembles, particularly GBRT, achieve the highest R² and lowest MAE for price prediction despite high MAPE, highlighting challenges in volatile markets.
- The analysis shows that while demand forecasting is more tractable with lower errors, traditional deep learning models struggle to capture extreme pricing events.
Comparative Evaluation of Machine Learning and Deep Learning Models for Short-Term Electricity Price Forecasting in the Australian NEM
Introduction
Short-term electricity price forecasting (EPF) is an operational imperative in energy markets, particularly in the volatile context of Australia's National Electricity Market (NEM). In South Australia, high price volatility, nonstationary pricing patterns, frequent negative price events, and regulatory shifts such as the transition to five-minute settlement pose severe challenges for EPF methodologies. The analyzed study (2604.23908) presents a systematic benchmark of six machine learning and deep learning models (AWMLSTM, CatBoost, GBRT, LSTM, LightGBM, and SVR) using unified datasets and consistent preprocessing pipelines, directly addressing the reproducibility and comparability concerns prevalent in the literature.
Methodology
The study constructs a comprehensive experimental framework, including feature engineering (lags, rolling statistics, cyclic temporal encodings, interaction terms), data normalization, and a chronological 85%/15% train-test split. Splitting chronologically preserves temporal order and keeps test-period observations out of training, which guards against look-ahead bias. All six models receive the same feature set and normalized data, enabling a controlled comparison across algorithmic paradigms.
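The feature construction and split described above can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline: the function names, lag choices, window length, and toy series are all assumptions, and normalization and interaction terms are omitted for brevity.

```python
import math

def build_features(prices, hours, lags=(1, 2), window=3):
    """Build lag, rolling-mean, and cyclic hour-of-day features.

    Rows without enough history are skipped, so every price-derived
    feature uses only values observed strictly before the target step.
    """
    start = max(max(lags), window)
    rows, targets = [], []
    for t in range(start, len(prices)):
        row = [prices[t - k] for k in lags]                # lag features
        row.append(sum(prices[t - window:t]) / window)     # rolling mean of past values
        angle = 2 * math.pi * hours[t] / 24
        row += [math.sin(angle), math.cos(angle)]          # cyclic encoding of the hour
        rows.append(row)
        targets.append(prices[t])
    return rows, targets

def chronological_split(rows, targets, train_fraction=0.85):
    """Split in time order: earliest 85% for training, latest 15% for testing."""
    cut = int(len(rows) * train_fraction)
    return (rows[:cut], targets[:cut]), (rows[cut:], targets[cut:])

# Toy series; the values are illustrative and not taken from the study.
prices = [50.0 + 10.0 * math.sin(i / 3.0) for i in range(43)]
hours = [i % 24 for i in range(43)]
X, y = build_features(prices, hours)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
```

Because the split is a single cut in time order, no shuffling occurs and every test target lies after every training target.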
Model Portfolio
- AWMLSTM: Attention-augmented LSTM capturing long-term and salient temporal dependencies.
- CatBoost: Categorical boosting trees leveraging ordered boosting and native categorical feature support.
- GBRT: Classic gradient boosting regression trees, standard for ensemble-based nonlinear regression.
- LSTM: Vanilla recurrent network for sequential modeling with cell-based memory.
- LightGBM: Leaf-wise boosting with histogram-based optimization for scalability.
- SVR: Kernel-based support vector regression, agnostic to sequential structure.
Each model is evaluated on both price and demand prediction targets using four error metrics: MAE, MAPE, MSE, and the coefficient of determination (R²).
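For reference, the four metrics can be computed as in the sketch below, which uses their standard textbook definitions. The function name and sample values are hypothetical, and the summary does not state how the study handles actual values of exactly zero in MAPE.

```python
def regression_metrics(actual, predicted):
    """Return MAE, MSE, MAPE (in percent), and R² for paired sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # MAPE divides by each actual value, so an actual of exactly zero
    # raises ZeroDivisionError in this sketch.
    mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                     # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "MAPE": mape, "R2": r2}

# Hypothetical values chosen for easy hand-checking.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 380.0]
metrics = regression_metrics(actual, predicted)
```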
Empirical Results
Tree-based models (GBRT, CatBoost, and LightGBM) consistently outperform both the baseline deep learning model (LSTM) and SVR across all relevant error metrics. Notably, GBRT achieves the highest R² (0.88) and the lowest MAE (13.25) for price prediction. However, mean absolute percentage errors (MAPE) are universally high (all models exceed 90%; SVR surpasses 300%), and over 65% of GBRT predictions incur relative errors above 10%. Errors of this magnitude, despite large feature sets and advanced models, quantify the difficulty of short-term EPF under South Australian market conditions.
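A low MAE alongside a very high MAPE is not contradictory. MAPE divides each error by the actual value, so prices near zero (or negative) inflate it sharply. The sketch below illustrates that general property with hypothetical prices; whether this mechanism accounts for the figures reported in the study is an assumption, not something the summary confirms.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical prices: every prediction is off by exactly 5.
typical_actual = [80.0, 100.0, 120.0, 90.0]
typical_pred = [85.0, 95.0, 125.0, 85.0]

# Same absolute errors, but two actual prices sit close to zero (one is negative).
near_zero_actual = [80.0, 1.0, -2.0, 90.0]
near_zero_pred = [85.0, -4.0, 3.0, 85.0]

typical_scores = (mae(typical_actual, typical_pred), mape(typical_actual, typical_pred))
near_zero_scores = (mae(near_zero_actual, near_zero_pred), mape(near_zero_actual, near_zero_pred))
```

Both cases have the same MAE, yet the second has a MAPE well above 100% because two denominators are tiny.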
Deep learning models (AWMLSTM, LSTM) do not deliver competitive results for EPF. LSTM yields the lowest R² (0.68) and extreme MAPE, indicating poor adaptation to volatility and a poor fit to rare price spikes and abrupt regime shifts. SVR provides the weakest performance, with MAPE above 300%, due to inadequate temporal modeling capacity and sensitivity to kernel selection in high dimensions.
In contrast to price forecasting, demand prediction is significantly more tractable. AWMLSTM and GBRT both achieve an R² of 0.96, MAE below 61.1, and MAPE below 32%. GBRT demonstrates the highest proportion of predictions within 5% (74.37%) and 10% (84.87%) error of ground truth values. LightGBM and CatBoost are also highly competitive, albeit with slight increases in bias and variance, particularly at extreme peaks and troughs, likely a function of limited sample support for rare demand events.
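The "within 5%" and "within 10%" figures are shares of test points whose relative error falls under a tolerance. A minimal sketch of that calculation follows. The function name and demand values are hypothetical, and the study's exact definition of relative error is assumed to be absolute error divided by the actual value.

```python
def share_within(actual, predicted, tolerance):
    """Percentage of predictions whose relative error is at most `tolerance`."""
    hits = sum(
        1 for a, p in zip(actual, predicted)
        if abs(a - p) <= tolerance * abs(a)
    )
    return 100.0 * hits / len(actual)

# Hypothetical demand values in MW; relative errors are 3%, 7.5%, 0%, 11.1%, 4%.
actual = [1000.0, 1200.0, 1500.0, 1800.0, 2000.0]
predicted = [1030.0, 1290.0, 1500.0, 1600.0, 2080.0]

within_5 = share_within(actual, predicted, 0.05)
within_10 = share_within(actual, predicted, 0.10)
```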
While AWMLSTM shows strength in capturing overall demand periodicity, the tree ensembles outperform in accuracy and robustness to unusual patterns. LSTM and SVR fail to provide reliable demand estimates, suffering from both systematic bias and poor tracking of extremes.
Discussion: Numerical Benchmarks and Implications
The study's key numerical findings are:
- For price forecasting, no model achieves operationally low errors: GBRT is best on R² (0.88) and MAE (13.25), but high MAPE (above 90% for every model) and a large share of predictions with more than 10% relative error are common to all models.
- For demand forecasting, both GBRT and AWMLSTM achieve R² = 0.96, with significantly lower error rates (MAPE < 32% for top models) and most test points predicted within ±10% error.
- LSTM and SVR are substantially inferior for both tasks, suggesting limited value from classic deep learning or kernel-based methods for these datasets and feature regimes in the NEM context.
The counterintuitive finding that deep learning (LSTM, AWMLSTM) underperforms tree ensembles is particularly notable given widespread claims regarding the superiority of deep learning for sequential data. The results identify major shortcomings of LSTM variants in highly volatile, nonstationary, and sparse-event domains like electricity pricing in deregulated, renewable-dominated markets.
The high MAPE across all price prediction models suggests that current methodological approaches—despite advanced preprocessing—fail to capture extreme, abrupt market events and price spikes, which remain a forecasting bottleneck in the NEM setting.
Theoretical and Practical Implications
The empirical evidence implies several directions for both research and practice:
- Tree-based ensembles (GBRT, LightGBM, CatBoost) retain dominance in structured, tabular time series where regime changes and volatility dominate the error landscape.
- Deep learning approaches require new architectures (e.g., transformer hybrids) or auxiliary mechanisms to contend with rare event forecasting and high-frequency volatility.
- The extreme inaccuracy in price forecasts (>90% MAPE) emphasizes the need for hybrid models, robust error correction schemes, and the inclusion of additional exogenous factors (weather, renewables, macroeconomic signals).
- Future systems should implement dynamic model switching or regime-aware ensembles to mitigate the inadequacy of single-model deployments in volatile markets.
- For demand prediction, integrating additional external features (weather, event calendars) and focusing on extreme-point accuracy may further narrow the already low error rates.
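The idea of dynamic model switching mentioned in the list above could look roughly like the sketch below. It is a hypothetical illustration, not a method from the study: the volatility measure, the threshold of 50, and the two placeholder forecasters standing in for trained models are all assumptions.

```python
import statistics

def rolling_volatility(prices, window):
    """Population standard deviation of the most recent `window` prices."""
    return statistics.pstdev(prices[-window:])

def regime_aware_forecast(history, calm_model, volatile_model, window=6, threshold=50.0):
    """Pick a forecaster based on recent volatility, then forecast one step."""
    volatile = rolling_volatility(history, window) > threshold
    regime = "volatile" if volatile else "calm"
    model = volatile_model if volatile else calm_model
    return regime, model(history)

# Placeholder forecasters (hypothetical stand-ins for trained models).
def persistence(history):
    return history[-1]                      # repeat the last observed price

def recent_median(history):
    return statistics.median(history[-6:])  # robust to isolated spikes

calm_history = [60.0, 62.0, 61.0, 63.0, 60.0, 62.0]
spiky_history = [60.0, 62.0, 61.0, 900.0, 65.0, 300.0]

calm_result = regime_aware_forecast(calm_history, persistence, recent_median)
spiky_result = regime_aware_forecast(spiky_history, persistence, recent_median)
```

In practice the regime detector and the per-regime models would each need validation on held-out data. A hard threshold is only the simplest possible switching rule.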
Future Directions
The study identifies several avenues for improving EPF:
- Hybrid models (tree ensembles + transformer layers) to specifically target episodic price spikes and nonstationarity.
- Data augmentation and synthetic resampling around extreme events to increase tail event awareness.
- Advanced error correction layers to systematically debias model tendencies towards under/overestimation.
- Explicit modeling of volatility regimes, potentially via regime-switching or meta-learning frameworks, for both price and demand forecasts.
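Resampling around extreme events, one of the directions listed above, can be illustrated with simple oversampling of high-target training rows. This is a sketch under stated assumptions: the quantile cutoff, the number of copies, and the function name are hypothetical. Only the upper tail is handled here, although negative price events form a second tail.

```python
def oversample_extremes(rows, targets, quantile=0.9, copies=3):
    """Repeat training rows whose target is at or above the chosen quantile.

    Apply this to the training portion only, so that test data keeps its
    original distribution.
    """
    ranked = sorted(targets)
    cutoff = ranked[int(quantile * (len(ranked) - 1))]
    out_rows, out_targets = [], []
    for row, target in zip(rows, targets):
        repeat = copies if target >= cutoff else 1
        out_rows.extend([row] * repeat)
        out_targets.extend([target] * repeat)
    return out_rows, out_targets, cutoff

# Hypothetical training targets with a single price spike.
targets = [40.0, 55.0, 60.0, 48.0, 52.0, 45.0, 58.0, 50.0, 1200.0, 47.0, 53.0]
rows = [[i] for i in range(len(targets))]

aug_rows, aug_targets, cutoff = oversample_extremes(rows, targets)
```

Duplicating rows raises the weight of tail events in the loss without inventing new values. Generating synthetic samples would be a separate technique with its own risks.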
Conclusion
This study delivers a rigorous, directly comparable analysis of six learning architectures for short-term price and demand forecasting in the South Australian NEM. Tree-based ensemble models, in particular GBRT, consistently outperform deep recurrent and kernel methods, especially in volatile, high-penetration renewable environments. However, price forecasting remains fundamentally challenging, with all models failing to achieve actionable accuracy on volatile, nonstationary series. Future progress depends on hybrid model innovations, robust data engineering for tail events, and adaptive frameworks that respond to rapid regime changes in both price and demand contexts.