Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods (1705.03233v5)

Published 9 May 2017 in cs.CE and q-fin.TR

Abstract: Managing the prediction of metrics in high-frequency financial markets is a challenging task. An efficient way is by monitoring the dynamics of a limit order book to identify the information edge. This paper describes the first publicly available benchmark dataset of high-frequency limit order markets for mid-price prediction. We extracted normalized data representations of time series data for five stocks from the NASDAQ Nordic stock market for a time period of ten consecutive days, leading to a dataset of ~4,000,000 time series samples in total. A day-based anchored cross-validation experimental protocol is also provided that can be used as a benchmark for comparing the performance of state-of-the-art methodologies. Performance of baseline approaches is also provided to facilitate experimental comparisons. We expect that such a large-scale dataset can serve as a testbed for devising novel solutions of expert systems for high-frequency limit order book data analysis.

Citations (108)

Summary

The paper introduces a new, publicly available benchmark dataset from NASDAQ Nordic for training machine learning models to forecast mid-prices using limit order book data.
The dataset comprises approximately 4,000,000 samples across five stocks; the paper also describes three normalization techniques and a day-based anchored cross-validation protocol.
Baseline experiments using ridge regression and a single hidden layer feedforward neural network achieved an average F1 score of approximately 46%, giving later work on high-frequency trading analysis a reference point for comparison.

Analysis of Limit Order Book Data for Mid-Price Forecasting

The paper "Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods" by Ntakaris et al. introduces a dataset for studying machine learning (ML) applications to high-frequency trading (HFT). The authors compile a publicly accessible dataset from the NASDAQ Nordic stock market that spans ten consecutive trading days across five stocks, resulting in approximately 4,000,000 samples. The dataset is intended as a benchmark for predicting the mid-price (the average of the best bid and best ask prices) from limit order book (LOB) data, a task of practical interest to market participants who want to draw actionable signals from high-frequency data.
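As a minimal sketch of the quantity being forecast, the snippet below computes a mid-price from the best bid and best ask and turns a future mid-price into a direction label. The function names, the relative-change rule, and the `threshold` parameter are illustrative assumptions, not the labeling scheme defined in the paper.

```python
def mid_price(best_bid: float, best_ask: float) -> float:
    """Return the mid-price: the average of the best bid and best ask."""
    if best_ask < best_bid:
        raise ValueError("crossed book: best ask is below best bid")
    return (best_bid + best_ask) / 2.0


def direction(current_mid: float, future_mid: float, threshold: float = 0.0) -> str:
    """Label a mid-price move as 'up', 'down', or 'stationary'.

    Hypothetical rule: compare the relative change against a threshold.
    The paper's actual labeling may differ.
    """
    change = (future_mid - current_mid) / current_mid
    if change > threshold:
        return "up"
    if change < -threshold:
        return "down"
    return "stationary"


if __name__ == "__main__":
    m = mid_price(100.0, 101.0)
    print(m)
    print(direction(m, 101.5, threshold=0.005))
```

A nonzero threshold keeps very small moves in the stationary class, which is one common way to frame mid-price forecasting as a three-class problem.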

Methodology and Dataset Overview

The dataset is compiled from the ITCH data feed, which preserves the time-ordered sequence of market events. It covers order submissions, cancellations, trade executions, and other market-relevant messages. The authors describe in detail the extraction and normalization steps used to transform raw ITCH data into a structured, usable format, which matters for model accuracy in such a data-rich and noisy environment.

Two aspects of the paper stand out: its normalization options and its cross-validation protocol. The dataset is provided under three normalization techniques (z-score, min-max, and decimal precision), catering to different preprocessing needs. The experimental protocol is a day-based anchored cross-validation: the training set is anchored at the first day and grows by one day per fold, while testing always uses the day that follows. With ten days this yields nine folds, and no fold trains on data that comes after its test day, which guards against look-ahead bias.
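The three normalizations and the anchored splits can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the statistics (`mean`, `std`, `lo`, `hi`, `k`) are supplied by the caller because the summary does not say which data they are estimated from, and decimal precision is taken to mean division by a power of ten.

```python
import numpy as np


def z_score(x: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    return (x - mean) / std


def min_max(x: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Min-max normalization: map the range [lo, hi] onto [0, 1]."""
    return (x - lo) / (hi - lo)


def decimal_precision(x: np.ndarray, k: int) -> np.ndarray:
    """Decimal precision normalization: shift the decimal point by k places."""
    return x / (10.0 ** k)


def anchored_day_splits(n_days: int):
    """Yield (train_days, test_day) pairs for day-based anchored cross-validation.

    Days are numbered from 1. Every fold trains on days 1..d and tests on day d+1.
    """
    for test_day in range(2, n_days + 1):
        yield list(range(1, test_day)), test_day


if __name__ == "__main__":
    prices = np.array([250.0, 255.0, 260.0])
    print(z_score(prices, prices.mean(), prices.std()))
    print(min_max(prices, prices.min(), prices.max()))
    print(decimal_precision(prices, 3))
    for train_days, test_day in anchored_day_splits(10):
        print(train_days, "->", test_day)
```

Because each fold tests on a day later than all of its training days, the splits respect time ordering, unlike a shuffled k-fold split.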

Empirical Evaluation

To establish baselines, the authors implement ridge regression as a linear model and a single hidden layer feedforward neural network (SLFN) as a nonlinear one. The results vary across normalization strategies and prediction horizons. The F1 score, the harmonic mean of precision and recall, averaged approximately 46% across experiments. The summary treats this as a reasonable starting point given the noise in the data and the early stage of ML work on this task, and it leaves clear room for stronger models.
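To make the linear baseline and the metric concrete, here is a minimal sketch of a ridge regression classifier with a per-class averaged F1 score. It assumes one-hot class targets, a closed-form solution, and macro averaging. These are common choices rather than details confirmed by the summary, and all names (`fit_ridge`, `predict`, `macro_f1`, `lam`) are hypothetical.

```python
import numpy as np


def fit_ridge(X: np.ndarray, y: np.ndarray, n_classes: int, lam: float = 1.0) -> np.ndarray:
    """Fit ridge regression to one-hot class targets in closed form.

    Solves (X^T X + lam * I) W = X^T T, where T holds the one-hot targets.
    """
    T = np.eye(n_classes)[y]                    # shape (n_samples, n_classes)
    A = X.T @ X + lam * np.eye(X.shape[1])      # regularized Gram matrix
    return np.linalg.solve(A, X.T @ T)          # shape (n_features, n_classes)


def predict(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Predict the class whose regression output is largest."""
    return np.argmax(X @ W, axis=1)


def macro_f1(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int) -> float:
    """Unweighted mean of per-class F1 scores (F1 = 2TP / (2TP + FP + FN))."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


if __name__ == "__main__":
    # Toy data: a bias column plus one feature, two classes.
    X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0, 0, 1, 1])
    W = fit_ridge(X, y, n_classes=2, lam=1e-3)
    print(predict(X, W), macro_f1(y, predict(X, W), n_classes=2))
```

In a real experiment the fit and the evaluation would run once per cross-validation fold, with the scores averaged over folds.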

Implications and Future Directions

The implications of the dataset and findings are multifaceted. Practically, successful application in predicting mid-price movements offers tangible benefits for market participants, including liquidity providers and speculative traders. By understanding market stability and predicting price movements, market makers can enhance liquidity provision, and traders can better forecast market shifts.

The dataset also opens avenues for research into market manipulation tactics such as order book spoofing. Analysts could use it to build models that flag abnormal order book patterns indicative of manipulation attempts, which would be useful to regulators.

Conclusion

The authors contribute a foundational dataset and baseline methods that simplify subsequent exploration of ML techniques in HFT. The paper lays groundwork for future investigations into models that predict LOB dynamics more accurately. Future research could integrate more sophisticated neural architectures, such as deep learning models, to improve predictive performance and uncover further insights into market microstructure. As these models are refined, the financial sector stands to benefit from more precise forecasting of market behavior.
