MacroHFT: Memory-Augmented Crypto HFT
- MacroHFT is a reinforcement learning framework that uses memory modules and context-aware sub-agents to dynamically adapt to non-stationary cryptocurrency markets.
- It decomposes market data into regime-specific segments, training specialized sub-agents whose policies are soft-mixed by a hyper-agent for robust decision-making.
- Empirical results demonstrate significant enhancements in profitability and risk-adjusted metrics across major crypto pairs compared to traditional HFT methods.
MacroHFT is a memory-augmented, context-aware reinforcement learning (RL) framework for high-frequency trading (HFT) on cryptocurrency markets, designed to address conventional RL vulnerabilities such as overfitting and limited adaptability to non-stationary financial conditions. The system combines multiple specialized sub-agents—each conditioned on distinct market regimes of trend and volatility—with a hyper-agent via soft policy mixing and an episodic memory module, yielding a meta-policy capable of robust decision-making under rapid market fluctuations and rare events. Extensive empirical evaluation demonstrates state-of-the-art performance in minute-level trading across major cryptocurrency pairs, with strong improvements in both profitability and risk-adjusted metrics (Zong et al., 2024).
1. Problem Formulation in MacroHFT
MacroHFT targets discrete-time, minute-level HFT in high-dimensional financial microstructures. At each time step $t$, the system observes an $N$-level limit order book
$$\left\{\left(p_t^{b,i},\, q_t^{b,i},\, p_t^{a,i},\, q_t^{a,i}\right)\right\}_{i=1}^{N},$$
where $p_t^{b,i}, p_t^{a,i}$ and $q_t^{b,i}, q_t^{a,i}$ are the $i$-th best bid/ask prices and sizes. The state further comprises an OHLCV snapshot and technical indicators calculated over a fixed lookback window of recent minutes.
The agent's position at each step is $P_t \in \{0, 1\}$, denoting holding either no coin or one unit. The net account value is $V_t = C_t + P_t\, p_t$, where $C_t$ is cash and $p_t$ the close price.
MacroHFT formulates the trading process as a two-level Markov Decision Process (MDP):
- Low-level MDP (sub-agents): States $s_t = (x_t, m_t, P_t)$, where $x_t$ are static features (LOB, OHLCV, indicators) and $m_t$ are context features (historical sequences of local trend and volatility).
- High-level MDP (hyper-agent): States $S_t = (x_t, M_t, P_t)$, with $M_t$ including all information from $m_t$ and aggregating context (trend slope, volatility) over a longer global window.
Actions $a_t \in \{0, 1\}$ represent target positions; the reward for both levels is
$$r_t = a_t\,(p_{t+1} - p_t) - \delta\,|a_t - P_t|\,p_t,$$
with $\delta$ as the transaction cost rate and $|a_t - P_t|$ the trade size.
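The reward can be made concrete with a short sketch; the function and the cost rate below are illustrative assumptions, not values from the paper:

```python
def step_reward(price_now, price_next, position, action, delta=2e-4):
    """One-step trading reward: P&L of the target position minus a
    proportional transaction cost on the traded size |action - position|.
    `delta` is an assumed cost rate for illustration only."""
    pnl = action * (price_next - price_now)
    cost = delta * abs(action - position) * price_now
    return pnl - cost

# Flat position, buy one unit just before a 10-unit up-move:
r = step_reward(100.0, 110.0, position=0, action=1)  # 10.0 P&L minus 0.02 cost
```

Note that holding an unchanged position incurs no cost, so the reward reduces to pure mark-to-market P&L.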
2. Sub-Agent Specialization and Conditional Context Adaptation
2.1 Market Regime Decomposition
The input time series is segmented into chunks of fixed length (e.g., 360 or 4320 minutes). For each chunk:
- Compute (i) a de-noised trend slope (trend label), e.g., by fitting a line to smoothed prices; (ii) the average realized volatility (volatility label).
- Partition the chunks into three slope quantiles (Bear, Flat, Bull) and, independently, into three volatility quantiles (Stable, Medium, Volatile), producing six training subsets (three per decomposition).
- Train a dedicated sub-agent for each subset, selecting the best epoch according to validation performance.
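The labeling step above can be sketched as follows; the smoothing kernel and estimators are our own simplifying assumptions, not the paper's exact recipe:

```python
import numpy as np

def label_chunks(prices, chunk_len=360, smooth=10):
    """Label each chunk with a de-noised trend slope and a realized
    volatility, then bucket each statistic into terciles."""
    n_chunks = len(prices) // chunk_len
    slopes, vols = [], []
    for i in range(n_chunks):
        chunk = prices[i * chunk_len:(i + 1) * chunk_len]
        # De-noise with a moving average, then fit a linear trend.
        smoothed = np.convolve(chunk, np.ones(smooth) / smooth, mode="valid")
        slope = np.polyfit(np.arange(len(smoothed)), smoothed, 1)[0]
        # Volatility: std of one-step returns within the chunk.
        vol = np.std(np.diff(chunk) / chunk[:-1])
        slopes.append(slope)
        vols.append(vol)
    slopes, vols = np.array(slopes), np.array(vols)
    trend = np.digitize(slopes, np.quantile(slopes, [1/3, 2/3]))  # 0=Bear, 1=Flat, 2=Bull
    vol = np.digitize(vols, np.quantile(vols, [1/3, 2/3]))        # 0=Stable, 1=Medium, 2=Volatile
    return trend, vol

# Example: label 10 chunks of a random-walk price series.
rng = np.random.default_rng(0)
prices = 1000 + np.cumsum(rng.normal(size=3600))
trend_label, vol_label = label_chunks(prices)
```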
2.2 Conditional-Adapter DDQN Architecture
Each sub-agent adopts a dueling DDQN backbone with a conditional adapter that injects a context embedding into the policy computation. At time $t$:
- Encode the static features $x_t$ and context features $m_t$ with neural encoders $f_x$ and $f_m$.
- Aggregate context and position into a conditioning vector $e_t = [f_m(m_t);\, P_t]$.
- Compute scale and shift factors $\gamma_t = g_\gamma(e_t)$ and $\beta_t = g_\beta(e_t)$.
- After layer normalization, modulate the hidden state: $\tilde{h}_t = \gamma_t \odot \mathrm{LN}(h_t) + \beta_t$.
- Output dueling Q-values: $Q(s_t, a) = V(\tilde{h}_t) + A(\tilde{h}_t, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(\tilde{h}_t, a')$.
The sub-agent loss combines the DDQN temporal-difference error with a KL divergence to an optimal Q-distribution:
$$\mathcal{L}_{\text{sub}} = \mathcal{L}_{\text{TD}} + \lambda\, D_{\mathrm{KL}}\!\left(\pi^{*}(\cdot \mid s_t)\,\middle\|\,\pi_{\theta}(\cdot \mid s_t)\right),$$
where $\pi^{*}$ is the policy induced by optimal Q-values computed on the training data and $\lambda$ is a regularization weight.
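The adapter's forward pass can be illustrated at the shape level with untrained random weights; this is a FiLM-style sketch of the modulation, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(h, eps=1e-5):
    return (h - h.mean()) / (h.std() + eps)

def adapter_q_values(x, context, n_actions=2, hidden=128, embed=64):
    """Dueling Q head with a conditional adapter: the context embedding
    produces per-feature scale/shift factors applied after layer norm."""
    W_x = rng.normal(size=(hidden, x.size))       # static-feature encoder
    W_c = rng.normal(size=(embed, context.size))  # context encoder
    W_gamma = rng.normal(size=(hidden, embed))    # scale head
    W_beta = rng.normal(size=(hidden, embed))     # shift head
    W_v = rng.normal(size=(1, hidden))            # state-value stream
    W_a = rng.normal(size=(n_actions, hidden))    # advantage stream

    h = np.tanh(W_x @ x)
    e = np.tanh(W_c @ context)
    gamma, beta = W_gamma @ e, W_beta @ e
    h_mod = gamma * layer_norm(h) + beta          # conditional modulation
    v, a = W_v @ h_mod, W_a @ h_mod
    return v + a - a.mean()                       # dueling aggregation

q = adapter_q_values(np.ones(16), np.ones(4))     # one Q-value per action
```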
3. Hyper-Agent Meta-Policy and Memory Mechanism
3.1 Meta-Policy via Soft Policy Mixing
The hyper-agent encodes $S_t$ and applies a conditional adapter analogous to that of the sub-agents. It produces logit scores across the six sub-agents, which are transformed into softmax weights $w_t^i$. The meta-Q value is a convex combination
$$Q_{\text{meta}}(S_t, a) = \sum_{i=1}^{6} w_t^i\, Q_i(s_t, a).$$
The hyper-agent selects $a_t = \arg\max_a Q_{\text{meta}}(S_t, a)$.
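Soft policy mixing is a few lines of linear algebra; the sub-agent Q-values and logits below are made-up numbers for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def meta_q(sub_q, logits):
    """Convex combination of sub-agent Q-vectors weighted by the
    hyper-agent's softmax scores."""
    w = softmax(logits)      # weights over the six sub-agents
    return w @ sub_q         # shape: (n_actions,)

sub_q = np.array([[1.0, 0.0],   # six sub-agents, two actions each
                  [0.5, 0.5],
                  [0.0, 1.0],
                  [0.2, 0.8],
                  [0.9, 0.1],
                  [0.4, 0.6]])
logits = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # favor sub-agent 2
q = meta_q(sub_q, logits)
action = int(np.argmax(q))  # greedy action under the mixed policy
```

Because the weights sum to one, the meta-Q stays inside the range spanned by the sub-agents' Q-values, which avoids the abrupt policy switches of hard selection.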
3.2 Episodic Memory Module
The memory module consists of a table $\mathcal{M} = \{(k_j, v_j)\}$, where:
- Keys $k_j$ are state embeddings generated by the hyper-agent encoder.
- Values $v_j$ are the corresponding one-step TD targets.
- Queries embed the current state as $e_t$, retrieve the top-$K$ keys by L2 similarity, filter the retrieved experiences by a relevance criterion, and aggregate them using attention.
The memory-based Q-value is
$$Q_{\text{mem}}(S_t, a_t) = \sum_{j=1}^{K} \alpha_j\, v_j, \qquad \alpha_j = \operatorname{softmax}_j\!\left(-\lVert k_j - e_t \rVert_2\right).$$
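Retrieval and attention aggregation can be sketched as follows (a minimal version that skips the relevance filtering; embeddings and targets are random placeholders):

```python
import numpy as np

def memory_q(query, keys, values, k=3):
    """Retrieve the k nearest stored keys by L2 distance and aggregate
    their TD-target values with softmax attention over negative distance."""
    d = np.linalg.norm(keys - query, axis=1)
    idx = np.argsort(d)[:k]          # indices of the k closest keys
    att = np.exp(-d[idx])
    att = att / att.sum()            # attention weights sum to one
    return float(att @ values[idx])

rng = np.random.default_rng(1)
keys = rng.normal(size=(100, 8))     # stored state embeddings
values = rng.normal(size=100)        # stored TD targets
q_mem = memory_q(keys[0], keys, values)
```

Querying with a stored key recovers (approximately) its own target, which is exactly the consistency the memory regularizer enforces on similar states.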
3.3 Hyper-Agent Loss
The hyper-agent's loss incorporates (i) the TD error, (ii) a KL divergence to the optimal Q-distribution, and (iii) memory regularization:
$$\mathcal{L}_{\text{hyper}} = \mathcal{L}_{\text{TD}} + \lambda_1\, D_{\mathrm{KL}}\!\left(\pi^{*}\,\middle\|\,\pi_{\theta}\right) + \lambda_2 \left(Q_{\text{meta}}(S_t, a_t) - Q_{\text{mem}}(S_t, a_t)\right)^2.$$
4. Training Pipeline
MacroHFT uses a two-phase curriculum:
- Sub-agent training:
- Each of the 6 sub-agents is trained for 15 epochs on its respective regime-specific subset using the Adam optimizer, with embedding dimension 64 and network width 128.
- Chunk lengths and regularization weights are selected by validation.
- The optimal checkpoint for each regime is retained.
- Hyper-agent training:
- Training for 15 epochs on the full dataset, embedding dimension 32, network width 128, same optimizer.
- Memory capacity is set large enough to retain the latest experiences; the regularization weights are adjusted by validation.
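The two-phase curriculum can be outlined as a skeleton; the function bodies are placeholders, not the paper's training code:

```python
def train_sub_agent(subset, epochs=15):
    """Placeholder: would run conditional-adapter DDQN training on one
    regime subset and return the best validation checkpoint."""
    return {"subset": subset, "epochs": epochs}

def train_hyper_agent(dataset, sub_agents, epochs=15):
    """Placeholder: would train the mixing hyper-agent with the frozen
    sub-agents and the episodic memory on the full dataset."""
    return {"n_sub": len(sub_agents), "epochs": epochs}

def train_two_phase(subsets, full_dataset, epochs=15):
    """MacroHFT's curriculum: phase 1 trains one specialist per regime
    subset; phase 2 trains the hyper-agent over the full data."""
    sub_agents = [train_sub_agent(s, epochs=epochs) for s in subsets]
    hyper = train_hyper_agent(full_dataset, sub_agents, epochs=epochs)
    return hyper, sub_agents

hyper, subs = train_two_phase([f"regime_{i}" for i in range(6)], "full_data")
```

Freezing the sub-agents before phase 2 keeps the hyper-agent's learning problem stationary: it only learns how to mix, not how to trade from scratch.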
5. Empirical Evaluation
5.1 Datasets and Feature Construction
MacroHFT is evaluated on four cryptocurrency markets: BTC/USDT, ETH/USDT, DOT/USDT, LTC/USDT, using minute-level price and 5-level order book data. Feature engineering includes raw LOB and OHLCV snapshots, low-level context computed over a short lookback window, high-level context (trend and volatility) over a longer global window, and technical indicators as detailed in the paper appendix.
5.2 Baselines and Metrics
Comparison baselines include value-based (DQN, DDQN, CDQNRP), policy-based (PPO, CLSTM-PPO), hierarchical (EarnHFT), and rule-based (IV, MACD) approaches. Evaluation uses:
- Total return (TR): $\mathrm{TR} = (V_T - V_0)/V_0$.
- Annualized volatility (AVOL): $\mathrm{AVOL} = \sigma(r)\,\sqrt{m}$, with $r$ the series of per-minute returns.
- Maximum drawdown (MDD): $\mathrm{MDD} = \max_{t}\,\max_{t' \le t} \frac{V_{t'} - V_t}{V_{t'}}$; Annualized Sharpe Ratio (ASR): $\mathrm{ASR} = \frac{\bar{r}\, m}{\mathrm{AVOL}}$.
- Annualized Calmar Ratio (ACR, annualized return over MDD) and Annualized Sortino Ratio (ASoR, using downside deviation in place of $\sigma(r)$); here $m = 525{,}600$ minutes/year.
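The return-based metrics above can be computed directly from a net-account-value series; this helper follows the definitions given here, assuming per-minute values:

```python
import numpy as np

MINUTES_PER_YEAR = 525_600  # 365 * 24 * 60

def evaluate(values):
    """Compute TR, AVOL, ASR, and MDD from a net-account-value series."""
    values = np.asarray(values, dtype=float)
    returns = np.diff(values) / values[:-1]          # per-minute returns
    tr = values[-1] / values[0] - 1.0
    avol = returns.std() * np.sqrt(MINUTES_PER_YEAR)
    asr = returns.mean() * MINUTES_PER_YEAR / avol if avol > 0 else np.nan
    peak = np.maximum.accumulate(values)             # running maximum
    mdd = ((peak - values) / peak).max()
    return {"TR": tr, "AVOL": avol, "ASR": asr, "MDD": mdd}

m = evaluate([100, 105, 102, 110, 108])
```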
5.3 Test Performance
On the test set, MacroHFT achieves:

| Ticker   | TR (%) | ASR  | MDD (%) |
|----------|--------|------|---------|
| BTC/USDT | 3.03   | 0.61 | 5.41    |
| ETH/USDT | 39.28  | 3.89 | 9.67    |
| DOT/USDT | 13.79  | 0.97 | 15.89   |
| LTC/USDT | 18.16  | 1.50 | 14.24   |
MacroHFT leads in profit and most risk-adjusted metrics across all assets.
6. Architectural Insights, Limitations, and Prospects
MacroHFT's context-aware conditional adapter lets sub-agents avoid overfitting to static features, instead adapting rapidly to regime changes signaled by the low- and high-level context features $m_t$ and $M_t$. Soft mixing yields more balanced policies than hard selection, handling bear, bull, and volatile regimes concurrently. The episodic memory module enforces consistency of Q-estimates among similar states and supports prompt adaptation to atypical events by leveraging trajectory-attached targets.
Ablation results indicate removing either the conditional adapter or the memory mechanism reduces returns by 20–50%, with particularly adverse effects on drawdown in stressed markets (e.g., DOT, LTC).
Current limitations include the restriction to a single long position per trade, the lack of explicit latency/slippage modeling beyond a static transaction fee $\delta$, and sensitivity to hyperparameters such as the chunk length, the memory retrieval size, and the regularization weights. Practical deployment would benefit from incorporating slippage-aware execution, multi-asset extensions, and latency-resilient order handling.
In summary, MacroHFT's two-phase pipeline—specialized regime-adaptive sub-agents augmented by memory-regularized policy mixing—establishes new performance benchmarks in minute-level cryptocurrency HFT tasks, outperforming both standard and hierarchical RL, as well as traditional rule-based approaches (Zong et al., 2024).