MacroHFT: Memory-Augmented Crypto HFT

Updated 10 February 2026
  • MacroHFT is a reinforcement learning framework that uses memory modules and context-aware sub-agents to dynamically adapt to non-stationary cryptocurrency markets.
  • It decomposes market data into regime-specific segments, training specialized sub-agents whose policies are soft-mixed by a hyper-agent for robust decision-making.
  • Empirical results demonstrate significant enhancements in profitability and risk-adjusted metrics across major crypto pairs compared to traditional HFT methods.

MacroHFT is a memory-augmented, context-aware reinforcement learning (RL) framework for high-frequency trading (HFT) on cryptocurrency markets, designed to address conventional RL vulnerabilities such as overfitting and limited adaptability to non-stationary financial conditions. The system combines multiple specialized sub-agents—each conditioned on distinct market regimes of trend and volatility—with a hyper-agent via soft policy mixing and an episodic memory module, yielding a meta-policy capable of robust decision-making under rapid market fluctuations and rare events. Extensive empirical evaluation demonstrates state-of-the-art performance in minute-level trading across major cryptocurrency pairs, with strong improvements in both profitability and risk-adjusted metrics (Zong et al., 2024).

1. Problem Formulation in MacroHFT

MacroHFT targets discrete-time, minute-level HFT in high-dimensional financial microstructures. At each time step $t$, the system observes an $M$-level limit order book

$$b_t = \{ (p_t^{b_i}, q_t^{b_i}), (p_t^{a_i}, q_t^{a_i}) \}_{i=1}^{M}$$

where $p^{b_i}, q^{b_i}$ and $p^{a_i}, q^{a_i}$ are the $i$-th best bid/ask prices and sizes. The state further comprises an OHLCV snapshot $x_t = (p_t^o, p_t^h, p_t^l, p_t^c, v_t)$ and technical indicators $y_t = \phi(x_{t-h+1:t}, b_{t-h+1:t})$ computed over a lookback window of $h$ minutes.

The agent's position at each step is $P_t \in \{0, 1\}$, denoting holding either no coin or one unit of the asset. The net account value is $V_t = V_{ct} + P_t \cdot p_t^c$, where $V_{ct}$ is cash.

MacroHFT formulates the trading process as a two-level Markov Decision Process (MDP):

  • Low-level MDP (sub-agents): states $s_{lt} = (s_{lt}^1, s_{lt}^2, P_t)$, where $s_{lt}^1$ are static features and $s_{lt}^2$ are context features (historical sequences).
  • High-level MDP (hyper-agent): states $s_{ht} = (s_{ht}^1, s_{ht}^2, P_t)$, with $s_{ht}^1$ including all information from $s_{lt}$ and $s_{ht}^2$ aggregating context (trend slope, volatility) over a global window $h_c$.

Actions $a_{lt}, a_{ht} \in \{0, 1\}$ represent target positions; the reward for both levels is

$$r_{lt} = \left[ a_{lt}(p_{t+1}^c - p_t^c) - \delta\,|a_{lt} - P_t| \right] \cdot m$$

with $\delta$ the transaction cost and $m$ the trade size.
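The reward above can be computed directly from consecutive close prices. A minimal sketch (the function name and the default values of $\delta$ and $m$ are illustrative, not from the paper):

```python
def sub_agent_reward(action, position, close_t, close_next, delta=0.0002, m=1.0):
    """r_lt = [a*(p_{t+1}^c - p_t^c) - delta*|a - P_t|] * m.

    action:    target position a_lt in {0, 1}
    position:  current position P_t in {0, 1}
    delta:     per-unit transaction cost (illustrative value)
    m:         trade size
    """
    pnl = action * (close_next - close_t)   # mark-to-market gain of the target position
    cost = delta * abs(action - position)   # fee incurred only when the position changes
    return (pnl - cost) * m
```

Note that the cost term vanishes when the target position equals the current one, so holding incurs no fee.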

2. Sub-Agent Specialization and Conditional Context Adaptation

2.1 Market Regime Decomposition

The input time series is segmented into chunks of length $l_\text{chunk}$ (e.g., 360 or 4320 minutes). For each chunk:

  • Compute (i) de-noised slope (trend label); (ii) average realized volatility (volatility label).
  • Partition the data into three slope quantiles (Bear, Flat, Bull) and three volatility quantiles (Stable, Medium, Volatile), producing six disjoint subsets.
  • Train a dedicated sub-agent for each subset, selecting the best epoch according to validation performance.
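The labeling step above can be sketched as follows, assuming a least-squares slope as the trend proxy and the standard deviation of one-step returns as the volatility proxy (helper names are hypothetical; the paper's exact de-noising is not reproduced here):

```python
import statistics

def chunk_features(prices):
    """Slope of a least-squares line fit (trend) and std of 1-step returns (volatility)."""
    n = len(prices)
    x_mean = (n - 1) / 2
    p_mean = sum(prices) / n
    slope = sum((x - x_mean) * (p - p_mean) for x, p in enumerate(prices)) / \
            sum((x - x_mean) ** 2 for x in range(n))
    rets = [(b - a) / a for a, b in zip(prices, prices[1:])]
    return slope, statistics.pstdev(rets)

def tercile_label(value, sorted_values):
    """0 / 1 / 2 depending on which tercile of the population `value` falls in
    (e.g., Bear/Flat/Bull for slope, Stable/Medium/Volatile for volatility)."""
    n = len(sorted_values)
    lo, hi = sorted_values[n // 3], sorted_values[2 * n // 3]
    return 0 if value <= lo else (1 if value <= hi else 2)
```

Each chunk receives one trend label and one volatility label; the three trend subsets and three volatility subsets give the six regime-specific training sets.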

2.2 Conditional-Adapter DDQN Architecture

Each sub-agent adopts a dueling DDQN backbone with a conditional adapter that injects a context embedding into the policy computation. At time $t$:

  • Encode $s_{lt}^1, s_{lt}^2, P_t$ using neural encoders $\psi_1, \psi_2, \psi_3$.
  • Aggregate context and position: $c = \psi_2(s_{lt}^2) + \psi_3(P_t)$.
  • Compute scale and shift factors $(\beta, \gamma) = \psi_c(c)$.
  • After layer normalization: $h = \text{LayerNorm}(h_s) \odot \beta + \gamma$.
  • Output dueling Q-values:

$$Q^{\text{sub}}(h,a) = V(h) + \left( A(h,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(h,a') \right)$$
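The adapter's scale-and-shift step and the dueling aggregation reduce to a few lines of vector arithmetic. A minimal sketch on plain Python lists (the neural encoders themselves are omitted; only the deterministic algebra is shown):

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize a hidden vector to zero mean and unit variance."""
    mu = sum(h) / len(h)
    var = sum((x - mu) ** 2 for x in h) / len(h)
    return [(x - mu) / math.sqrt(var + eps) for x in h]

def adapted_hidden(h_s, beta, gamma):
    """h = LayerNorm(h_s) ⊙ beta + gamma: elementwise scale-and-shift from the context."""
    return [n * b + g for n, b, g in zip(layer_norm(h_s), beta, gamma)]

def dueling_q(value, advantages):
    """Q(h, a) = V(h) + (A(h, a) - mean over a' of A(h, a'))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]
```

Subtracting the mean advantage keeps $V$ and $A$ identifiable: only relative advantages between actions affect the policy.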

The sub-agent loss combines the DDQN temporal-difference error with a KL divergence to an optimal Q-distribution:

$$L_\text{sub} = \left( r_t + \gamma Q^{\text{sub}}_t\big(h', \arg\max_{a'} Q^{\text{sub}}(h',a')\big) - Q^{\text{sub}}(h,a) \right)^2 + \alpha_l\, \text{KL}\big(Q^{\text{sub}}(h, \cdot) \,\|\, Q^*(h, \cdot)\big)$$

3. Hyper-Agent Meta-Policy and Memory Mechanism

3.1 Meta-Policy via Soft Policy Mixing

The hyper-agent encodes $s_{ht}$ and applies a conditional adapter analogous to that of the sub-agents. It produces logit scores $u \in \mathbb{R}^N$ across the $N$ sub-agents, which are transformed into softmax weights $w_i$. The meta-Q value is a convex combination:

$$Q^{\text{hyper}}(s_{ht}, a) = \sum_{i=1}^N w_i\, Q_i^{\text{sub}}(s_{ht}, a)$$

The hyper-agent then selects $a_{ht} = \arg\max_a Q^{\text{hyper}}(s_{ht}, a)$.
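Soft policy mixing is a weighted average of the sub-agents' Q-vectors followed by a greedy argmax. A minimal sketch, assuming the logits and per-sub-agent Q-values are given:

```python
import math

def soft_mix_q(logits, sub_q_values):
    """Blend per-sub-agent Q-vectors with softmax weights, then pick the greedy action.

    logits:       hyper-agent scores u in R^N, one per sub-agent
    sub_q_values: list of N Q-vectors, each of length |A| (here |A| = 2)
    """
    m = max(logits)
    exp = [math.exp(u - m) for u in logits]          # numerically stable softmax
    total = sum(exp)
    w = [e / total for e in exp]
    n_actions = len(sub_q_values[0])
    q_hyper = [sum(w[i] * sub_q_values[i][a] for i in range(len(w)))
               for a in range(n_actions)]
    action = max(range(n_actions), key=lambda a: q_hyper[a])
    return q_hyper, action
```

Because the weights are a convex combination, $Q^{\text{hyper}}$ stays within the range spanned by the sub-agents' estimates, unlike hard selection of a single sub-agent.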

3.2 Episodic Memory Module

The memory module consists of a table $M = \{ (k_i, (s_i, a_i), v_i) \}$, where:

  • Keys $k_i = \psi_{\text{enc}}(s_i) \in \mathbb{R}^K$ are generated by the hyper-agent encoder.
  • Values $v_i = r_i + \gamma \max_{a'} Q^{\text{hyper}}(s'_i, a')$ are the TD targets.
  • Queries use the current state $s$ to find the top-$m$ keys by L2 similarity, filter experiences where $a_i = a$, and aggregate them with attention.

The memory-based Q is:

$$Q_M(s,a) = \sum_{i=1}^m w_i v_i$$
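The retrieval step can be sketched as follows, using negative L2 distance as the attention score (the exact similarity kernel and temperature are assumptions, not taken from the paper):

```python
import math

def memory_q(query_key, memory, action, top_m=3, temp=1.0):
    """Q_M(s, a): take the top-m nearest stored keys (L2), keep entries whose
    action matches, then attention-weight their stored TD targets.

    memory: list of (key, action, value) tuples; the layout here is illustrative.
    """
    dist = lambda k: math.sqrt(sum((a - b) ** 2 for a, b in zip(query_key, k)))
    nearest = sorted(memory, key=lambda e: dist(e[0]))[:top_m]
    matched = [(dist(k), v) for k, a, v in nearest if a == action]
    if not matched:
        return None                                  # no comparable experience stored
    logits = [-d / temp for d, _ in matched]         # closer keys get larger weights
    m = max(logits)
    exp = [math.exp(l - m) for l in logits]
    total = sum(exp)
    w = [e / total for e in exp]
    return sum(wi * v for wi, (_, v) in zip(w, matched))
```

States nearly identical to a stored experience thus receive a memory target close to that experience's TD target, which is what the regularization term below exploits.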

3.3 Hyper-Agent Loss

The hyper-agent's loss incorporates (i) TD error, (ii) KL divergence to optimal Q-distribution, and (iii) memory regularization:

$$L_\text{hyper} = \left( r_t + \gamma Q^{\text{hyper}}_t\big(s', \arg\max_{a'} Q^{\text{hyper}}(s', a')\big) - Q^{\text{hyper}}(s,a) \right)^2 + \alpha_h\, \text{KL}\big(Q^{\text{hyper}}(s, \cdot) \,\|\, Q^*(s, \cdot)\big) + \beta \left( Q^{\text{hyper}}(s, a) - Q_M(s, a) \right)^2$$
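Numerically, the three terms combine as a plain sum. A minimal sketch with scalar inputs (turning Q-vectors into distributions via softmax is an assumption about how the KL term is instantiated):

```python
import math

def softmax(q):
    m = max(q)
    e = [math.exp(x - m) for x in q]
    s = sum(e)
    return [x / s for x in e]

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hyper_loss(q_sa, td_target, q_dist, q_star_dist, q_mem, alpha_h=1.0, beta=1.0):
    """TD error + KL to the optimal Q-distribution + memory-consistency term."""
    td = (td_target - q_sa) ** 2
    reg = alpha_h * kl(softmax(q_dist), softmax(q_star_dist))
    mem = beta * (q_sa - q_mem) ** 2 if q_mem is not None else 0.0
    return td + reg + mem
```

When no matching experience exists in memory, the third term is simply dropped for that transition.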

4. Training Pipeline

MacroHFT uses a two-phase curriculum:

  1. Sub-agent training:
    • Each of the six sub-agents is trained for 15 epochs on its regime-specific subset using Adam (learning rate $1 \times 10^{-4}$), with embedding dimension 64 and network width 128.
    • Chunk length $l_\text{chunk} \in \{360, 4320\}$ and regularization weight $\alpha_l \in \{0, 1, 4\}$ are selected by validation.
    • The optimal checkpoint for each regime is retained.
  2. Hyper-agent training:
    • Training for 15 epochs on the full dataset, embedding dimension 32, network width 128, same optimizer.
    • Memory capacity is set large enough to retain the latest experiences; the regularization weight $\beta \in \{1, 5\}$ is tuned by validation.

5. Empirical Evaluation

5.1 Datasets and Feature Construction

MacroHFT is evaluated on four cryptocurrency markets: BTC/USDT, ETH/USDT, DOT/USDT, and LTC/USDT, using minute-level price and 5-level order book data. Feature engineering includes raw LOB and OHLCV snapshots, low-level context (window $h = 60$ minutes), high-level context (trend and volatility), and technical indicators $\phi$ as detailed in the paper appendix.

5.2 Baselines and Metrics

Comparison baselines include value-based (DQN, DDQN, CDQNRP), policy-based (PPO, CLSTM-PPO), hierarchical (EarnHFT), and rule-based (IV, MACD) approaches. Evaluation uses:

  • Total return (TR): $\frac{V_T - V_1}{V_1}$
  • Annualized volatility (AVOL): $\sigma[r]\sqrt{m}$
  • Maximum drawdown (MDD)
  • Annualized Sharpe ratio (ASR): $\frac{\mathbb{E}[r]}{\sigma[r]}\sqrt{m}$
  • Annualized Calmar ratio (ACR) and annualized Sortino ratio (ASoR)

where $m = 525600$ is the number of minutes per year.
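The core metrics follow directly from a minute-level account-value curve. A minimal sketch (the function name is illustrative; AVOL, ACR, and ASoR follow the same pattern):

```python
import math

MINUTES_PER_YEAR = 525_600

def evaluate(values):
    """Total return, annualized Sharpe ratio, and maximum drawdown
    from a minute-level account-value series."""
    rets = [(b - a) / a for a, b in zip(values, values[1:])]
    tr = (values[-1] - values[0]) / values[0]
    mean = sum(rets) / len(rets)
    std = math.sqrt(sum((r - mean) ** 2 for r in rets) / len(rets))
    asr = (mean / std) * math.sqrt(MINUTES_PER_YEAR) if std > 0 else float("inf")
    peak, mdd = values[0], 0.0
    for v in values:                     # running peak → largest peak-to-trough drop
        peak = max(peak, v)
        mdd = max(mdd, (peak - v) / peak)
    return tr, asr, mdd
```

The $\sqrt{m}$ factor annualizes the per-minute Sharpe ratio under the usual i.i.d.-returns assumption.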

5.3 Test Performance

On the test set:

| Ticker   | TR (%) | ASR  | MDD (%) |
|----------|--------|------|---------|
| BTC/USDT | 3.03   | 0.61 | 5.41    |
| ETH/USDT | 39.28  | 3.89 | 9.67    |
| DOT/USDT | 13.79  | 0.97 | 15.89   |
| LTC/USDT | 18.16  | 1.50 | 14.24   |

MacroHFT leads in profit and most risk-adjusted metrics across all assets.

6. Architectural Insights, Limitations, and Prospects

MacroHFT's context-aware conditional adapter lets sub-agents avoid overfitting to static features, instead adapting rapidly to regime changes driven by $s_{lt}^2$ and $P_t$. Soft mixing yields more balanced policies than hard selection, handling bear/bull/volatile trends concurrently. The episodic memory module enforces Q-estimate consistency among similar states and supports prompt adaptation to atypical events by leveraging trajectory-attached targets.

Ablation results indicate removing either the conditional adapter or the memory mechanism reduces returns by 20–50%, with particularly adverse effects on drawdown in stressed markets (e.g., DOT, LTC).

Current limitations include the restriction to a single long position per trade, the lack of explicit latency/slippage modeling beyond a static transaction fee $\delta$, and sensitivity to hyperparameters ($l_\text{chunk}$, $\alpha_l$, $\beta$). Practical deployment would benefit from slippage-aware execution, multi-asset extensions, and latency-resilient order handling.

In summary, MacroHFT's two-phase pipeline—specialized regime-adaptive sub-agents augmented by memory-regularized policy mixing—establishes new performance benchmarks in minute-level cryptocurrency HFT tasks, outperforming both standard and hierarchical RL, as well as traditional rule-based approaches (Zong et al., 2024).
