MacroHFT: Memory-Augmented Crypto HFT

Updated 10 February 2026
  • MacroHFT is a reinforcement learning framework that uses memory modules and context-aware sub-agents to dynamically adapt to non-stationary cryptocurrency markets.
  • It decomposes market data into regime-specific segments, training specialized sub-agents whose policies are soft-mixed by a hyper-agent for robust decision-making.
  • Empirical results demonstrate significant enhancements in profitability and risk-adjusted metrics across major crypto pairs compared to traditional HFT methods.

MacroHFT is a memory-augmented, context-aware reinforcement learning (RL) framework for high-frequency trading (HFT) on cryptocurrency markets, designed to address conventional RL vulnerabilities such as overfitting and limited adaptability to non-stationary financial conditions. The system combines multiple specialized sub-agents—each conditioned on distinct market regimes of trend and volatility—with a hyper-agent via soft policy mixing and an episodic memory module, yielding a meta-policy capable of robust decision-making under rapid market fluctuations and rare events. Extensive empirical evaluation demonstrates state-of-the-art performance in minute-level trading across major cryptocurrency pairs, with strong improvements in both profitability and risk-adjusted metrics (Zong et al., 2024).

1. Problem Formulation in MacroHFT

MacroHFT targets discrete-time, minute-level HFT in high-dimensional financial microstructures. At each time step $t$, the system observes an $M$-level limit order book

$$b_t = \{ (p_t^{b_i}, q_t^{b_i}), (p_t^{a_i}, q_t^{a_i}) \}_{i=1}^{M}$$

where $p^{b_i}, q^{b_i}$ and $p^{a_i}, q^{a_i}$ are the $i$-th best bid/ask prices and sizes. The state further comprises an OHLCV snapshot $x_t = (p_t^o, p_t^h, p_t^l, p_t^c, v_t)$ and technical indicators $y_t = \phi(x_{t-h+1:t}, b_{t-h+1:t})$ computed over a lookback window of $h$ minutes.

The agent's position at each step is $P_t \in \{0, 1\}$, denoting holding either no coin or one unit of the asset. The net account value is $V_t = V_{ct} + P_t \cdot p_t^c$, where $V_{ct}$ is cash.

MacroHFT formulates the trading process as a two-level Markov Decision Process (MDP):

  • Low-level MDP (sub-agents): states $s_{lt} = (s_{lt}^1, s_{lt}^2, P_t)$, where $s_{lt}^1$ are static features and $s_{lt}^2$ are context features (historical sequences).
  • High-level MDP (hyper-agent): states $s_{ht} = (s_{ht}^1, s_{ht}^2, P_t)$, with $s_{ht}^1$ including all information from $s_{lt}$ and $s_{ht}^2$ aggregating context (trend slope, volatility) over a global window $h_c$.

Actions $a_{lt}, a_{ht} \in \{0, 1\}$ represent target positions; the reward for both levels is

$$r_{lt} = \left[ a_{lt}(p_{t+1}^c - p_t^c) - \delta\,|a_{lt} - P_t| \right] \cdot m$$

with $\delta$ the transaction cost and $m$ the trade size.
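The reward above can be computed directly from consecutive close prices. A minimal sketch (the function name and the default values of $\delta$ and $m$ are illustrative, not from the paper):

```python
def sub_agent_reward(action, position, close_t, close_next, delta=0.0002, m=1.0):
    """r_lt = [a*(p_{t+1}^c - p_t^c) - delta*|a - P_t|] * m.

    action:    target position a_lt in {0, 1}
    position:  current position P_t in {0, 1}
    delta:     per-unit transaction cost (illustrative value)
    m:         trade size
    """
    pnl = action * (close_next - close_t)   # mark-to-market gain of the target position
    cost = delta * abs(action - position)   # fee incurred only when the position changes
    return (pnl - cost) * m
```

Note that the cost term vanishes when the target position equals the current one, so holding incurs no fee.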

2. Sub-Agent Specialization and Conditional Context Adaptation

2.1 Market Regime Decomposition

The input time series is segmented into chunks of length $l_\text{chunk}$ (e.g., 360 or 4320 minutes). For each chunk:

  • Compute (i) de-noised slope (trend label); (ii) average realized volatility (volatility label).
  • Partition the data into three slope quantiles (Bear, Flat, Bull) and three volatility quantiles (Stable, Medium, Volatile), producing six disjoint subsets.
  • Train a dedicated sub-agent for each subset, selecting the best epoch according to validation performance.
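The labeling step above can be sketched as follows, assuming a least-squares slope as the trend proxy and the standard deviation of one-step returns as the volatility proxy (helper names are hypothetical; the paper's exact de-noising is not reproduced here):

```python
import statistics

def chunk_features(prices):
    """Slope of a least-squares line fit (trend) and std of 1-step returns (volatility)."""
    n = len(prices)
    x_mean = (n - 1) / 2
    p_mean = sum(prices) / n
    slope = sum((x - x_mean) * (p - p_mean) for x, p in enumerate(prices)) / \
            sum((x - x_mean) ** 2 for x in range(n))
    rets = [(b - a) / a for a, b in zip(prices, prices[1:])]
    return slope, statistics.pstdev(rets)

def tercile_label(value, sorted_values):
    """0 / 1 / 2 depending on which tercile of the population `value` falls in
    (e.g., Bear/Flat/Bull for slope, Stable/Medium/Volatile for volatility)."""
    n = len(sorted_values)
    lo, hi = sorted_values[n // 3], sorted_values[2 * n // 3]
    return 0 if value <= lo else (1 if value <= hi else 2)
```

Each chunk receives one trend label and one volatility label; the three trend subsets and three volatility subsets give the six regime-specific training sets.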

2.2 Conditional-Adapter DDQN Architecture

Each sub-agent adopts a dueling DDQN backbone with a conditional adapter that injects a context embedding into the policy computation. At time $t$:

  • Encode $s_{lt}^1, s_{lt}^2, P_t$ using neural encoders $\psi_1, \psi_2, \psi_3$.
  • Aggregate context and position: $c = \psi_2(s_{lt}^2) + \psi_3(P_t)$.
  • Compute scale and shift factors $(\beta, \gamma) = \psi_c(c)$.
  • After layer normalization: $h = \text{LayerNorm}(h_s) \odot \beta + \gamma$.
  • Output dueling Q-values:

$$Q^{\text{sub}}(h,a) = V(h) + \left( A(h,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(h,a') \right)$$
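The adapter's scale-and-shift step and the dueling aggregation reduce to a few lines of vector arithmetic. A minimal sketch on plain Python lists (the neural encoders themselves are omitted; only the deterministic algebra is shown):

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize a hidden vector to zero mean and unit variance."""
    mu = sum(h) / len(h)
    var = sum((x - mu) ** 2 for x in h) / len(h)
    return [(x - mu) / math.sqrt(var + eps) for x in h]

def adapted_hidden(h_s, beta, gamma):
    """h = LayerNorm(h_s) ⊙ beta + gamma: elementwise scale-and-shift from the context."""
    return [n * b + g for n, b, g in zip(layer_norm(h_s), beta, gamma)]

def dueling_q(value, advantages):
    """Q(h, a) = V(h) + (A(h, a) - mean over a' of A(h, a'))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]
```

Subtracting the mean advantage keeps $V$ and $A$ identifiable: only relative advantages between actions affect the policy.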

The sub-agent loss combines the DDQN temporal-difference error with a KL divergence to an optimal Q-distribution:

$$L_\text{sub} = \left( r_t + \gamma Q^{\text{sub}}_t\big(h', \arg\max_{a'} Q^{\text{sub}}(h',a')\big) - Q^{\text{sub}}(h,a) \right)^2 + \alpha_l\, \text{KL}\big(Q^{\text{sub}}(h, \cdot) \,\|\, Q^*(h, \cdot)\big)$$

3. Hyper-Agent Meta-Policy and Memory Mechanism

3.1 Meta-Policy via Soft Policy Mixing

The hyper-agent encodes $s_{ht}$ and applies a conditional adapter analogous to that of the sub-agents. It produces logit scores $u \in \mathbb{R}^N$ across the $N$ sub-agents, which are transformed into softmax weights $w_i$. The meta-Q value is a convex combination:

$$Q^{\text{hyper}}(s_{ht}, a) = \sum_{i=1}^N w_i\, Q_i^{\text{sub}}(s_{ht}, a)$$

The hyper-agent then selects $a_{ht} = \arg\max_a Q^{\text{hyper}}(s_{ht}, a)$.
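Soft policy mixing is a weighted average of the sub-agents' Q-vectors followed by a greedy argmax. A minimal sketch, assuming the logits and per-sub-agent Q-values are given:

```python
import math

def soft_mix_q(logits, sub_q_values):
    """Blend per-sub-agent Q-vectors with softmax weights, then pick the greedy action.

    logits:       hyper-agent scores u in R^N, one per sub-agent
    sub_q_values: list of N Q-vectors, each of length |A| (here |A| = 2)
    """
    m = max(logits)
    exp = [math.exp(u - m) for u in logits]          # numerically stable softmax
    total = sum(exp)
    w = [e / total for e in exp]
    n_actions = len(sub_q_values[0])
    q_hyper = [sum(w[i] * sub_q_values[i][a] for i in range(len(w)))
               for a in range(n_actions)]
    action = max(range(n_actions), key=lambda a: q_hyper[a])
    return q_hyper, action
```

Because the weights are a convex combination, $Q^{\text{hyper}}$ stays within the range spanned by the sub-agents' estimates, unlike hard selection of a single sub-agent.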

3.2 Episodic Memory Module

The memory module consists of a table $M = \{ (k_i, (s_i, a_i), v_i) \}$, where:

  • Keys $k_i = \psi_{\text{enc}}(s_i) \in \mathbb{R}^K$ are generated by the hyper-agent encoder.
  • Values $v_i = r_i + \gamma \max_{a'} Q^{\text{hyper}}(s'_i, a')$ are the TD targets.
  • Queries use the current state $s$ to find the top-$m$ keys by L2 similarity, filter experiences where $a_i = a$, and aggregate them with attention.

The memory-based Q is:

$$Q_M(s,a) = \sum_{i=1}^m w_i v_i$$
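The retrieval step can be sketched as follows, using negative L2 distance as the attention score (the exact similarity kernel and temperature are assumptions, not taken from the paper):

```python
import math

def memory_q(query_key, memory, action, top_m=3, temp=1.0):
    """Q_M(s, a): take the top-m nearest stored keys (L2), keep entries whose
    action matches, then attention-weight their stored TD targets.

    memory: list of (key, action, value) tuples; the layout here is illustrative.
    """
    dist = lambda k: math.sqrt(sum((a - b) ** 2 for a, b in zip(query_key, k)))
    nearest = sorted(memory, key=lambda e: dist(e[0]))[:top_m]
    matched = [(dist(k), v) for k, a, v in nearest if a == action]
    if not matched:
        return None                                  # no comparable experience stored
    logits = [-d / temp for d, _ in matched]         # closer keys get larger weights
    m = max(logits)
    exp = [math.exp(l - m) for l in logits]
    total = sum(exp)
    w = [e / total for e in exp]
    return sum(wi * v for wi, (_, v) in zip(w, matched))
```

States nearly identical to a stored experience thus receive a memory target close to that experience's TD target, which is what the regularization term below exploits.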

3.3 Hyper-Agent Loss

The hyper-agent's loss incorporates (i) TD error, (ii) KL divergence to optimal Q-distribution, and (iii) memory regularization:

$$L_\text{hyper} = \left( r_t + \gamma Q^{\text{hyper}}_t\big(s', \arg\max_{a'} Q^{\text{hyper}}(s', a')\big) - Q^{\text{hyper}}(s,a) \right)^2 + \alpha_h\, \text{KL}\big(Q^{\text{hyper}}(s, \cdot) \,\|\, Q^*(s, \cdot)\big) + \beta \left( Q^{\text{hyper}}(s, a) - Q_M(s, a) \right)^2$$
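Numerically, the three terms combine as a plain sum. A minimal sketch with scalar inputs (turning Q-vectors into distributions via softmax is an assumption about how the KL term is instantiated):

```python
import math

def softmax(q):
    m = max(q)
    e = [math.exp(x - m) for x in q]
    s = sum(e)
    return [x / s for x in e]

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hyper_loss(q_sa, td_target, q_dist, q_star_dist, q_mem, alpha_h=1.0, beta=1.0):
    """TD error + KL to the optimal Q-distribution + memory-consistency term."""
    td = (td_target - q_sa) ** 2
    reg = alpha_h * kl(softmax(q_dist), softmax(q_star_dist))
    mem = beta * (q_sa - q_mem) ** 2 if q_mem is not None else 0.0
    return td + reg + mem
```

When no matching experience exists in memory, the third term is simply dropped for that transition.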

4. Training Pipeline

MacroHFT uses a two-phase curriculum:

  1. Sub-agent training:
    • Each of the six sub-agents is trained for 15 epochs on its regime-specific subset using Adam (learning rate $1 \times 10^{-4}$), with embedding dimension 64 and network width 128.
    • Chunk length $l_\text{chunk} \in \{360, 4320\}$ and regularization weight $\alpha_l \in \{0, 1, 4\}$ are selected by validation.
    • The optimal checkpoint for each regime is retained.
  2. Hyper-agent training:
    • Training for 15 epochs on the full dataset, embedding dimension 32, network width 128, same optimizer.
    • Memory capacity is set large enough to retain the latest experiences; the regularization weight $\beta \in \{1, 5\}$ is tuned by validation.

5. Empirical Evaluation

5.1 Datasets and Feature Construction

MacroHFT is evaluated on four cryptocurrency markets: BTC/USDT, ETH/USDT, DOT/USDT, and LTC/USDT, using minute-level price and 5-level order book data. Feature engineering includes raw LOB and OHLCV snapshots, low-level context (window $h = 60$ minutes), high-level context (trend and volatility), and technical indicators $\phi$ as detailed in the paper appendix.

5.2 Baselines and Metrics

Comparison baselines include value-based (DQN, DDQN, CDQNRP), policy-based (PPO, CLSTM-PPO), hierarchical (EarnHFT), and rule-based (IV, MACD) approaches. Evaluation uses:

  • Total return (TR): $\frac{V_T - V_1}{V_1}$
  • Annualized volatility (AVOL): $\sigma[r]\sqrt{m}$
  • Maximum drawdown (MDD)
  • Annualized Sharpe ratio (ASR): $\frac{\mathbb{E}[r]}{\sigma[r]}\sqrt{m}$
  • Annualized Calmar ratio (ACR) and annualized Sortino ratio (ASoR)

where $m = 525600$ is the number of minutes per year.
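The core metrics follow directly from a minute-level account-value curve. A minimal sketch (the function name is illustrative; AVOL, ACR, and ASoR follow the same pattern):

```python
import math

MINUTES_PER_YEAR = 525_600

def evaluate(values):
    """Total return, annualized Sharpe ratio, and maximum drawdown
    from a minute-level account-value series."""
    rets = [(b - a) / a for a, b in zip(values, values[1:])]
    tr = (values[-1] - values[0]) / values[0]
    mean = sum(rets) / len(rets)
    std = math.sqrt(sum((r - mean) ** 2 for r in rets) / len(rets))
    asr = (mean / std) * math.sqrt(MINUTES_PER_YEAR) if std > 0 else float("inf")
    peak, mdd = values[0], 0.0
    for v in values:                     # running peak → largest peak-to-trough drop
        peak = max(peak, v)
        mdd = max(mdd, (peak - v) / peak)
    return tr, asr, mdd
```

The $\sqrt{m}$ factor annualizes the per-minute Sharpe ratio under the usual i.i.d.-returns assumption.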

5.3 Test Performance

On the test set:

| Ticker   | TR (%) | ASR  | MDD (%) |
|----------|--------|------|---------|
| BTC/USDT | 3.03   | 0.61 | 5.41    |
| ETH/USDT | 39.28  | 3.89 | 9.67    |
| DOT/USDT | 13.79  | 0.97 | 15.89   |
| LTC/USDT | 18.16  | 1.50 | 14.24   |

MacroHFT leads in profit and most risk-adjusted metrics across all assets.

6. Architectural Insights, Limitations, and Prospects

MacroHFT's context-aware conditional adapter lets sub-agents avoid overfitting to static features, instead adapting rapidly to regime changes driven by $s_{lt}^2$ and $P_t$. Soft mixing yields more balanced policies than hard selection, handling bear/bull/volatile trends concurrently. The episodic memory module enforces Q-estimate consistency among similar states and supports prompt adaptation to atypical events by leveraging trajectory-attached targets.

Ablation results indicate removing either the conditional adapter or the memory mechanism reduces returns by 20–50%, with particularly adverse effects on drawdown in stressed markets (e.g., DOT, LTC).

Current limitations include the restriction to a single long position per trade, the lack of explicit latency/slippage modeling beyond a static transaction fee $\delta$, and sensitivity to hyperparameters ($l_\text{chunk}$, $\alpha_l$, $\beta$). Practical deployment would benefit from slippage-aware execution, multi-asset extensions, and latency-resilient order handling.

In summary, MacroHFT's two-phase pipeline—specialized regime-adaptive sub-agents augmented by memory-regularized policy mixing—establishes new performance benchmarks in minute-level cryptocurrency HFT tasks, outperforming both standard and hierarchical RL, as well as traditional rule-based approaches (Zong et al., 2024).
