
JaxMARL-HFT: GPU-Accelerated MARL for HFT

Updated 9 November 2025
  • JaxMARL-HFT is a GPU-accelerated framework for high-frequency trading that leverages multi-agent reinforcement learning and granular market-by-order data.
  • It employs JAX-based parallelism and just-in-time compilation to deliver substantial speed-ups and scalable agent-based modeling.
  • The framework supports heterogeneous agent types with tailored observation and action spaces, enabling efficient policy updates via independent PPO pipelines.

JaxMARL-HFT is a GPU-accelerated, open-source multi-agent reinforcement learning (MARL) framework designed for large-scale simulation and training in high-frequency trading (HFT) environments, specifically leveraging granular market-by-order (MBO) data. It extends the JaxMARL library of MARL algorithms and the JAX-LOB GPU-native limit order book (LOB) simulator, introducing a scalable infrastructure to support heterogeneous agent populations and efficient end-to-end training on real-world financial datasets. The framework directly addresses the computational bottlenecks in agent-based modeling (ABM) for HFT by implementing JAX-based parallelism and just-in-time (JIT) compilation, enabling a substantial speed-up in environment simulations and MARL updates.

1. Framework Composition and Computational Design

JaxMARL-HFT is constructed atop two foundational open-source projects: JAX-LOB (Frey et al. 2023), which provides high-throughput, GPU-native simulation of the limit order book for MBO replay, and JaxMARL (Rutherford et al. 2024), a suite of multi-agent RL algorithms implemented in JAX. The framework exposes a standard MARL environment API:

  • reset(): Initializes and returns observations for $N$ agents.
  • step(actions): Processes agent actions, returns the next state, rewards, done-flags, and info dicts.
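
The following minimal sketch shows how this API could be driven with random actions; the environment constructor, agent naming, and action-space sampling shown here are assumptions for illustration rather than the framework's documented interface.

```python
import jax

# Hypothetical environment handle; construction details are omitted and would
# follow the framework's own documentation.
# env = make_hft_env(...)

def random_rollout(env, key, num_steps=64):
    """Drive the reset()/step() API with random actions for all agents."""
    key, reset_key = jax.random.split(key)
    obs, state = env.reset(reset_key)          # observations for all N agents
    for _ in range(num_steps):
        key, sub = jax.random.split(key)
        action_keys = jax.random.split(sub, len(env.agents))
        # One discrete action per agent; learned policies would replace this sampling.
        actions = {agent: env.action_space(agent).sample(k)
                   for agent, k in zip(env.agents, action_keys)}
        obs, state, rewards, dones, infos = env.step(state, actions)
    return obs, state
```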

Each step in the environment executes the following pipeline:

  1. Converts agent actions to LOB messages (order placements/cancellations).
  2. Concatenates these with a historical stream of MBO messages.
  3. Processes the blended message stream through the JAX-LOB simulator.
  4. Extracts trade results, updates inventories and cash balances, computes observations, and calculates rewards.

All MBO data is preloaded as a contiguous GPU array. Episodes are slices starting at GPU-indexed offsets, with no padding or copying, enabling initialization by restoring the order book state at the specific message index.
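
A rough sketch of this zero-copy layout is shown below; the array shape, field count, and helper function are illustrative assumptions rather than the framework's internal representation.

```python
import jax
import jax.numpy as jnp

# Stand-in for the preloaded MBO stream: one row per message, a fixed number of fields.
# The real array holds hundreds of millions of rows and resides on the GPU.
mbo_messages = jnp.zeros((1_000_000, 8), dtype=jnp.int32)

MESSAGES_PER_STEP = 100   # MBO messages replayed per RL step (1 or 100 in the experiments)

def episode_messages(messages, episode_start, step):
    """Fetch the historical messages consumed by one RL step of one episode.

    dynamic_slice reads a fixed-size window at a runtime offset, so episodes are
    just integer offsets into the shared array, with no padding or copying.
    """
    offset = episode_start + step * MESSAGES_PER_STEP
    return jax.lax.dynamic_slice(messages, (offset, 0),
                                 (MESSAGES_PER_STEP, messages.shape[1]))

# Example: the messages for step 3 of an episode starting at message index 50_000.
chunk = episode_messages(mbo_messages, 50_000, 3)   # shape (100, 8)
```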

JaxMARL-HFT implements a two-level parallelization using JAX’s vmap: the outer axis represents thousands (e.g., 4096) of parallel episode rollouts; the inner axis batches distinct agent types within each episode. This architecture permits highly heterogeneous observation/action spaces and reward calculations per agent type without the need for data padding. The MARL step and agent-policy updates are both JIT-compiled on the GPU, eliminating CPU-GPU data transfer overhead.
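
The nested-vmap pattern can be illustrated as follows; the per-agent function and shapes are simplified placeholders meant only to show how the outer (episode) and inner (agent) batch axes compose under JIT compilation.

```python
import jax
import jax.numpy as jnp

NUM_ENVS = 4096        # parallel episode rollouts (outer vmap axis)
AGENTS_PER_TYPE = 10   # agents of one type within each episode (inner vmap axis)
OBS_DIM = 32

def agent_step(params, obs):
    """Toy per-agent computation standing in for one agent's policy evaluation."""
    return jnp.tanh(obs @ params)

# Inner vmap: apply the per-agent computation to every agent in one episode
# (shared parameters, batched observations).
per_episode = jax.vmap(agent_step, in_axes=(None, 0))

# Outer vmap: replicate across thousands of parallel episodes, then JIT-compile
# so the whole rollout step runs as one fused GPU program.
batched_step = jax.jit(jax.vmap(per_episode, in_axes=(None, 0)))

params = jnp.zeros((OBS_DIM, 4))
obs = jnp.zeros((NUM_ENVS, AGENTS_PER_TYPE, OBS_DIM))
out = batched_step(params, obs)   # shape (4096, 10, 4)
```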

Performance and Speed-up

Performance gains are quantified as follows:

| Task | Baseline (CPU) | JaxMARL-HFT (GPU) | Speed-up |
|---|---|---|---|
| Env-only, 1 msg/step, 10 agents/type | 114 steps/s | 48,312 steps/s | ~424× |
| Full MARL, 1 agent/type | ABIDES-gym (CPU) |  | 5×–35× |
| Full MARL, 5 agents/type | ABIDES-gym (CPU) |  | 95×–125× |
| Full MARL, 10 agents/type | ABIDES-gym (CPU) |  | 200×–240× |

The speed-up increases as the number of concurrent learning agents rises and as fewer MBO messages are processed per RL step, emphasizing scalability to extensive agent populations and massive datasets.

2. Supported Agent Types and Environment Configuration

JaxMARL-HFT is architected to support a broad set of agent archetypes operating concurrently, each potentially with distinct observation and action spaces and tailored reward functions. The primary supported types are:

  • Market-Making (MM) Agents: Manage inventory and provide liquidity, observing book state and recent price/volume statistics.
  • Order-Execution (EXEC) Agents: Seek to execute a defined order flow efficiently, with optional visibility into detailed queue lengths.
  • Directional Trading (DIR) Agents: Execute speculative strategies based on book features and price imbalances.

Observation spaces are typically concise, fixed-length vectors per agent type. MM and DIR agents observe features such as inventory, cash, top-of-book depths/prices, and mid-price/volume imbalances. EXEC agents may observe selected price-level queue depths. The framework also allows more elaborate observation structures, such as tokenized MBO windows, since agent-type handling is modular.
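
To make the idea of a compact, fixed-length observation concrete, the sketch below assembles a hypothetical market-making feature vector; the feature set, ordering, and normalization are illustrative and not taken from the framework.

```python
import jax.numpy as jnp

def mm_observation(inventory, cash, best_bid, best_ask, bid_vol, ask_vol, mid_history):
    """Concatenate account and top-of-book features into one fixed-length vector.

    mid_history is a short window of recent mid-prices standing in for the
    'recent price/volume statistics' mentioned above.
    """
    spread = best_ask - best_bid
    mid = 0.5 * (best_bid + best_ask)
    imbalance = (bid_vol - ask_vol) / (bid_vol + ask_vol + 1e-8)
    return jnp.concatenate([
        jnp.array([inventory, cash, best_bid, best_ask, spread, mid, imbalance]),
        mid_history,
    ])
```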

Multiple discrete action space parameterizations are implemented:

  • Market Making:
    • Spread-Skew: Select a (n_bid, n_ask) tick offset tuple from the best price, optionally skewed.
    • FixedQty: Place orders at fixed price offsets with constant volume.
    • AvSt: Choose parameters of the Avellaneda–Stoikov control policy (skew and aggression).
  • Order Execution: Choose a reference price level (e.g., best bid/ask, far touch) for a fixed-volume order.
  • Directional: Select from {do nothing, bid at best, ask at best} with fixed order size.
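
As a concrete illustration of one such parameterization, the sketch below decodes a discrete spread-skew action index into quote prices; the offset grid, tick size, and index layout are assumptions made for the example.

```python
import jax.numpy as jnp

TICK = 100                           # tick size in the book's integer price units (assumed)
OFFSETS = jnp.array([0, 1, 2, 3])    # candidate tick offsets from the touch (assumed grid)

def decode_spread_skew(action_idx, best_bid, best_ask):
    """Map one discrete action index to an (n_bid, n_ask) offset pair and quote prices.

    The index enumerates all (n_bid, n_ask) combinations; larger offsets quote
    more passively, and asymmetric pairs skew the quotes to manage inventory.
    """
    n = OFFSETS.shape[0]
    n_bid = OFFSETS[action_idx // n]
    n_ask = OFFSETS[action_idx % n]
    bid_price = best_bid - n_bid * TICK
    ask_price = best_ask + n_ask * TICK
    return n_bid, n_ask, bid_price, ask_price
```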

Reward functions are formulated per agent type. For market making, variants include spread capture, inventory revaluation, and regularization, with explicit formulas given by Spooner & Savani (2021). Execution-agent rewards penalize both negative slippage and unexecuted shares. Inventory penalties are enforced via quadratic regularizers.
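
As a rough sketch of the shape of these rewards (not the exact formulas from Spooner & Savani (2021)), a per-step market-making reward combining cash flow, inventory revaluation, and a quadratic inventory penalty could look like this:

```python
def mm_reward(cash_delta, inventory, mid_prev, mid_now, eta=0.01):
    """Per-step market-making reward with a quadratic inventory regularizer.

    cash_delta is the realized cash flow from fills this step; eta (assumed)
    controls how strongly large inventories are penalized.
    """
    revaluation = inventory * (mid_now - mid_prev)   # mark-to-market change of held inventory
    penalty = eta * inventory ** 2                   # quadratic inventory regularizer
    return cash_delta + revaluation - penalty
```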

| Agent Type | Observation Example | Action Example |
|---|---|---|
| MM | Inventory, top of book, imbalance | (n_bid, n_ask) offset |
| EXEC | Queue lengths on reference levels | Place at best ask |
| DIR | Inventory, price levels | Passive/aggressive trade |

3. Reinforcement Learning Algorithms and Policy Architectures

JaxMARL-HFT adopts independent proximal policy optimization (IPPO) as the principal RL algorithm. Each agent or agent type maintains separate policy and value networks ($\pi_\theta$, $V_\phi$). The data collection phase processes agent-trajectory rollouts in parallel, under a “centralized” simulator but with “decentralized” agent decision policies (no weight sharing across types).

  • Policy Networks: Two-layer GRU-based recurrent networks (hidden size 64) ingest time-series observations, outputting distributional policy logits and value estimates via MLP heads.
  • Loss function (per agent/type):

    • Clipped PPO policy loss:

      $$L^{\text{CLIP}}(\theta) = E_t\big[\min\big(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\big]$$

    • Value-function loss:

      $$L^{\text{VF}}(\phi) = E_t\big[(V_\phi(s_t) - R_t^{\text{boot}})^2\big]$$

    • Entropy bonus $H[\pi_\theta]$ is used for exploration regularization.

    • Full per-agent loss:

      $$L(\theta, \phi) = -L^{\text{CLIP}}(\theta) + c_1 L^{\text{VF}}(\phi) - c_2 H[\pi_\theta]$$

  • Advantage estimation (GAE):

    $$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

    $$\hat{A}_t = \sum_{l=0}^{T-1} (\gamma \lambda)^l \delta_{t+l}$$

No parameter sharing is performed across agent types, reinforcing type-level heterogeneity. Policy and value updates are executed on-GPU.
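
The loss and advantage formulas above translate fairly directly into JAX; the sketch below assumes pre-computed log-probabilities, values, rewards, and an entropy estimate for a single trajectory, and is a simplified stand-in for the framework's actual training code.

```python
import jax
import jax.numpy as jnp

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory of length T."""
    values_ext = jnp.append(values, last_value)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]   # delta_t

    def backward(carry, delta):
        adv = delta + gamma * lam * carry   # A_t = delta_t + (gamma * lam) * A_{t+1}
        return adv, adv

    _, advantages = jax.lax.scan(backward, 0.0, deltas, reverse=True)
    return advantages

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Per-agent loss: -L_CLIP + c1 * L_VF - c2 * H[pi]."""
    ratio = jnp.exp(new_logp - old_logp)                 # r_t(theta)
    clipped = jnp.clip(ratio, 1.0 - eps, 1.0 + eps)
    l_clip = jnp.mean(jnp.minimum(ratio * advantages, clipped * advantages))
    l_vf = jnp.mean((values - returns) ** 2)
    return -l_clip + c1 * l_vf - c2 * entropy
```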

4. Experimentation: Datasets, Partitioning, and Hyperparameters

JaxMARL-HFT has been evaluated on over a year of Amazon (AMZN) LOBSTER MBO data (≈400 million messages). The environment slices overlapping episodes, each comprising $T = 64$ RL steps, with episode starts offset by 64 steps for maximal data utilization. At each RL step, one can replay either a minimal set (1) or a batch (100) of MBO messages, with all agent actions integrated per step. Maximum book depth is capped at 100 price levels per side.

Key experimental hyperparameterization includes:

  • Number of parallel rollouts: 4096 per GPU.
  • RL discount $\gamma = 0.99$, GAE $\lambda = 0.95$.
  • PPO parameters: clip $\epsilon = 0.2$, value-loss coefficient $c_1 = 0.5$, entropy coefficient $c_2 = 0.01$.
  • Adam optimizer: learning rate $3 \times 10^{-4}$.
  • PPO epochs per update: 10; minibatch size: 64.
  • GRU hidden dimension: 64; depth: 2 layers.
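
For reference, the reported settings could be collected into a single training configuration; the key names below are hypothetical and only group the values listed above.

```python
# Hypothetical config dict grouping the reported hyperparameters.
TRAIN_CONFIG = {
    "num_envs": 4096,          # parallel rollouts per GPU
    "gamma": 0.99,             # RL discount
    "gae_lambda": 0.95,
    "clip_eps": 0.2,           # PPO clipping
    "vf_coef": 0.5,            # value-loss weight c1
    "ent_coef": 0.01,          # entropy weight c2
    "learning_rate": 3e-4,     # Adam
    "ppo_epochs": 10,
    "minibatch_size": 64,
    "gru_hidden_dim": 64,
    "gru_layers": 2,
}
```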

Reported throughput: pure environment (evaluation) up to 350,000 steps/s (1 agent/type, 1 message/step); the full IPPO pipeline reaches up to 120,000 steps/s, with speed-ups of up to 240× over CPU-bound reference implementations as the agent count increases.

5. Benchmarking and Agent Performance

Benchmarks involve trained MM and EXEC agents, compared to strong hand-crafted baselines:

  • Market making baseline: Avellaneda–Stoikov optimal quotes. The learned MM exhibits an average performance gap of ≈0.2 ticks per step, interpreted as competitive with the baseline strategy under identical action-discretization constraints.
  • Order execution baseline: TWAP (time-weighted average price). Learned EXEC achieves 10–20% slippage reduction, equating to 0.15–0.25 ticks saved on average.
  • Two-player matrix testing: When both MM and EXEC deploy trained policies, their portfolio and slippage outcomes improve compared to either agent operating under its baseline, with adversarial effects observable (e.g., the EXEC agent incurs higher costs when the MM employs a learned, adaptive policy).

Training curves show monotonic improvement of core metrics (MM value, EXEC slippage), with action histograms confirming risk aversion behaviors emerging for MM. Comparative trajectory plots reveal distinct temporal order patterns versus TWAP for EXEC, suggesting learned adaptation to market microstructure.

6. Significance, Extensibility, and Availability

JaxMARL-HFT demonstrates for the first time the feasibility of large-scale, high-throughput MARL research on real MBO data at the scale of hundreds of millions of messages and fully heterogeneous agent populations. It transforms both environment simulation and policy optimization into a GPU-native, fully batched workflow, enabling both basic and complex MARL (and ABM with fixed policies) studies. The approach substantially reduces the prohibitive resource constraints present in prior frameworks, notably ABIDES-gym.

The framework’s support for custom observation/action/reward schemas per agent type provides considerable flexibility for future extensions, including adversarial scenarios and hybrid ABM–MARL research. The open-source codebase is maintained at https://github.com/vmohl/JaxMARL-HFT and is suitable both as a benchmark baseline and as a research foundation for extensions to more complex trading and market-simulation studies.
