
Deep reinforcement learning for optimal trading with partial information

Published 31 Oct 2025 in q-fin.TR, q-fin.CP, and stat.ML | (2511.00190v1)

Abstract: Reinforcement Learning (RL) applied to financial problems has been a lively area of research. The use of RL for optimal trading strategies that exploit latent information in the market has, to the best of our knowledge, not been widely tackled. In this paper we study an optimal trading problem where a trading signal follows an Ornstein-Uhlenbeck process with regime-switching dynamics. We employ a blend of RL and Recurrent Neural Networks (RNN) to extract as much underlying information as possible from the trading signal with latent parameters. The latent parameters driving the mean-reversion level, speed, and volatility are filtered from observations of the signal, and trading strategies are derived via RL. To address this problem, we propose three Deep Deterministic Policy Gradient (DDPG)-based algorithms that integrate Gated Recurrent Unit (GRU) networks to capture temporal dependencies in the signal. The first, a one-step approach (hid-DDPG), directly encodes hidden states from the GRU into the RL trader. The second and third are two-step methods: one (prob-DDPG) makes use of posterior regime probability estimates, while the other (reg-DDPG) relies on forecasts of the next signal value. Through extensive simulations with increasingly complex Markovian regime dynamics for the trading signal's parameters, as well as an empirical application to equity pair trading, we find that prob-DDPG achieves superior cumulative rewards and exhibits more interpretable strategies. By contrast, reg-DDPG provides limited benefits, while hid-DDPG offers intermediate performance with less interpretable strategies. Our results show that the quality and structure of the information supplied to the agent are crucial: embedding probabilistic insights into latent regimes substantially improves both the profitability and robustness of reinforcement learning-based trading strategies.

Summary

  • The paper demonstrates that incorporating probabilistic regime information using GRU-enhanced DDPG models significantly improves trading performance.
  • It compares three algorithms – hid-DDPG, prob-DDPG, and reg-DDPG – under simulated Markovian regimes to evaluate efficacy with cumulative rewards as a performance metric.
  • The study validates findings with NASDAQ pair-trading experiments, showing that the prob-DDPG approach achieves superior profitability and interpretability.

Deep Reinforcement Learning for Optimal Trading with Partial Information

Introduction

The use of reinforcement learning (RL) in financial applications, particularly for devising optimal trading strategies based on latent market information, presents a significant advancement in the field. The focal point of this research is the application of RL to scenarios where a trading signal is characterized by an Ornstein-Uhlenbeck process with regime-switching dynamics. Leveraging recurrent neural networks (RNNs), specifically Gated Recurrent Units (GRUs), this study aims to enhance the extraction of latent information from market signals to inform trading decisions.

Methodology

This research explores the problem of deriving optimal trading strategies from partially observable market information through three distinct Deep Deterministic Policy Gradient (DDPG) algorithms enhanced by GRUs. These algorithms address the challenges posed by latent parameters governing the signal's mean-reversion level, speed, and volatility, a setting commonly encountered in financial markets.

  1. hid-DDPG: This approach directly integrates the hidden states from the GRU into the RL framework, allowing the model to intuitively learn trading strategies without explicitly identifying market regimes.
  2. prob-DDPG: This two-step method involves initially estimating the probabilities of different market regimes using a structured classification approach, followed by using these probabilities to guide trading strategy development.
  3. reg-DDPG: Similar to prob-DDPG, but instead of regime classification, it predicts the next signal value to assist in optimizing trading actions. (The sketch below contrasts the three state inputs.)
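
The three algorithms differ mainly in what they hand the actor and critic as the state. The following minimal sketch contrasts the three state constructions; the function names, feature shapes, and the inclusion of the current signal value and inventory are assumptions for illustration, not the paper's exact interface.

```python
# Illustrative contrast between the state inputs of hid-DDPG, prob-DDPG and reg-DDPG.
# gru_encoder / regime_classifier / forecaster are placeholders for trained GRU-based models.
import numpy as np

def build_state(kind, signal_window, inventory,
                gru_encoder=None, regime_classifier=None, forecaster=None):
    """Return the state vector handed to the DDPG actor and critic at one decision step."""
    if kind == "hid":      # hid-DDPG: raw GRU hidden state summarising recent history
        features = gru_encoder(signal_window)         # vector in R^h
    elif kind == "prob":   # prob-DDPG: filtered posterior probabilities of each regime
        features = regime_classifier(signal_window)   # [P(regime 1), ..., P(regime K)]
    elif kind == "reg":    # reg-DDPG: point forecast of the next signal value
        features = forecaster(signal_window)          # scalar prediction
    else:
        raise ValueError(f"unknown variant: {kind}")
    # Every variant also observes the current signal value and the trader's inventory.
    return np.concatenate([np.atleast_1d(features), [signal_window[-1], inventory]])
```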

Each algorithm was tested in extensive simulations under Markovian regime dynamics of increasing complexity. The results in these synthetic environments provided insight into how embedding probabilistic regime information improves the profitability and interpretability of trading strategies.

Results

The results highlight that the prob-DDPG approach consistently outperforms the other methods, particularly in scenarios with complex regime dynamics. The ability to incorporate probabilistic insights about latent regimes led to superior cumulative rewards and more interpretable strategies. In contrast, the reg-DDPG demonstrated limited additional benefits over the one-step hid-DDPG, primarily due to the added complexity in predicting the next trading signal without explicit regime information.

Figure 1: Comparison of the cumulative rewards for the different DDPG approaches when θ_t is a Markov chain.

The analysis demonstrated that the quality and structure of the information provided to the RL agent are paramount. The implementation of regime probabilities as an input feature substantially enhances the effectiveness of RL-based trading models, a finding supported by both simulated and real-world market data.

Application to Real Market Data

To evaluate the real-world applicability, the algorithms were applied to a pair-trading strategy using historical NASDAQ data for Intel (INTC) and the SMH semiconductor ETF. The research found that utilizing RL to trade co-integrated assets based on a mean-reverting signal derived from their combined price dynamics resulted in positive performance, with the prob-DDPG method achieving the highest cumulative rewards.

Figure 2: Comparison of realised cumulative rewards for the hid-DDPG, prob-DDPG and rolling Z-score strategy on the testing set.
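
The rolling Z-score benchmark referenced in Figure 2 can be written down in a few lines. In the sketch below, the static hedge-ratio regression, the 60-step window, and the unit entry threshold are assumptions for illustration rather than the paper's exact construction.

```python
# A simple rolling Z-score pair-trading benchmark on a co-integrated pair.
import numpy as np
import pandas as pd

def rolling_zscore_strategy(price_a, price_b, window=60, entry=1.0):
    """Trade the spread between two co-integrated assets when it strays from its rolling mean."""
    log_a, log_b = np.log(np.asarray(price_a)), np.log(np.asarray(price_b))
    beta = np.polyfit(log_b, log_a, 1)[0]                  # static hedge ratio (assumed)
    spread = pd.Series(log_a - beta * log_b)
    z = (spread - spread.rolling(window).mean()) / spread.rolling(window).std()
    # Long the spread when it is unusually low, short when unusually high, flat otherwise.
    position = np.where(z < -entry, 1.0, np.where(z > entry, -1.0, 0.0))
    return pd.Series(position, index=spread.index), spread

# Hypothetical usage with price series for the traded pair:
# positions, spread = rolling_zscore_strategy(intc_prices, smh_prices)
```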

Conclusion

The study provides comprehensive evidence that embedding structured regime information into RL architectures significantly enhances performance in trading applications. The prob-DDPG method, by leveraging probabilities of regime presence, achieves higher accuracy and profitability, demonstrating its practical superiority. These findings suggest promising avenues for future research, particularly in exploring multi-agent scenarios and further integrating financial domain knowledge into RL frameworks to improve robustness and adaptability in dynamic market environments.


Explain it Like I'm 14

Overview

This paper explores how to teach a computer to trade in the stock market using deep reinforcement learning (a kind of AI that learns by trying, getting rewards, and improving). The key idea is to help the AI make good decisions even when some important market information is hidden. The authors design and test different ways to give the AI “clues” about these hidden parts so it can trade better.

Key Objectives

The researchers aim to answer simple questions:

  • Can an AI learn profitable trading strategies when the market behaves in different “moods” (regimes) that we can’t see directly?
  • What kind of information should we give the AI: raw patterns, probabilities of hidden regimes, or predictions of the next price?
  • Which approach leads to higher profits, clearer decisions, and stronger performance both in simulations and real market data?

Methods and Approach

Think of the trading signal (the number the AI uses to decide when to buy or sell) like a rubber band around a “normal” level: it stretches away but tends to snap back. This behavior is called “mean reversion.” The paper models the signal with an Ornstein–Uhlenbeck process, which is a fancy term for a mean-reverting signal with random wiggles. Sometimes the market’s “normal level,” its speed of snapping back, and its volatility change depending on the market’s hidden “mood” (regime). These regimes switch over time like seasons: you don’t always know which season you’re in, but patterns give you hints.
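
As a concrete picture of such a signal, the sketch below simulates a mean-reverting process whose level, speed of reversion, and volatility switch with a hidden two-state Markov chain. The regime parameters and transition probabilities are made-up values for demonstration, not the ones used in the paper.

```python
# Simulate a regime-switching Ornstein-Uhlenbeck signal (illustrative parameters).
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.99, 0.01],        # per-step regime transition probabilities (assumed)
              [0.02, 0.98]])
theta = [(0.0, 2.0, 0.2),          # regime 0: level 0, fast reversion, calm
         (0.5, 0.5, 0.5)]          # regime 1: shifted level, slow reversion, noisy

T, dt = 2000, 1.0 / 252            # number of steps, one trading day per step
x = np.zeros(T)                    # observed trading signal
z = np.zeros(T, dtype=int)         # hidden regime path (never shown to the trader)

for t in range(1, T):
    z[t] = rng.choice(2, p=P[z[t - 1]])
    mu, kappa, sigma = theta[z[t]]
    # Euler-Maruyama step of dX = kappa * (mu - X) dt + sigma dW
    x[t] = x[t - 1] + kappa * (mu - x[t - 1]) * dt + sigma * np.sqrt(dt) * rng.normal()
```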

The trading setup:

  • The AI chooses how much to hold (inventory) at each step.
  • Its reward is the profit from the signal’s movement minus trading costs.
  • It’s risk-neutral (it focuses on expected profit) and uses a discount factor (it cares more about money earned sooner than later). A minimal reward calculation is sketched right after this list.
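
Here is a minimal sketch of that reward bookkeeping, assuming a simple proportional trading cost; the function and argument names are ours, and the paper's exact cost model and reward specification may differ.

```python
# Discounted sum of per-step rewards: q_t * (x_{t+1} - x_t) - cost * |q_t - q_{t-1}|
def episode_reward(signal, inventory, cost=0.001, gamma=0.99):
    total, discount, prev_q = 0.0, 1.0, 0.0
    for t in range(len(signal) - 1):
        q = inventory[t]
        pnl = q * (signal[t + 1] - signal[t])   # profit from the signal's move
        fee = cost * abs(q - prev_q)            # penalty for changing the position
        total += discount * (pnl - fee)
        discount *= gamma                       # money earned sooner counts more
        prev_q = q
    return total
```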

The learning tools:

  • Reinforcement Learning (RL): Like a video game player, the AI tries actions, sees rewards, and learns a strategy to maximize profits.
  • Deep Deterministic Policy Gradient (DDPG): An RL method with two parts:
    • Actor: proposes an action (how much to buy or sell).
    • Critic: judges how good that action is.
  • Gated Recurrent Unit (GRU): A neural network that remembers past information (like a short-term memory) to understand time patterns in the signal. A small sketch combining a GRU with the actor-critic pair appears after this list.
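
Here is a minimal PyTorch sketch of how these pieces can fit together: a GRU supplies the memory, the actor maps it to a bounded position, and the critic scores state-action pairs. Layer sizes, the position bound q_max, and feeding the critic the GRU hidden state are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GRUActor(nn.Module):
    """GRU memory followed by a small head that outputs a position in [-q_max, q_max]."""
    def __init__(self, obs_dim=1, hidden_dim=32, q_max=1.0):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, 32), nn.ReLU(), nn.Linear(32, 1))
        self.q_max = q_max

    def forward(self, obs_seq):                     # obs_seq: (batch, time, obs_dim)
        _, h = self.gru(obs_seq)                    # final hidden state: the "memory"
        return self.q_max * torch.tanh(self.head(h[-1]))

class Critic(nn.Module):
    """Judges how good a proposed action (trade) is in a given state."""
    def __init__(self, state_dim=32, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```

In a hid-DDPG style setup the critic would receive the same GRU hidden state as the actor, while the two-step variants would replace it with regime probabilities or a signal forecast.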

They test three approaches to feeding information into the RL agent. Here’s a brief summary, using an everyday analogy:

  • hid-DDPG (one-step): The AI uses the GRU’s internal “memory state” directly. Think of it as giving the AI a gut feeling based on recent history.
  • prob-DDPG (two-step): First, a GRU estimates the probabilities of the hidden market regimes (like saying, “It’s 70% likely we’re in a high mean-reversion mood”). Then the AI uses these probabilities to make trading decisions (see the classifier sketch after this list).
  • reg-DDPG (two-step): First, a GRU predicts the next signal value. Then the AI uses that prediction to trade.
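
As a rough sketch of the first step of prob-DDPG referenced above, a GRU classifier can map a window of signal observations to regime probabilities. The network sizes, and the idea of fitting it with cross-entropy on simulated paths where the true regimes are known, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RegimeClassifier(nn.Module):
    """Maps a window of signal observations to a probability for each hidden regime."""
    def __init__(self, n_regimes=2, hidden_dim=32):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_regimes)

    def forward(self, signal_window):                    # (batch, time, 1)
        _, h = self.gru(signal_window)
        return torch.softmax(self.out(h[-1]), dim=-1)    # e.g. [0.7, 0.3]

# In simulation the true regime path is available, so one plausible setup is to train
# this classifier with cross-entropy before handing its outputs to the RL trader.
```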

They evaluate these methods in two ways:

  • Simulations: Artificial markets with different complexity (different regimes).
  • Real data: Pair trading in equities (trading the relationship between two stocks that tend to move together, betting that their difference will “snap back” to normal).

Main Findings

The results are clear and consistent:

  • prob-DDPG (using regime probabilities) performs best. It earns the highest cumulative rewards and makes more understandable decisions (you can see how the AI reacts to being in different regimes).
  • hid-DDPG (using GRU hidden states) does okay but not as well. Its strategies are harder to interpret.
  • reg-DDPG (using next-step predictions) gives only small benefits and often underperforms compared to prob-DDPG.

The big lesson: what you feed the AI matters. Giving it structured, interpretable information about the hidden regimes (probabilities) helps it trade more profitably and robustly than just giving it raw patterns or simple price forecasts.

Implications and Impact

This research suggests a practical path for building better trading AIs:

  • Don’t just predict prices—teach the AI about the market’s hidden “moods” with probabilities. This makes strategies stronger and easier to understand.
  • The approach works across both simulated and real markets (like pair trading).
  • Beyond finance, the idea applies to any problem where important factors are hidden but can be estimated (for example, energy demand, traffic flow, or health monitoring). Supplying the right kind of information—especially interpretable probabilities—can make learning systems smarter and more reliable.

In short, if you want an AI to make good decisions in a noisy, changing world, it helps to give it the right clues about what’s going on behind the scenes.
