Deep reinforcement learning for portfolio management (2012.13773v7)

Published 26 Dec 2020 in q-fin.CP, cs.LG, and q-fin.PM

Abstract: In our paper, we apply a deep reinforcement learning approach to optimize investment decisions in portfolio management. We make several innovations, such as adding a short-selling mechanism and designing an arbitrage mechanism, and apply our model to optimize decisions for several randomly selected portfolios. The experimental results show that our model is able to optimize investment decisions and obtain excess returns in the stock market, and that the optimized agent maintains the asset weights at fixed values throughout the trading periods while trading at a very low transaction cost rate. In addition, we redesign the formula for calculating portfolio asset weights in a continuous trading process to allow leveraged trading, which fills the theoretical gap in the calculation of portfolio weights when shorting.

Citations (1)

Summary

  • The paper introduces a DRL framework that integrates short selling and an explicit arbitrage mechanism to dynamically optimize portfolio allocations.
  • It employs actor-critic algorithms with tailored state representations and reward functions that factor in transaction costs and leverage constraints.
  • Empirical results show excess returns and stable asset weights, indicating a robust, low-turnover strategy applicable to real-world trading.

Deep Reinforcement Learning (DRL) presents a compelling paradigm for addressing the sequential decision-making problem inherent in portfolio management. Unlike traditional methods, which often rely on single-period optimization or explicit forecasting models, DRL agents can learn complex, dynamic allocation strategies directly from interactions with market data, adapting to changing conditions and optimizing cumulative rewards over extended horizons. The paper "Deep reinforcement learning for portfolio management" (Deep reinforcement learning for portfolio management, 2020) explores this application, introducing specific mechanisms for short selling and arbitrage within a DRL framework.

Problem Formulation and DRL Framework

The core task is framed as a Markov Decision Process (MDP), where the DRL agent interacts with the financial market environment at discrete time steps $t$.

  • State ($s_t \in \mathcal{S}$): The state representation is critical and typically includes information necessary for decision-making. This encompasses current market conditions and the agent's current holdings. Common components include:
    • Historical price information (e.g., closing prices, high, low, open) for the asset universe, often transformed into returns or normalized price series over a lookback window.
    • Technical indicators (e.g., moving averages, RSI, MACD) derived from price/volume data.
    • Volume information.
    • Current portfolio weights ($w_t$).
    • Potentially macroeconomic indicators or market sentiment data.
    • The state can be represented using vectors, matrices (e.g., price history for multiple assets), or tensors, often processed by input layers like Convolutional Neural Networks (CNNs) for spatial pattern detection across assets or Recurrent Neural Networks (RNNs), particularly LSTMs or GRUs, for temporal feature extraction.
  • Action ($a_t \in \mathcal{A}$): The action represents the target portfolio allocation for the next period, $w_{t+1}$. The action space is typically continuous, representing the weights assigned to each asset (including cash).
    • For a long-only portfolio of $n$ assets plus cash, the action space is the standard simplex: $\mathcal{A} = \{ w \in \mathbb{R}^{n+1} \mid \sum_{i=0}^{n} w_i = 1,\ w_i \ge 0 \}$.
    • The paper (Deep reinforcement learning for portfolio management, 2020) incorporates short selling, extending the action space to allow negative weights for non-cash assets ($w_i < 0$ for $i = 1, \dots, n$). Constraints, such as leverage limits (e.g., $\sum_{i=1}^{n} |w_i| \le L$, where $L$ is the maximum leverage ratio), are often imposed. The cash weight $w_0$ typically acts as the residual ensuring the weights sum appropriately, potentially becoming negative under leverage (a minimal weight-projection sketch appears after this list).
  • Reward ($r_t$): The reward function guides the learning process. It is commonly based on the change in portfolio value $V_t$. A frequent choice is the logarithmic return:

    $r_t = \log(V_{t+1} / V_t)$

    Crucially, transaction costs must be incorporated to prevent the agent from learning overly active strategies. The reward is penalized based on the magnitude of the weight changes:

    $r_t = \log(V_{t+1}' / V_t) - c \sum_{i=1}^{n} |\Delta w_i|$

    where $V_{t+1}'$ is the portfolio value after transaction costs, $c$ is the transaction cost rate, and $\Delta w_i$ is the change in weight for asset $i$ (a sketch of this cost-penalized reward follows this list). Other reward formulations might directly optimize risk-adjusted returns like the Sharpe ratio or Sortino ratio over a period, although this can increase complexity.

  • Policy ($\pi(a_t \mid s_t)$): The DRL agent learns a policy $\pi$ that maps states to actions (or a distribution over actions). The goal is to find the optimal policy $\pi^*$ that maximizes the expected discounted cumulative future reward:

    $J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]$

    where $\gamma \in [0, 1]$ is the discount factor.

  • DRL Algorithms: Given the continuous action space, Actor-Critic algorithms are well-suited. Examples include:
    • Deep Deterministic Policy Gradient (DDPG)
    • Twin Delayed DDPG (TD3)
    • Soft Actor-Critic (SAC)
    • Proximal Policy Optimization (PPO) can also be adapted for continuous spaces.
    • These algorithms typically use deep neural networks to approximate the policy (actor) and the value function or state-action value function (critic).
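
To make the action-space discussion concrete, the snippet below sketches one simple way of mapping unbounded actor outputs to leverage-constrained weights with a residual cash position. This is an illustrative convention, not the redesigned weight formula from the paper; the function name, the tanh squashing, and the default leverage cap are assumptions.

    import numpy as np

    def project_action(raw_output, max_leverage=1.5):
        """Map raw actor outputs for n risky assets to constrained weights.

        Returns [w_cash, w_1, ..., w_n]: risky weights may be negative
        (short positions) and gross risky exposure sum_i |w_i| is capped
        at max_leverage. Cash is the residual so all weights sum to 1 and
        may itself turn negative under leverage.
        """
        w_risky = np.tanh(np.asarray(raw_output, dtype=float))  # bound each risky weight to (-1, 1)

        gross = np.abs(w_risky).sum()
        if gross > max_leverage:                  # rescale if gross exposure exceeds the cap
            w_risky *= max_leverage / gross

        w_cash = 1.0 - w_risky.sum()              # cash absorbs the residual
        return np.concatenate(([w_cash], w_risky))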
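
Similarly, the cost-penalized reward defined above can be computed as in the following sketch, which mirrors $r_t = \log(V_{t+1}' / V_t) - c \sum_i |\Delta w_i|$; the names and the default cost rate are illustrative.

    import numpy as np

    def cost_penalized_reward(v_t, v_next, prev_weights, target_weights, cost_rate=0.001):
        """Log-return reward with an explicit turnover penalty.

        v_next is the portfolio value after transaction costs, cost_rate
        is c, and both weight vectors include the cash position.
        """
        turnover = np.abs(np.asarray(target_weights) - np.asarray(prev_weights)).sum()
        return np.log(v_next / v_t) - cost_rate * turnover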

Innovations from Paper (Deep reinforcement learning for portfolio management, 2020)

The paper introduces specific enhancements to the standard DRL portfolio management framework:

  • Short Selling Implementation: The inclusion of short selling requires careful handling of portfolio weights and value calculation. The paper proposes a redesigned formula for calculating portfolio asset weights in continuous trading processes, particularly addressing the theoretical gap when short positions and leverage are involved. This likely involves defining how portfolio value $V_t$ evolves considering both long and short positions and associated margin requirements, and ensuring weight calculations remain consistent during rebalancing periods, especially when using leverage. The exact formulation would be needed for precise implementation but is crucial for accurately modeling portfolios with short positions (a simplified, generic drift sketch follows this list).
  • Arbitrage Mechanism: An explicit arbitrage mechanism is designed. While details require the full paper, this could involve:
    • Adding specific state features designed to highlight potential arbitrage opportunities (e.g., price deviations in related assets, basis between futures and spot).
    • Modifying the reward function to incentivize exploiting such detected opportunities.
    • Structuring the agent's architecture to specifically look for and act on these patterns.
    • This aims to enhance profitability beyond directional bets by capturing risk-limited profits from market inefficiencies.
  • Performance Characteristics: The experiments reportedly show the agent achieves excess return and maintains asset weights at relatively fixed values, leading to low transaction costs. This stability is a desirable practical property, suggesting the DRL agent learned a robust, low-turnover allocation strategy rather than a high-frequency trading approach, potentially making it more applicable in real-world portfolio management scenarios where turnover is often minimized.
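
As a point of reference for the weight-calculation discussion above, the following is a simplified, frictionless sketch of how portfolio value and weights drift over one period when weights may be negative. It is a generic convention for illustration only, not the redesigned formula proposed in the paper; margin requirements, borrowing costs, and transaction costs are ignored.

    import numpy as np

    def drift_portfolio(value, weights, asset_returns):
        """One-period, frictionless update of portfolio value and weights.

        weights: [w_cash, w_1, ..., w_n], summing to 1; risky weights may
        be negative (shorts). asset_returns: simple returns of the risky
        assets over the period; cash is assumed to earn zero.
        """
        weights = np.asarray(weights, dtype=float)
        r = np.concatenate(([0.0], np.asarray(asset_returns, dtype=float)))

        growth = 1.0 + np.dot(weights, r)               # portfolio growth factor
        new_value = value * growth

        drifted_weights = weights * (1.0 + r) / growth  # weights after price moves, before rebalancing
        return new_value, drifted_weights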

Implementation Considerations

Translating this DRL approach into a working system involves several practical steps and challenges:

  • Market Environment Simulation: A high-fidelity backtesting environment is essential. This simulator must:
    • Provide historical market data (prices, volumes) for the chosen asset universe.
    • Accurately model transaction costs, including brokerage fees and market impact (slippage), especially for large orders. Slippage can be modeled as a function of trade size and asset volatility (a stylized cost/slippage model is sketched after this list).
    • Handle corporate actions (splits, dividends) correctly.
    • Allow for the simulation of short selling mechanics, including margin requirements and borrowing costs (though borrowing costs are often simplified or omitted in initial research).
    • Accurately implement the portfolio value calculation and rebalancing logic according to the chosen weight calculation method (e.g., the one proposed in (Deep reinforcement learning for portfolio management, 2020) for shorting).
  • State Representation Engineering:
    • The choice and preprocessing of input features significantly impact performance. Normalization (e.g., z-score) is standard.
    • Using returns ($p_t / p_{t-1} - 1$) instead of raw prices often helps with stationarity.
    • Combining CNNs (for cross-asset correlations) and LSTMs (for temporal dynamics) in the DRL agent's network architecture is a common pattern.
      # Example state preparation (conceptual)
      import numpy as np  # historical_data is assumed to be a pandas DataFrame

      def get_state(historical_data, current_weights, lookback_window):
          # historical_data: DataFrame of prices/volumes, one column per asset under 'close'
          # current_weights: numpy array of current portfolio weights

          # Price features (e.g., log returns); drop the NaN row created by the shift
          price_features = np.log(historical_data['close'] / historical_data['close'].shift(1)).dropna()

          # Technical indicators could be stacked as extra feature channels, e.g.:
          # sma = historical_data['close'].rolling(window=20).mean()
          # rsi = calculate_rsi(historical_data['close'], window=14)  # hypothetical helper
          # feature_matrix = np.stack([price_features, sma, rsi], axis=-1)

          # Select the lookback window of most recent observations
          state_market_data = price_features.iloc[-lookback_window:].values  # shape: (lookback_window, num_assets)

          # Combine market data and portfolio weights
          state = {
              "market_data": state_market_data,    # input for CNN/LSTM layers
              "current_weights": current_weights   # appended in later dense layers
          }
          return state
  • Action Space and Rebalancing:
    • The output layer of the actor network typically uses a softmax activation for long-only portfolios to ensure weights sum to 1.
    • For portfolios with shorting/leverage, the activation might be linear or tanh, followed by post-processing to enforce constraints (e.g., leverage limits, sum-to-one property if cash is dynamically adjusted). The method from (Deep reinforcement learning for portfolio management, 2020) for weight calculation under shorting needs to be integrated here.
    • The actual rebalancing trades are calculated based on the difference between the target weights ($a_t = w_{t+1}$) and the weights after market drift but before rebalancing ($w_t'$); see the rebalancing sketch after this list.
  • Reward Function Engineering:
    • Carefully balancing return maximization and transaction cost minimization is key. Too high a penalty on costs might lead to an overly static portfolio; too low might lead to excessive trading.
    • Risk-adjusted rewards (e.g., the differential Sharpe ratio) can be used but require careful implementation to ensure stable learning (a sketch follows this list).
    • The reward calculation must use the portfolio value after accounting for transaction costs resulting from the action ata_t.
  • Training and Hyperparameter Tuning:
    • DRL training is notoriously sensitive to hyperparameters (learning rates, discount factor $\gamma$, network architecture, exploration noise parameters, replay buffer size). Extensive tuning using validation sets is required.
    • Techniques like entropy regularization (in PPO, SAC) can encourage exploration.
    • Curriculum learning (starting with simpler tasks or fewer assets) might help.
    • Given the non-stationarity of financial markets, training on rolling windows of data or using techniques for continual learning might be necessary for practical deployment.
  • Risk Management Overlays:
    • While the DRL agent optimizes a reward function, explicit risk management rules might be needed in deployment. This could involve hard constraints on position sizes, maximum drawdown limits, or volatility targets, potentially overriding the agent's actions if thresholds are breached.
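
Following up on the market-simulation bullet above, a stylized transaction cost model combining a proportional fee with a volatility- and size-dependent impact term might look like the sketch below. The square-root impact form and the coefficients are common modeling assumptions, not taken from the paper.

    import numpy as np

    def transaction_cost(trade_value, daily_volume, volatility,
                         fee_rate=0.0005, impact_coeff=0.1):
        """Stylized cost of trading |trade_value| of a single asset.

        A proportional fee plus a square-root market impact term that grows
        with participation (trade size relative to daily volume) and with
        asset volatility. All coefficients are illustrative.
        """
        notional = abs(trade_value)
        participation = notional / max(daily_volume, 1e-12)
        impact = impact_coeff * volatility * np.sqrt(participation)
        return fee_rate * notional + impact * notional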
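
For the rebalancing step referenced above, the trades needed to move from drifted weights (e.g., as produced by the drift_portfolio sketch earlier) to the agent's target weights can be computed as follows; this ignores margin mechanics and assumes both vectors include cash at index 0.

    import numpy as np

    def rebalance_trades(portfolio_value, drifted_weights, target_weights):
        """Currency amounts to buy (+) or sell (-) per asset to reach target weights."""
        delta = np.asarray(target_weights, dtype=float) - np.asarray(drifted_weights, dtype=float)
        return delta * portfolio_value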
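
For the risk-adjusted reward option mentioned above, one incremental formulation is the differential Sharpe ratio of Moody and Saffell, which maintains exponential moving estimates of the first and second moments of returns. The sketch below follows that standard recursion; the adaptation rate eta is a tunable assumption.

    class DifferentialSharpe:
        """Incremental (differential) Sharpe ratio reward signal."""

        def __init__(self, eta=0.01):
            self.eta = eta   # adaptation rate of the moving estimates
            self.A = 0.0     # exponential moving average of returns
            self.B = 0.0     # exponential moving average of squared returns

        def update(self, R):
            dA = R - self.A
            dB = R * R - self.B
            denom = (self.B - self.A ** 2) ** 1.5
            # Early on the variance estimate is ~0; guard the division
            D = 0.0 if denom < 1e-12 else (self.B * dA - 0.5 * self.A * dB) / denom
            self.A += self.eta * dA
            self.B += self.eta * dB
            return D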

Evaluation and Benchmarking

Thorough evaluation is critical to assess the practical viability of a DRL-based portfolio management strategy.

  • Performance Metrics: Beyond cumulative return, evaluate the following (a small computation sketch follows this list):
    • Risk-Adjusted Returns: Sharpe Ratio, Sortino Ratio.
    • Risk Metrics: Maximum Drawdown (MDD), Volatility, Value-at-Risk (VaR), Conditional Value-at-Risk (CVaR).
    • Trading Activity: Annualized Turnover, average transaction costs incurred. The low turnover reported in (Deep reinforcement learning for portfolio management, 2020) is a positive sign.
  • Benchmarking: Compare performance against relevant benchmarks:
    • Passive benchmarks (e.g., market index, equal-weighted portfolio).
    • Traditional quantitative strategies (e.g., Mean-Variance Optimization, Risk Parity, Momentum).
    • Simpler DRL agents (e.g., long-only without arbitrage/shorting).
  • Robustness Checks:
    • Out-of-Sample Testing: Evaluate on data periods not used during training or validation.
    • Cross-Validation: Train and test on different time periods to assess sensitivity to market regimes.
    • Sensitivity Analysis: Test how performance changes with different transaction cost assumptions, leverage limits, or asset universes.
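
As a concrete reference for the metrics above, the sketch below computes an annualized Sharpe ratio, maximum drawdown, and average turnover from a series of periodic portfolio returns and weights. The annualization factor of 252 assumes daily data, and the risk-free rate is taken as zero for simplicity.

    import numpy as np

    def evaluate_strategy(returns, weight_history, periods_per_year=252):
        """Basic performance metrics from periodic returns and weights.

        returns: array of simple portfolio returns per period.
        weight_history: array of shape (T, n_assets) of weights over time.
        """
        returns = np.asarray(returns, dtype=float)
        weights = np.asarray(weight_history, dtype=float)

        # Annualized Sharpe ratio (risk-free rate assumed zero)
        sharpe = np.sqrt(periods_per_year) * returns.mean() / (returns.std() + 1e-12)

        # Maximum drawdown from the cumulative wealth curve
        wealth = np.cumprod(1.0 + returns)
        max_drawdown = (1.0 - wealth / np.maximum.accumulate(wealth)).max()

        # Average one-period turnover (sum of absolute weight changes)
        turnover = np.abs(np.diff(weights, axis=0)).sum(axis=1).mean()

        return {"sharpe": sharpe, "max_drawdown": max_drawdown, "turnover": turnover}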

Limitations and Challenges

  • Non-Stationarity: Financial markets evolve, and strategies learned on past data may not perform well in the future. Continuous adaptation or retraining is likely required.
  • Overfitting: DRL agents can easily overfit to the specific patterns in the training data, especially with complex models and noisy financial data. Strong regularization, large datasets, and rigorous out-of-sample testing are crucial.
  • Simulation vs. Reality Gap: Backtests may not perfectly capture real-world trading conditions (e.g., unexpected liquidity gaps, latency, catastrophic events).
  • Interpretability: Understanding the rationale behind the DRL agent's decisions can be difficult, posing challenges for trust and regulatory compliance. Techniques from explainable AI (XAI) might be applicable but are still an active research area in DRL.
  • Computational Cost: Training sophisticated DRL agents on large datasets can be computationally intensive, requiring significant GPU resources.

Conclusion

DRL offers a powerful, adaptive framework for portfolio management, capable of learning complex strategies directly from market data. The paper (Deep reinforcement learning for portfolio management, 2020) contributes by specifically integrating mechanisms for short selling and arbitrage, along with a revised methodology for weight calculation suitable for continuous trading under these conditions. The reported finding of excess returns with stable weights and low costs highlights the potential for DRL to generate practical, efficient investment strategies. However, successful implementation demands careful attention to state representation, reward design, realistic simulation including transaction costs, rigorous evaluation, and addressing inherent challenges like non-stationarity and overfitting.