Portfolio-Vector Memory (PVM) in Deep RL
- Portfolio-Vector Memory (PVM) is a dedicated memory structure in deep RL that stores previous portfolio allocations to manage transaction costs effectively.
- It supports asynchronous mini-batch training by decoupling state transitions, which accelerates policy optimization and stabilizes gradient flow in recurrent networks.
- Empirical results demonstrate significant improvements in risk-adjusted returns and portfolio performance, reflected in higher accumulated portfolio values and Sharpe ratios.
Portfolio-Vector Memory (PVM) is a dedicated read/write buffer used in deep reinforcement learning (RL) frameworks for financial portfolio management. It was first formalized by Jiang et al. in "A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem" (Jiang et al., 2017). PVM functions as a deterministic replay buffer that stores a chronological record of every portfolio-weight vector produced by the trading agent, so that both the agent’s policy and its reward function can access the previous portfolio allocation at every time step. This enables explicit handling of transaction costs, supports asynchronous mini-batch training by decoupling state transitions, and stabilizes gradient flow in recurrent neural architectures. The architecture has shown marked improvements in risk-adjusted and absolute performance in RL-based financial trading across several benchmarks (Jiang et al., 2017; Li, 2024).
1. Formal Definition and Data Structure
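PVM is instantiated as a matrix $\mathrm{PVM} \in \mathbb{R}^{T \times N}$, where $T$ is the total number of trading periods and $N$ the number of tradable assets including a "cash" channel. Each row $\mathbf{w}_t = \mathrm{PVM}[t]$ corresponds to the agent’s portfolio at period $t$, constrained to the probability simplex:
$$\mathbf{w}_t \in \Delta^{N-1} = \left\{ \mathbf{w} \in \mathbb{R}^{N} : w_i \ge 0,\ \sum_i w_i = 1 \right\}$$
At each decision step $t$, the agent reads $\mathbf{w}_{t-1}$ (the previous period's allocation) from PVM and, following policy execution, overwrites slot $\mathrm{PVM}[t]$ with the new portfolio vector $\mathbf{w}_t$. This mechanism enables direct Markovian state construction:
$$s_t = (X_t, \mathbf{w}_{t-1})$$
where $X_t$ is the period-$t$ price/history tensor.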
2. Initialization and Read/Write Semantics
Prior to training or back-testing, PVM is filled as follows (Jiang et al., 2017, Li, 2024):
- The entry for $t = 0$ is typically all-cash, $\mathbf{w}_0 = (1, 0, \dots, 0)^\top$.
- For $t \ge 1$, PVM may be initialized as uniform, $\mathbf{w}_t = \frac{1}{N}\mathbf{1}$.
During execution, the following cycle governs PVM access:
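- Read: At time $t$, retrieve $\mathbf{w}_{t-1} = \mathrm{PVM}[t-1]$.
- Policy computation: Compute $\mathbf{w}_t = \pi_\theta(X_t, \mathbf{w}_{t-1})$.
- Write: Set $\mathrm{PVM}[t] \leftarrow \mathbf{w}_t$.
The buffer grows dynamically with the trading horizon; in online or batch settings, $T$ matches the number of observed periods. No smoothing or interpolation is applied; overwriting is hard ($\mathrm{PVM}[t] \leftarrow \mathbf{w}_t$) (Li, 2024).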
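The initialization and read/write cycle above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the class name `PortfolioVectorMemory`, its method names, and the choice of cash at index 0 are assumptions made for the example, and the buffer is preallocated rather than grown dynamically.

```python
import math


class PortfolioVectorMemory:
    """Minimal PVM sketch: one weight vector per trading period."""

    def __init__(self, num_periods, num_assets):
        # Slot 0 is all-cash (cash assumed at index 0); later slots start uniform.
        cash = [1.0] + [0.0] * (num_assets - 1)
        uniform = [1.0 / num_assets] * num_assets
        self.slots = [cash] + [list(uniform) for _ in range(num_periods)]

    def read(self, t):
        """Return a copy of the allocation stored for period t - 1."""
        if t < 1:
            raise IndexError("t must be >= 1 so that slot t - 1 exists")
        return list(self.slots[t - 1])

    def write(self, t, weights):
        """Hard-overwrite slot t after checking the simplex constraint."""
        if any(w < 0.0 for w in weights):
            raise ValueError("weights must be non-negative")
        if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
            raise ValueError("weights must sum to 1")
        self.slots[t] = list(weights)


pvm = PortfolioVectorMemory(num_periods=3, num_assets=4)
previous = pvm.read(1)                # all-cash initial allocation
pvm.write(1, [0.1, 0.2, 0.3, 0.4])    # stand-in for a policy output
```

Validating on write keeps every stored row on the simplex, so later reads never need to re-normalize.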
3. Integration with Network Architectures
The PVM is incorporated with deep RL policy networks under the Ensemble of Identical Independent Evaluators (EIIE) topology, Online Stochastic Batch Learning (OSBL), and various neural network backbones (Jiang et al., 2017):
- CNN variant: $\mathbf{w}_{t-1}$ is concatenated or appended as an additional channel just before the softmax output head.
- RNN/LSTM variants: Price features are processed by the recurrent block; after sequence encoding, the hidden representation is concatenated with $\mathbf{w}_{t-1}$ prior to the output layer.
- EIIE: Each Identical Independent Evaluator stream receives both the asset-specific price-history and the corresponding previous portfolio element from PVM. All evaluators share network weights.
- OSBL: Mini-batch samples at random periods $t_b$ read from and write to their respective PVM slots in parallel, breaking sequential dependence and enabling large-batch, non-chronological updates while maintaining correct transaction cost modeling.
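To make the EIIE-style wiring concrete, the sketch below scores each non-cash asset with the same parameters, feeds the asset's previous weight in as an extra input, and applies a softmax over the scores plus a cash score. It is a deliberately simplified stand-in: a single linear scoring step replaces the convolutional or recurrent layers, and the names `eiie_head`, `shared_weights`, `w_prev_weight`, and `cash_bias` are hypothetical.

```python
import math


def eiie_head(asset_features, w_prev, shared_weights, w_prev_weight, cash_bias):
    """Score each non-cash asset with shared parameters, then softmax.

    asset_features: one feature list per non-cash asset (encoder output)
    w_prev:         previous allocation read from PVM, cash at index 0
    """
    scores = [cash_bias]  # cash gets a fixed score instead of an evaluator
    for features, w_i in zip(asset_features, w_prev[1:]):
        # The same parameters are applied to every asset.
        score = sum(a * f for a, f in zip(shared_weights, features))
        score += w_prev_weight * w_i  # previous weight as an extra input
        scores.append(score)
    # Numerically stable softmax maps the scores onto the simplex.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


features = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]  # 3 assets, 2 features each
w_prev = [0.4, 0.3, 0.2, 0.1]                       # cash + 3 assets
w_new = eiie_head(features, w_prev, [1.0, 0.5], 2.0, 0.0)
```

Because the parameters are shared, two assets with identical features and identical previous weights always receive the same new weight.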
The following pseudo-code fragment captures the central interface:
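    w_prev = PVM[t - 1]            # read the previous allocation
    w_t    = policy(X_t, w_prev)   # compute the new allocation
    PVM[t] = w_t                   # hard overwrite of slot t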
4. Impact on Learning Dynamics and Performance
PVM serves three essential purposes (Jiang et al., 2017, Li, 2024):
- Transaction cost handling: By supplying the previous portfolio to the agent, PVM allows the network to penalize large reallocations, directly internalizing transaction costs and discouraging frequent rebalancing.
- Mini-batch training: By decoupling state transitions, PVM allows fully asynchronous, parallelized mini-batch updates sampled out of sequence, greatly accelerating policy optimization.
- Gradient flow: In RNN/LSTM architectures, PVM breaks long-range gradient propagation, mitigating vanishing/exploding gradients and confining the credit assignment problem to the local time window.
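The out-of-sequence mini-batch sampling mentioned above can be sketched as follows. The sketch assumes a geometric recency bias, where a batch start $t_b$ is drawn with probability proportional to $\beta(1-\beta)^{t - t_b - n_b}$, and assumes each batch covers `batch_size` consecutive periods. The function names and the value `beta=0.01` are illustrative choices, not taken from the source.

```python
import math
import random


def sample_batch_start(t_now, batch_size, beta, rng):
    """Draw a batch start t_b >= 1, favouring recent periods.

    P(t_b) is proportional to beta * (1 - beta) ** (t_now - t_b - batch_size).
    """
    latest = t_now - batch_size
    if latest < 1:
        raise ValueError("not enough history for one batch")
    while True:
        # Inverse-CDF draw of a geometric offset k = 0, 1, 2, ...
        k = int(math.log(1.0 - rng.random()) / math.log(1.0 - beta))
        if latest - k >= 1:  # reject starts before the first usable slot
            return latest - k


def batch_io_indices(t_b, batch_size):
    """PVM slots read (previous weights) and written (new weights)."""
    periods = range(t_b, t_b + batch_size)
    return [t - 1 for t in periods], list(periods)


rng = random.Random(0)
t_b = sample_batch_start(t_now=1000, batch_size=50, beta=0.01, rng=rng)
read_idx, write_idx = batch_io_indices(t_b, 50)
```

Each training step reads the stored rows at `read_idx` as the previous allocations and overwrites the rows at `write_idx` with the network outputs, so no rollout from the start of the history is needed.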
Empirical evaluation in cryptocurrency markets shows how much PVM contributes. In Jiang et al. (2017), the CNN with PVM reached a final accumulated portfolio value (fAPV) roughly five to eight times that of the iCNN without PVM, and a higher Sharpe ratio in every back-test. The benchmark results are summarized below:
| Experiment Span | fAPV (iCNN, no PVM) | fAPV (CNN+PVM) | Sharpe (iCNN) | Sharpe (CNN+PVM) |
|---|---|---|---|---|
| 2016-09-07 to 2016-10-28 | 4.542× | 29.695× | 0.053 | 0.087 |
| 2016-12-08 to 2017-01-28 | 1.573× | 8.026× | 0.022 | 0.059 |
| 2017-03-07 to 2017-04-27 | 3.958× | 31.747× | 0.044 | 0.076 |
In recurrent models, PVM enabled deeper network deployment and more stable learning (Jiang et al., 2017, Li, 2024).
5. Mathematical Role in Reward Computation
The RL reward at period $t$ explicitly depends on the PVM entry for $t-1$:
$$r_t = \ln\left(\mu_t \, \mathbf{y}_t \cdot \mathbf{w}_{t-1}\right)$$
where $\mathbf{y}_t$ is the price-relative vector for all assets at $t$, and $\mu_t$ is the transaction-remainder function (a deterministic function of $\mathbf{w}_{t-1}$, $\mathbf{w}_t$, and $\mathbf{y}_t$) encoding transaction costs. There is no direct loss regularization on PVM; its contribution is solely via reward coupling (Jiang et al., 2017; Li, 2024).
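A small worked example may help show how the stored allocation enters the reward. The sketch assumes the log-return form r = ln(mu * (y · w_prev)) and, as a simplification, approximates the transaction-remainder factor by mu ≈ 1 − c · Σ|w'_i − w_i| over non-cash assets, where w' is the previous allocation after price drift. That approximation, the function name `reward`, and the cost rate of 0.25% are assumptions of this example; the exact factor is typically obtained iteratively.

```python
import math


def reward(w_prev, w_new, y, cost_rate):
    """Log-return for one period, r = ln(mu * (y . w_prev)).

    mu is approximated as 1 - c * sum_i |w'_i - w_new_i| over non-cash
    assets (index 0 is cash), where w' is w_prev after price drift.
    """
    growth = sum(y_i * w_i for y_i, w_i in zip(y, w_prev))  # y . w_prev
    drifted = [y_i * w_i / growth for y_i, w_i in zip(y, w_prev)]
    turnover = sum(abs(d - w) for d, w in zip(drifted[1:], w_new[1:]))
    mu = 1.0 - cost_rate * turnover  # fraction of value left after costs
    return math.log(mu * growth)


# Moving from all-cash into two assets pays costs on the full turnover.
r_trade = reward([1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [1.0, 1.1, 0.9], 0.0025)

# Holding the drifted portfolio incurs no cost, so the reward is ln(y . w_prev).
w_prev, y = [0.0, 0.5, 0.5], [1.0, 1.2, 1.0]
growth = sum(a * b for a, b in zip(y, w_prev))
drifted = [a * b / growth for a, b in zip(y, w_prev)]
r_hold = reward(w_prev, drifted, y, 0.0025)
```

The first call shows why the previous allocation must be available at training time: without `w_prev`, neither the portfolio growth nor the turnover penalty can be computed.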
6. Hyperparameters and Algorithmic Considerations
Key configuration and operational parameters for PVM include (Jiang et al., 2017, Li, 2024):
- Memory slots: One per trading period; total size $T$ set by history plus online steps.
- Read/write frequency: At every period (e.g., every 30 minutes in cryptocurrency experiments).
- Initialization: Uniform or all-cash; reflects initial state assumptions.
- OSBL recency bias $\beta$: Governs probability of sampling more recent periods for mini-batches (e.g., $\beta = 5 \times 10^{-5}$).
- No separate learning rate: PVM is not trainable; policy networks are optimized (e.g., via Adam, learning rate $3 \times 10^{-5}$).
7. Empirical Observations and Limitations
When evaluated on cryptocurrency markets, agents employing PVM achieve superior returns and risk characteristics compared to previous methods and to variants omitting the PVM buffer (Jiang et al., 2017, Li, 2024). The PVM mechanism enables strict Markovian state construction, stable transaction cost management, and scalable deep RL optimization. When transferred to stock markets, however, performance benefits were less pronounced (Li, 2024). This suggests that the structural market differences (e.g., liquidity, asset correlation, volatility) may influence the efficacy of the PVM’s transaction cost management or that further adaptation is required for domains outside cryptocurrency.
References
- "A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem" (Jiang et al., 2017)
- "A Deep Reinforcement Learning Framework For Financial Portfolio Management" (Li, 2024)