
FinRL-DeepSeek: LLM-Infused Risk-Sensitive Reinforcement Learning for Trading Agents (2502.07393v1)

Published 11 Feb 2025 in q-fin.TR

Abstract: This paper presents a novel risk-sensitive trading agent combining reinforcement learning and LLMs. We extend the Conditional Value-at-Risk Proximal Policy Optimization (CPPO) algorithm, by adding risk assessment and trading recommendation signals generated by a LLM from financial news. Our approach is backtested on the Nasdaq-100 index benchmark, using financial news data from the FNSPID dataset and the DeepSeek V3, Qwen 2.5 and Llama 3.3 LLMs. The code, data, and trading agents are available at: https://github.com/benstaf/FinRL_DeepSeek

Summary

  • The paper presents FinRL-DeepSeek, integrating LLM-extracted stock recommendation and risk scores from financial news into risk-sensitive deep reinforcement learning for automated trading.
  • It extends PPO to a risk-aware CVaR-PPO version, using LLM scores to modulate trading actions and adjust the risk-sensitive trajectory return.
  • Empirical results show LLM signal infusion improves performance, particularly for the risk-sensitive variant (CPPO), and enhances robustness in bearish markets.

The paper presents a technical framework that integrates LLM-derived signals into risk-sensitive deep reinforcement learning (RL) algorithms for automated trading. It extends existing algorithms by incorporating both stock recommendation scores and risk assessments extracted from financial news into the RL update process. The following points summarize the key contributions and findings in detail:

Algorithmic Integration

  • LLM-based Signal Extraction
    • Stock Recommendation Score ($S_f$): Quantifies the bullish or bearish recommendation for a specific stock on a discrete scale (1 to 5).
    • Risk Assessment Score ($R_f^i$): Categorizes market risk for individual stocks, also on a 1 to 5 scale.
    • These scores are integrated into the trading decision process. For trading actions, the baseline action $a_t$ is modulated as:

$$a_t^{\text{mod}} = S_f \cdot a_t$$

where values of $S_f$ close to 1 preserve algorithmic stability while adjusting exposure. For risk adjustment, the aggregated risk score is computed as:

$$R_f = \sum_i w_i R_f^i$$

and used to scale the trajectory return in the risk-sensitive objective:

$$D_{R_f}(\pi_\theta) = R_f \cdot D(\pi_\theta)$$

with $w_i$ being the portfolio weights.
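The following is a minimal sketch of how these two signals could enter the trading loop. The mapping from the 1–5 recommendation score to a multiplier $S_f$ near 1, the `strength` parameter, and the function names are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def modulate_actions(actions, llm_scores, strength=0.1):
    """Scale the RL agent's per-stock actions by LLM recommendation scores.

    `llm_scores` follow the paper's 1-5 scale (1 = strongly bearish,
    5 = strongly bullish). The mapping to a multiplier S_f near 1 is an
    assumption here: a neutral score (3) leaves the action unchanged, and
    `strength` (the "infusion strength") bounds how far S_f drifts from 1.
    """
    s_f = 1.0 + strength * (np.asarray(llm_scores) - 3.0) / 2.0  # S_f in [1-strength, 1+strength]
    return s_f * np.asarray(actions)                              # a_t^mod = S_f * a_t

def aggregate_risk(portfolio_weights, llm_risk_scores):
    """Portfolio-level risk factor R_f = sum_i w_i * R_f^i (scores on a 1-5 scale)."""
    return float(np.dot(np.asarray(portfolio_weights), np.asarray(llm_risk_scores)))

# Example: three stocks, agent actions in shares, LLM signals bullish/neutral/bearish.
a_mod = modulate_actions(actions=[10, -5, 2], llm_scores=[5, 3, 1], strength=0.1)
r_f = aggregate_risk(portfolio_weights=[0.5, 0.3, 0.2], llm_risk_scores=[2, 3, 5])
```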

  • Extension of PPO Algorithms: The paper extends the classical Proximal Policy Optimization (PPO) algorithm by formulating a risk-aware version, Conditional Value-at-Risk PPO (CVaR-PPO). The standard PPO objective is:

$$L_{\text{PPO}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) \cdot A_t,\; \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \cdot A_t\right)\right]$$

where:

  • $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio,
  • $A_t$ is the advantage estimate,
  • $\epsilon$ is the clip parameter.

The CVaR-PPO formulation augments this objective by penalizing tail-risk outcomes with an additional term governed by a Lagrange multiplier $\lambda$ and a CVaR threshold $\eta$; the risk-sensitive loss component forces consideration of worst-case returns at a given confidence level $\alpha$.
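A compact PyTorch-style sketch of a clipped PPO loss with a CVaR penalty in the Rockafellar–Uryasev form is given below. The exact way the paper combines $\lambda$, $\eta$, and the LLM risk factor $R_f$ may differ; treat this as an illustration of the structure of the objective, with all names and default values assumed.

```python
import torch

def cvar_ppo_loss(ratio, advantage, traj_returns, llm_risk_factor,
                  eps=0.2, lam=0.5, eta=torch.tensor(0.0), alpha=0.05):
    """Clipped PPO surrogate plus a CVaR tail-risk penalty (sketch).

    ratio:           pi_theta(a|s) / pi_theta_old(a|s), shape [T]
    advantage:       advantage estimates A_t, shape [T]
    traj_returns:    per-trajectory returns D(pi_theta), shape [N]
    llm_risk_factor: R_f used to rescale trajectory returns (the paper's D_{R_f})
    eps, lam, eta, alpha: clip range, Lagrange multiplier, CVaR threshold
                          (an auxiliary variable, optimized in practice), tail level
    """
    # Standard clipped PPO objective (maximized, so negated below for a loss).
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    ppo_obj = torch.minimum(unclipped, clipped).mean()

    # Risk-adjusted trajectory returns D_{R_f} = R_f * D(pi_theta).
    adj_returns = llm_risk_factor * traj_returns

    # Rockafellar-Uryasev CVaR estimate of losses (negative returns) at tail level alpha.
    losses = -adj_returns
    cvar = eta + torch.relu(losses - eta).mean() / alpha

    # Lagrangian: maximize the PPO objective while penalizing tail risk.
    return -ppo_obj + lam * cvar
```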

Data and Implementation

  • The dataset used is the FNSPID dataset, originally comprising 15.7 million time-aligned news records from 1999–2023. Representative sampling reduces the volume to approximately 2 million records, a reduction necessitated by LLM API cost constraints.
  • Three LLMs—DeepSeek V3, Qwen 2.5 72B, and Llama 3.3 70B—are employed to extract the recommendation and risk scores using specifically designed prompts.
  • The LLM signals are incorporated both by modulating trading actions (through $S_f$) and by adjusting risk via the rescaled trajectory return in the CVaR-PPO framework.
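Below is a hedged sketch of the prompt-based score extraction, written against an OpenAI-compatible chat API (DeepSeek, Qwen, and Llama are commonly served behind such endpoints). The prompt wording, model name, base URL, and parsing are placeholders rather than the paper's exact prompts.

```python
import re
from openai import OpenAI  # assumes the openai>=1.0 client and an OpenAI-compatible endpoint

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")  # placeholder endpoint/key

PROMPT = (
    "You are a financial analyst. Given the news item below about {ticker}, "
    "reply with two integers from 1 to 5: recommendation (1=strong sell ... 5=strong buy) "
    "and risk (1=very low ... 5=very high).\n"
    "Format: recommendation=<n>, risk=<n>\n\nNews: {news}"
)

def extract_scores(ticker: str, news: str, model: str = "deepseek-chat"):
    """Ask the LLM for a 1-5 recommendation and risk score for one news item."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(ticker=ticker, news=news)}],
        temperature=0.0,
    )
    digits = re.findall(r"\d", resp.choices[0].message.content)
    # Fall back to neutral scores (3) if the reply does not parse cleanly.
    return (int(digits[0]), int(digits[1])) if len(digits) >= 2 else (3, 3)
```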

Empirical Evaluation

  • Training and Backtesting Protocols:
    • Early experiments with 500k training steps (3 years of training data, trading in 2023) indicate that infusing LLM-derived signals consistently improves cumulative returns, although this initial integration is insufficient to outperform the Nasdaq-100 benchmark.
    • Training over a longer data horizon (e.g., 400k training steps spanning 6 years of data) yields significant performance improvements, particularly for the risk-sensitive variant (CPPO), although baseline PPO displays increased volatility.
  • Numerical Results and Robustness:

After 2 million training steps (100 epochs of 20k steps each), the reported metrics are summarized as follows (a sketch of how these metrics can be computed is given after this subsection):

  • PPO after 100 epochs:
    • Information Ratio: 0.0100
    • CVaR: -0.0394
    • Rachev Ratio: 1.0637
  • CPPO after 100 epochs:
    • Information Ratio: -0.0148
    • CVaR: -0.0439
    • Rachev Ratio: 1.0404
  • LLM-Infused Variants:
    • For PPO-DeepSeek (LLM-infused) at 10% infusion strength, the metrics indicate an Information Ratio of -0.0093, CVaR -0.0338, and Rachev Ratio 0.9890.
    • In contrast, CPPO-DeepSeek at 10% infusion yields an Information Ratio of 0.0078, CVaR -0.0437, and Rachev Ratio 0.9818.
    • Experiments adjusting the infusion strength (ranging from 0.1% to 10%) reveal that stronger LLM infusion degrades PPO performance but enhances CPPO performance.
  • Market Regime Sensitivity:

The analysis suggests that PPO performs comparatively better under bullish conditions, whereas CPPO-DeepSeek exhibits more robust behavior in bearish markets. This performance transition is notably aligned with market regime shifts observed around late 2021 during geopolitical and economic turbulence.
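To make the reported metrics concrete, here is a small NumPy sketch of how Information Ratio, CVaR, and Rachev Ratio are typically computed from a series of returns; the 5% tail levels, daily granularity, and lack of annualization are assumptions rather than the paper's exact evaluation settings.

```python
import numpy as np

def information_ratio(portfolio_returns, benchmark_returns):
    """Mean active return divided by its standard deviation (per period, unannualized)."""
    active = np.asarray(portfolio_returns) - np.asarray(benchmark_returns)
    return active.mean() / active.std(ddof=1)

def cvar(returns, alpha=0.05):
    """Average return over the worst `alpha` fraction of periods (negative in bad regimes)."""
    r = np.sort(np.asarray(returns))
    k = max(1, int(np.ceil(alpha * len(r))))
    return r[:k].mean()

def rachev_ratio(returns, alpha=0.05):
    """Average gain in the best alpha-tail divided by average loss in the worst alpha-tail."""
    r = np.sort(np.asarray(returns))
    k = max(1, int(np.ceil(alpha * len(r))))
    return r[-k:].mean() / (-r[:k].mean())

# Example with synthetic daily returns:
rng = np.random.default_rng(0)
port, bench = rng.normal(4e-4, 0.01, 250), rng.normal(3e-4, 0.01, 250)
print(information_ratio(port, bench), cvar(port), rachev_ratio(port))
```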

Future Directions

The paper also outlines several avenues for further research:

  • Memory Optimization: The increase in training steps correlates with a dramatic rise in RAM requirements (from 16 GB at 500k steps to 128 GB at 2 million steps), signaling the need for efficiency improvements.
  • Decision Timescale Adjustment: Shortening the decision-making timescale may better capture rapid market dynamics.
  • Enhancement of News Signal Quality: Refining the extraction and quality of financial news features is expected to drive further performance improvements.

This integration of LLM-derived signals into risk-sensitive reinforcement learning represents an incremental yet significant contribution to algorithmic trading, rigorously combining qualitative information from financial news with quantitative trading strategies.
