
Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning (2509.11420v1)

Published 14 Sep 2025 in q-fin.TR, cs.AI, cs.CE, cs.CL, and cs.LG

Abstract: Developing professional, structured reasoning on par with human financial analysts and traders remains a central challenge in AI for finance, where markets demand interpretability and trust. Traditional time-series models lack explainability, while LLMs face challenges in turning natural-language analysis into disciplined, executable trades. Although reasoning LLMs have advanced in step-by-step planning and verification, their application to risk-sensitive financial decisions is underexplored. We present Trading-R1, a financially-aware model that incorporates strategic thinking and planning for comprehensive thesis composition, facts-grounded analysis, and volatility-adjusted decision making. Trading-R1 aligns reasoning with trading principles through supervised fine-tuning and reinforcement learning with a three-stage easy-to-hard curriculum. Training uses Tauric-TR1-DB, a 100k-sample corpus spanning 18 months, 14 equities, and five heterogeneous financial data sources. Evaluated on six major equities and ETFs, Trading-R1 demonstrates improved risk-adjusted returns and lower drawdowns compared to both open-source and proprietary instruction-following models as well as reasoning models. The system generates structured, evidence-based investment theses that support disciplined and interpretable trading decisions. Trading-R1 Terminal will be released at https://github.com/TauricResearch/Trading-R1.

Summary

  • The paper introduces a novel framework that integrates LLM-based reasoning with volatility-adjusted reinforcement learning for evidence-based trading recommendations.
  • It employs a unique three-stage curriculum combining supervised fine-tuning and RL to stabilize structured thesis generation and decision-making.
  • Experimental results show superior risk-adjusted returns with improved Sharpe ratios and hit rates compared to baseline models.

Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning

Introduction and Motivation

Trading-R1 addresses the challenge of aligning LLMs with the requirements of professional financial trading, emphasizing structured reasoning, interpretability, and risk-aware decision-making. Traditional time-series models lack explainability, and general-purpose LLMs struggle to produce disciplined, actionable trading recommendations. Trading-R1 is designed to bridge this gap by integrating financial domain knowledge, structured thesis generation, and volatility-adjusted reinforcement learning, enabling the model to generate evidence-based investment theses and executable trade decisions.

Data Collection and Labeling

A critical component of Trading-R1 is the Tauric-TR1-DB corpus, comprising 100k samples over 18 months, 14 equities, and five heterogeneous financial data sources (technical, fundamental, news, sentiment, macro). The data pipeline emphasizes breadth (diverse tickers and sectors), depth (multi-modal features per asset-day), and robustness (randomized input composition to simulate real-world data incompleteness). This ensures a high signal-to-noise ratio and generalizability across market regimes.
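The robustness criterion can be pictured as randomly masking data sources when composing each training prompt. The sketch below is illustrative only; the source names match the five categories listed above, but the function name and drop probability are assumptions, not details from the paper.

```python
import random

SOURCES = ["technical", "fundamental", "news", "sentiment", "macro"]

def compose_input(sample: dict, drop_prob: float = 0.3, seed: int = None) -> str:
    """Randomly drop data sources to mimic real-world data incompleteness.

    `sample` maps a source name to its text block for one asset-day.
    At least one source is always kept so the prompt is never empty.
    """
    rng = random.Random(seed)
    kept = [s for s in SOURCES if s in sample and rng.random() > drop_prob]
    if not kept:  # guarantee a non-empty context
        kept = [rng.choice([s for s in SOURCES if s in sample])]
    return "\n\n".join(f"## {s.upper()}\n{sample[s]}" for s in kept)
```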

Labels for supervised and RL training are generated via a multi-horizon, volatility-adjusted discretization procedure. Forward returns over 3, 7, and 15 days are normalized by rolling volatility, combined with empirically determined weights, and mapped to a five-class action space (Strong Buy, Buy, Hold, Sell, Strong Sell) using asymmetric quantiles. This approach captures both short-term momentum and medium-term trends, aligns with real-world trading practices, and provides a scalable reward signal for RL.
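To make the labeling procedure concrete, the following sketch applies the described recipe to a pandas price series. The horizon weights, volatility window, and quantile cut points are placeholders standing in for the paper's empirically determined values.

```python
import numpy as np
import pandas as pd

ACTIONS = ["Strong Sell", "Sell", "Hold", "Buy", "Strong Buy"]

def volatility_adjusted_labels(close: pd.Series,
                               horizons=(3, 7, 15),
                               weights=(0.5, 0.3, 0.2),              # illustrative weights
                               vol_window: int = 20,                  # illustrative window
                               quantiles=(0.10, 0.35, 0.70, 0.90)):   # illustrative asymmetric cuts
    """Map multi-horizon, volatility-normalized forward returns to a 5-class action label."""
    daily_ret = close.pct_change()
    vol = daily_ret.rolling(vol_window).std()

    score = pd.Series(0.0, index=close.index)
    for h, w in zip(horizons, weights):
        fwd_ret = close.shift(-h) / close - 1.0        # forward return over h days
        score += w * fwd_ret / (vol * np.sqrt(h))      # normalize by horizon-scaled volatility

    cuts = score.quantile(quantiles).values            # asymmetric quantile thresholds
    return score.apply(lambda s: ACTIONS[int(np.searchsorted(cuts, s))]
                       if pd.notna(s) else None)
```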

Training Methodology

Curriculum Design

Trading-R1 employs a three-stage, easy-to-hard curriculum that interleaves supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RFT):

  • Stage I (Structure): SFT on professional thesis organization, followed by RFT to reinforce XML-tagged formatting and systematic analysis.
  • Stage II (Claims): SFT for evidence-based reasoning, RFT to ground claims with direct citations and sources, mitigating hallucinations.
  • Stage III (Decision): SFT for investment recommendation patterns, RFT with volatility-aware outcome rewards to align decisions with market dynamics.

This staged progression stabilizes intermediate reasoning, mitigates error compounding, and builds the discipline required for coherent, actionable trading outputs (Figure 1).

Figure 1: Trading-R1 Training Schema illustrating the multi-stage curriculum integrating SFT and RFT for structured, evidence-based, and market-aligned reasoning.
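The staged schedule amounts to a simple training driver that alternates SFT and RFT per stage. This is a schematic sketch only: `run_sft` and `run_rft` stand in for whichever SFT and GRPO trainers are used, and the dataset and reward names paraphrase the stage descriptions above.

```python
# Schematic outline of the three-stage, easy-to-hard curriculum.
# Each stage warm-starts the policy with SFT, then refines it with RFT.
CURRICULUM = [
    ("structure", "sft_thesis_structure",   "reward_xml_format"),                   # Stage I
    ("claims",    "sft_evidence_grounding", "reward_citation_support"),             # Stage II
    ("decision",  "sft_recommendations",    "reward_volatility_adjusted_outcome"),  # Stage III
]

def train_trading_r1(policy, datasets, rewards, run_sft, run_rft):
    for stage, sft_data_key, reward_key in CURRICULUM:
        policy = run_sft(policy, datasets[sft_data_key])   # imitate distilled reasoning traces
        policy = run_rft(policy, rewards[reward_key])      # reinforce with the stage-specific reward
    return policy
```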

Reverse Reasoning Distillation

Obtaining high-quality reasoning traces for SFT is challenging due to the opacity of proprietary LLM APIs. Trading-R1 introduces reverse reasoning distillation: final recommendations from black-box models are paired with input data and passed to a planner LLM, which reconstructs plausible reasoning steps. These are further elaborated by a lightweight LLM and programmatically stitched into coherent traces, yielding a synthetic dataset suitable for SFT (Figure 2).

Figure 2: Investment Thesis Distillation from OpenAI Reasoning Models, demonstrating the extraction and reconstruction of reasoning traces for SFT targets.
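Conceptually, the pipeline works backward from a black-box model's final call to a plausible trace. The sketch below is a hedged outline; `query_blackbox`, `plan_reasoning`, and `elaborate_step` are hypothetical stand-ins for the proprietary-model, planner-LLM, and lightweight-LLM calls.

```python
def reverse_distill(inputs: str, query_blackbox, plan_reasoning, elaborate_step) -> dict:
    """Reconstruct a reasoning trace for SFT from a black-box recommendation.

    1. Query the proprietary model for its final recommendation only.
    2. Ask a planner LLM for the intermediate analysis steps that would
       plausibly connect the inputs to that recommendation.
    3. Expand each step with a lightweight LLM and stitch the pieces
       into a single coherent trace used as an SFT target.
    """
    recommendation = query_blackbox(inputs)                # e.g. "Buy"
    outline = plan_reasoning(inputs, recommendation)       # list of step descriptions
    trace = "\n\n".join(elaborate_step(inputs, step) for step in outline)
    return {"input": inputs, "reasoning": trace, "label": recommendation}
```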

Reinforcement Learning Optimization

RFT is performed using Group Relative Policy Optimization (GRPO), which stabilizes training by normalizing rewards within groups of sampled trajectories, eliminating the need for a separate value model. The reward integrates structure, evidence, and decision components, with an asymmetric penalty matrix reflecting institutional risk management priorities, such as heavier penalties for false bullish signals (Figure 3).

Figure 3: Reinforcement learning on Thesis Structure, Statement, and Decision, showing the integration of multi-component rewards in the RL pipeline.
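GRPO's key mechanic is scoring each sampled trajectory against the other trajectories drawn for the same prompt, which removes the need for a learned critic. A minimal sketch of the standard group-relative advantage computation (clipping and KL terms are omitted):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO advantages for one prompt.

    `rewards` holds the scalar reward of each of the G trajectories sampled
    for the same input; advantages are the within-group z-scores, so no
    separate value model is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled theses for one ticker-day, scored by the combined
# structure + evidence + decision reward.
adv = group_relative_advantages(np.array([0.8, -1.0, 0.2, -2.25]))
```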

Figure 4: Trading-R1 asymmetric reward heatmap: rewards (-2.25 to 1) based on model prediction vs ground truth, with labels derived from volatility-adjusted discretization.
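The asymmetric penalty matrix can be represented as a lookup over (predicted action, ground-truth label). The numbers below are illustrative placeholders chosen only to match the reported range (-2.25 to 1) and the stated bias against false bullish signals; they are not the paper's exact matrix.

```python
import numpy as np

ACTIONS = ["Strong Sell", "Sell", "Hold", "Buy", "Strong Buy"]

# Rows: model prediction; columns: volatility-adjusted ground-truth label.
# Illustrative values only; false bullish calls are punished harder than
# the symmetric mistake.
REWARD = np.array([
    #  SS     S      H     B      SB    <- ground truth
    [ 1.00,  0.50,  0.00, -1.00, -1.50],  # predicted Strong Sell
    [ 0.50,  1.00,  0.25, -0.50, -1.00],  # predicted Sell
    [ 0.00,  0.25,  1.00,  0.25,  0.00],  # predicted Hold
    [-1.50, -0.75,  0.25,  1.00,  0.50],  # predicted Buy
    [-2.25, -1.50,  0.00,  0.50,  1.00],  # predicted Strong Buy
])

def decision_reward(pred: str, truth: str) -> float:
    return float(REWARD[ACTIONS.index(pred), ACTIONS.index(truth)])

# A false bullish call costs more than the mirror-image false bearish call.
assert decision_reward("Strong Buy", "Strong Sell") < decision_reward("Strong Sell", "Strong Buy")
```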

Experimental Results

Trading-R1 is evaluated via historical backtesting on held-out periods for major equities and ETFs. Metrics include Cumulative Return (CR), Sharpe Ratio (SR), Hit Rate (HR), and Maximum Drawdown (MDD). Baselines span small LLMs (Qwen-4B, GPT-4.1-nano), large LLMs (GPT-4.1, LLaMA-3.3), RL-enhanced models (DeepSeek, O3-mini), and ablations of Trading-R1 (SFT-only, RL-only).
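The reported metrics follow from a backtest's per-period strategy returns. A standard-formula sketch, assuming daily returns and 252 trading periods per year (the paper's exact conventions may differ); hit rate is taken here as the share of profitable periods.

```python
import numpy as np

def backtest_metrics(returns: np.ndarray, periods_per_year: int = 252) -> dict:
    """Cumulative return, annualized Sharpe ratio, hit rate, and maximum
    drawdown from a series of per-period strategy returns."""
    equity = np.cumprod(1.0 + returns)
    cumulative_return = equity[-1] - 1.0
    sharpe = np.sqrt(periods_per_year) * returns.mean() / (returns.std() + 1e-12)
    hit_rate = (returns > 0).mean()
    running_peak = np.maximum.accumulate(equity)
    max_drawdown = ((equity - running_peak) / running_peak).min()
    return {"CR": cumulative_return, "SR": sharpe, "HR": hit_rate, "MDD": max_drawdown}
```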

Trading-R1 consistently outperforms baselines, achieving superior risk-adjusted returns and lower drawdowns. For example, on NVDA, Trading-R1 attains a Sharpe ratio of 2.72 and an 8.08% return, with a hit rate of 70.0%. On SPY, it achieves a Sharpe of 1.60 and a 3.34% return. Small LLMs and RL-enhanced reasoning models underperform due to limited parameter capacity and unguided reasoning, while general-purpose LLMs show better consistency but lack domain-specific alignment (Figure 5).

Figure 5: Sharpe Ratio Heatmap comparing Trading-R1 against baselines across multiple assets, highlighting consistent improvements in risk-adjusted performance.

The staged SFT-RFT curriculum is essential: SFT enforces professional output formats and decision patterns, while RFT aligns reasoning with market outcomes. Pure RL or SFT-only variants are less effective, either drifting from financial context or overfitting to superficial heuristics.

Implementation Considerations

  • Model Architecture: Trading-R1 uses Qwen3-4B as the backbone, enabling deployment on standard commercial GPUs (8×H100/H200) and supporting long-context inputs (20–30k tokens).
  • Data Pipeline: Modular, transparent, and reproducible, facilitating adaptation to proprietary datasets and institutional requirements.
  • Reward Design: Three-stage reward system balances structure, evidence, and decision accuracy, with tunable weights for application-specific priorities.
  • Deployment: Local and private inference is feasible, supporting sensitive data processing and customizable policies (e.g., sector-specific long/short ratios, trading frequency).
  • Limitations: Hallucinations persist in long/noisy contexts; excessive RL may erode structured reasoning; training universe is biased toward large-cap, AI-driven equities; not suitable for high-frequency or fully automated trading without human oversight.

Practical and Theoretical Implications

Trading-R1 demonstrates that LLMs, when properly aligned via staged curriculum and volatility-aware RL, can generate interpretable, evidence-based investment theses and actionable trade recommendations. The framework is particularly suited for research support, structured analysis generation, and institutional applications (data vendors, sell-side/buy-side research). It enables scalable, customizable, and private deployment, augmenting human decision-making in high-throughput scenarios.

Theoretically, Trading-R1 highlights the necessity of disentangling structural and outcome rewards, sequencing reasoning scaffolds before market alignment, and leveraging synthetic reasoning traces for supervision. The staged curriculum mitigates instability and brittleness observed in prior approaches, suggesting a generalizable paradigm for aligning LLMs with complex, risk-sensitive domains.

Future Directions

  • Real-time deployment: Enhancing inference speed and sample efficiency for live trading support.
  • Offline RL variants: Improving sample efficiency and robustness in low-data regimes.
  • Expanded modalities: Integrating alternative data sources (e.g., social media, alternative asset classes) for broader domain adaptability.
  • Customization: Enabling fine-grained control over thesis structure, decision policies, and risk preferences for institutional clients.

Conclusion

Trading-R1 establishes a robust framework for financial trading with LLM reasoning, integrating structured thesis generation, volatility-aware RL, and modular data pipelines. It achieves superior risk-adjusted returns and interpretability compared to baseline models, supporting practical applications in research, data processing, and decision support. The staged curriculum and reward design offer a blueprint for aligning LLMs with complex, high-stakes domains, with future work focused on real-time deployment, expanded data integration, and enhanced customization for institutional use.
