Trading-R1: Structured Financial Reasoning
- Trading-R1 is a large-scale financial reasoning model that integrates structured thesis composition and volatility-aware trading to yield transparent, risk-adjusted decisions.
- It employs a multi-stage curriculum combining supervised fine-tuning with reinforcement learning, ensuring that decision-making aligns progressively with market realities.
- Trained on diversified financial data and evaluated via comprehensive backtesting metrics, Trading-R1 delivers improved cumulative returns and reduced drawdowns.
Trading-R1 is a large-scale financial reasoning model explicitly designed to bridge the gap between natural language financial analysis and disciplined, risk-adjusted, interpretable trading decisions. Built on a Qwen3-4B backbone, Trading-R1 integrates structured thesis composition, evidence-grounded claims, and volatility-adjusted trading decisions within a unified framework, combining supervised fine-tuning with reinforcement learning via a multi-stage curriculum. The model is trained and evaluated on the Tauric-TR1-DB corpus, which comprises 100,000 samples spanning 18 months, 14 equities, and five heterogeneous financial data sources. Trading-R1 achieves improved risk-adjusted returns and lower drawdown than both general instruction-following LLMs and reasoning models, while maintaining high standards for decision transparency and explainability.
## 1. Model Structure and Reasoning Scaffold
Trading-R1’s architecture fuses an autoregressive LLM backbone (Qwen3-4B) with a structured reasoning scaffold that mirrors the workflow of professional financial analysts. The model employs a multi-stage, easy-to-hard curriculum:
- Stage I (Structure): Model outputs are formatted with explicit XML-style tags (e.g., `<think>`, `<fundamentals>`, `<technical>`, `<conclusion>`), enforcing modularity and logical sequence in reasoning.
- Stage II (Claims): Each investment thesis section is populated with quotes, references, and sources in an “opinion-quote-source” format, requiring the model to ground every claim in observable data, such as analyst notes, earnings, technical signals, or macroeconomic indicators.
- Stage III (Decision): The structured thesis is mapped to a discrete trading action (Strong Sell, Sell, Hold, Buy, or Strong Buy) determined via volatility-driven discretization, so that action labels reflect both the model’s directional conviction and the prevailing market risk regime.

A critical feature is the use of reverse reasoning distillation: even if only a model’s final recommendation is available, intermediate steps are reconstructed using a planning LLM and incorporated into the training process. This scaffold is central to both effective supervised training and RL fine-tuning.

## 2. Training Procedure and Curriculum

Training proceeds in a two-phase pipeline: supervised fine-tuning (SFT) followed by reinforcement learning (RL) fine-tuning, both orchestrated via an easy-to-hard curriculum.

- SFT: The SFT phase utilizes high-quality, human-like investment theses obtained by reverse reasoning distillation from black-box LLMs. Each stage of the curriculum presents progressively harder reasoning tasks (structure → claim substantiation → market-aligned decision).
- RL Fine-tuning: After SFT, Trading-R1 undergoes RL fine-tuning using Group Relative Policy Optimization (GRPO), a PPO variant. At each iteration, multiple candidate outputs for each prompt are sampled, and group-relative advantage estimates are computed. The optimization objective, written here in the standard GRPO form, is:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\left( r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\left(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{i,t} \right) \right] - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)
$$

where $r_{i,t}(\theta)$ is the likelihood ratio for candidate $i$ at step $t$, $\hat{A}_{i,t}$ is the group-relative advantage, and $\pi_{\text{ref}}$ is the SFT reference model. Here $G$ denotes the number of sampled candidates per prompt, $\epsilon$ the clipping range, and $\beta$ the weight of the KL penalty.
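The group-relative advantage and clipped update described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it works at the sequence level rather than per token, omits the KL penalty toward the reference model, and the function names, the clipping range of 0.2, and the example rewards are all assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled candidates.

    Each candidate's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, averaged over candidates (to be maximized).

    logp_new / logp_old are sequence log-probabilities under the current
    and the sampling policy; their difference gives the likelihood ratio.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


# Hypothetical scalar rewards for four theses sampled from the same prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
objective = clipped_surrogate(
    logp_new=[-12.0, -11.5, -12.2, -12.1],
    logp_old=[-12.0, -12.0, -12.0, -12.0],
    advantages=advantages,
)
```

Because advantages are standardized within each group, no separate value network is needed: a candidate is rewarded only for being better than its siblings sampled from the same prompt.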
The reward is a weighted sum of terms for format compliance, evidence quality, and alignment with market outcome (with decisions binned adaptively according to volatility).

The curriculum promotes training stability by progressively increasing task difficulty, so that market-aligned decision rules are formed only after disciplined, evidence-based reasoning is demonstrated.

## 3. Multi-Source Data Utilization

The Tauric-TR1-DB corpus underpins Trading-R1’s capabilities. It integrates:

- Technical market data: Price series with technical indicators (RSI, MACD, moving averages, Bollinger bands, etc.), capturing both trend and mean reversion signals.
- Company fundamentals: Comprehensive financials from filings, covering income, margin, cash flow, and balance sheet analysis.
- Macroeconomic data: CPI, GDP, Fed funds rate, and related macro variables, acknowledging top-down drivers of asset returns.
- Structured and unstructured news: News is time-segmented over short, medium, and long windows to ensure recency-aware reasoning; sentiment is aggregated from insiders, analyst ratings, and options activity.
- Additional structured data: Insider transactions, analyst forecasts, and broader sentiment proxies are included.

All data is time-aligned, cleaned, and preprocessed using aggressive compaction to fit token/sequence budgets, and is used both for thesis composition and for grounding model outputs.

## 4. Evaluation Methodology and Quantitative Results

Trading-R1 is evaluated in a backtesting framework covering six major equities and ETFs, using metrics standard in academic finance:

- Cumulative Return (CR): Overall capital appreciation.
- Sharpe Ratio (SR): Risk-adjusted excess return.
- Hit Rate (HR): Proportion of correct directional predictions.
- Maximum Drawdown (MDD): Largest observed decline from peak, reflecting downside risk.
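The four metrics above can be sketched in plain Python as follows. This uses common textbook definitions, not the paper's evaluation code: the inputs are simple per-period returns, the annualization factor of 252 trading days and the zero risk-free rate are assumptions, and a hold signal (direction 0) is counted as a miss in the hit rate.

```python
import math


def cumulative_return(returns):
    """Compound simple per-period returns into total capital appreciation."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    mu = sum(excess) / len(excess)
    var = sum((x - mu) ** 2 for x in excess) / (len(excess) - 1)
    if var == 0.0:
        return 0.0  # undefined for constant returns; reported as 0 here
    return mu / math.sqrt(var) * math.sqrt(periods_per_year)


def hit_rate(predicted_directions, realized_returns):
    """Share of periods where the predicted sign (+1/-1) matches the realized sign."""
    hits = sum(
        1 for p, r in zip(predicted_directions, realized_returns) if p * r > 0
    )
    return hits / len(realized_returns)


def max_drawdown(returns):
    """Largest peak-to-trough decline of the equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


# Hypothetical three-period strategy returns and directional calls.
strategy_returns = [0.10, -0.50, 0.20]
calls = [1, -1, -1]
summary = {
    "CR": cumulative_return(strategy_returns),
    "SR": sharpe_ratio(strategy_returns),
    "HR": hit_rate(calls, strategy_returns),
    "MDD": max_drawdown(strategy_returns),
}
```

Reporting drawdown alongside the Sharpe ratio matters because two strategies with similar risk-adjusted returns can differ sharply in their worst peak-to-trough loss.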

Trading-R1 delivers superior results on both risk-adjusted return and drawdown compared to instruction-following LLMs and prior “reasoning-LMs.” The model outperforms both open-source and proprietary black-box models in terms of cumulative return and Sharpe ratio, while also sharply reducing maximum drawdown. The experimental hierarchy (SLM < RLM < general LLM < Trading-SFT ≈ Trading-RFT < Trading-R1) shows consistent incremental gain from the curriculum and the RL fine-tuning paradigm.

## 5. Interpretability and Structured Output

Interpretability is central to Trading-R1’s design. Structured output is enforced:

- XML-Tagged Reasoning: Each investment thesis is output with tags (`<think>`, `<fundamentals>`, `<technical>`, `<conclusion>`), making the logical flow and supporting evidence explicit.
- Cited Evidence: Each claim references a specific data source (analyst comment, news snippet, filing, or technical event), facilitating auditability and fact-checking.
- Verifiable Decision Chain: The model’s trading recommendation is always transparently connected to its supporting analysis, with volatility-adjusted rationale.

This approach provides an “audit trail” matching professional analyst workflows, supporting both institutional adoption and regulatory compliance needs.

## 6. Practical Applications

Trading-R1’s technology is suited for a spectrum of high-value financial applications:

- Research and Data Processing: Automated production of structured research notes, market digests, or in-depth due diligence reports.
- Buy-Side Decision Support: Integration into portfolio management pipelines to provide reasoned trade recommendations or thesis vetting on high-throughput asset universes.
- Local/Private Deployment: The model’s compact (4B) parameter count enables private cloud or on-premises deployment for hedge funds, banks, or asset managers requiring data privacy.
- Customizable Risk/Action Protocols: The volatility-driven decision binning and adjustable reward structure allow tailoring to firm-specific risk, execution, or asset class needs.

## 7. Research Directions and Model Extension

Trading-R1 opens several avenues for further research and refinement:

- Real-Time Adaptation: Extension to process live data streams and generate intraday or event-driven as well as end-of-day recommendations.
- Sample-Efficient RL: Improvement of offline RL techniques for more rapid adaptation to changing market environments and rare events.
- Expanded Modalities: Inclusion of alternative data sources (social media, geospatial, credit card tracking), either as supporting evidence or direct inputs.
- Reward Engineering: Continued refinement of the multi-part reward to better balance structure, evidence grounding, and risk-sensitive trading outcomes.
- Advanced Interpretability: Enhanced citation and evidence distinction (e.g., distinguishing empirical observation from theoretical opinion), further increasing auditability.

---

Trading-R1 represents an integrated advance in financial language modeling, bringing together structured reasoning, evidence-based thesis generation, and volatility-aware decision-making within a reinforcement learning framework. Its curriculum-based development and emphasis on output transparency position it for rigorous institutional use, while its modularity and extensibility make it a research platform for the future of interpretable AI-driven trading (Xiao et al., 14 Sep 2025).