- The paper presents FINSABER, a backtesting framework for LLM-based investment strategies, and shows that their previously reported performance is overstated because of survivorship and data-snooping biases.
- It uses a modular approach integrating rule-based, machine learning, deep learning, and reinforcement learning methods to assess both timing and selection strategies.
- Experimental results show traditional risk-aware approaches outperform LLM methods across diverse market regimes, challenging the assumed superiority of complex models.
Introduction
LLMs appear promising for financial decision-making and investment strategy design, particularly for generating buy, hold, or sell actions from financial data and sentiment analysis. However, the narrow evaluation timeframes used in recent studies may give an overly optimistic view of their effectiveness. This paper presents FINSABER, a robust backtesting framework designed to evaluate LLM-based timing strategies over a much longer period and a broader symbol universe. The evaluation shows that the perceived advantages of LLMs are often overstated due to survivorship and data-snooping biases.
Framework and Methodology
FINSABER Framework
FINSABER (Financial INvestment Strategy Assessment with Bias mitigation, Expanded time, and Range of symbols) addresses several evaluation biases associated with LLM-based investing research. It incorporates a three-module structure: multi-source data ingestion, a modular strategies base, and a bias-aware two-step backtesting pipeline.
Figure 1: Overview of the FINSABER Backtest Framework. The central pipeline illustrates the backtesting process. The framework includes a Strategies Base Module (green), which covers both selection-based and timing-based strategies, and a Multi-source Data Module (yellow), integrating diverse financial data inputs.
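The two-step idea can be sketched as a rolling-window loop: at the start of each window a selection step picks symbols using only data available at that date, and a timing step then trades each selected symbol inside the window. This is a minimal sketch under assumed data structures (yearly closes per symbol, sorted by year) and hypothetical names such as `two_step_backtest`, `has_prior_data`, and `buy_and_hold`; it is not FINSABER's actual API.

```python
from statistics import mean
from typing import Callable, Dict, Iterator, List, Tuple

# Assumed layout: symbol -> list of (year, close) observations, sorted by year.
History = Dict[str, List[Tuple[int, float]]]
# Step 1 (hypothetical signature): choose symbols from data before `start`.
Selector = Callable[[History, int], List[str]]
# Step 2 (hypothetical signature): map in-window closes to a total return.
Timing = Callable[[List[float]], float]


def rolling_windows(first: int, last: int, length: int, step: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) year ranges that fit inside [first, last]."""
    start = first
    while start + length - 1 <= last:
        yield start, start + length - 1
        start += step


def truncate(history: History, before: int) -> History:
    """Drop observations at or after `before` so selection cannot look ahead."""
    return {s: [(y, p) for y, p in obs if y < before] for s, obs in history.items()}


def two_step_backtest(history: History, selector: Selector, timing: Timing,
                      first: int, last: int, length: int) -> Dict[Tuple[int, int], float]:
    """Average timing-strategy return per window over the selected symbols."""
    results: Dict[Tuple[int, int], float] = {}
    for start, end in rolling_windows(first, last, length):
        universe = selector(truncate(history, start), start)
        window_returns = []
        for sym in universe:
            # Symbols delisted mid-window still count if they traded at least twice.
            closes = [p for y, p in history.get(sym, []) if start <= y <= end]
            if len(closes) >= 2:
                window_returns.append(timing(closes))
        results[(start, end)] = mean(window_returns) if window_returns else 0.0
    return results


def has_prior_data(h: History, start: int) -> List[str]:
    """Example selector: every symbol with at least one observation before `start`."""
    return [s for s, obs in h.items() if obs]


def buy_and_hold(closes: List[float]) -> float:
    """Example timing strategy: passive total return over the window."""
    return closes[-1] / closes[0] - 1
```

Evaluating every rolling window, rather than one hand-picked period, is what exposes strategies whose performance depends on the chosen dates.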
Strategies Module
The framework accommodates both timing and selection-based strategies, covering conventional rule-based, machine learning (ML), deep learning (DL), and reinforcement learning (RL) methods. Because strategies share a common module, LLM approaches can be benchmarked against a wide range of baselines, and custom strategies can be plugged into the system with little effort.
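A modular strategy base is commonly expressed as a pair of abstract interfaces, one for timing and one for selection, that every method (rule-based, ML, DL, RL, or LLM) implements. The sketch below assumes that design; the class names, method signatures, and the two example strategies are hypothetical illustrations, not FINSABER's code.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class TimingStrategy(ABC):
    """Decides buy/hold/sell for a single symbol from its price history."""

    @abstractmethod
    def decide(self, closes: List[float]) -> str: ...


class SelectionStrategy(ABC):
    """Chooses which symbols to trade from a candidate universe."""

    @abstractmethod
    def select(self, closes_by_symbol: Dict[str, List[float]], k: int) -> List[str]: ...


class SMACrossover(TimingStrategy):
    """Rule-based baseline: long while the short SMA is above the long SMA."""

    def __init__(self, short: int = 5, long: int = 20):
        if short >= long:
            raise ValueError("short window must be shorter than long window")
        self.short, self.long = short, long

    def decide(self, closes: List[float]) -> str:
        if len(closes) < self.long:
            return "hold"  # not enough history to compute the long average
        short_ma = sum(closes[-self.short:]) / self.short
        long_ma = sum(closes[-self.long:]) / self.long
        return "buy" if short_ma > long_ma else "sell"


class MomentumSelection(SelectionStrategy):
    """Picks the k symbols with the highest trailing return."""

    def select(self, closes_by_symbol: Dict[str, List[float]], k: int) -> List[str]:
        def trailing_return(closes: List[float]) -> float:
            return closes[-1] / closes[0] - 1 if len(closes) >= 2 else float("-inf")

        ranked = sorted(closes_by_symbol,
                        key=lambda s: trailing_return(closes_by_symbol[s]),
                        reverse=True)
        return ranked[:k]
```

Under this design an LLM-based strategy would simply be another `TimingStrategy` subclass whose `decide` method wraps a model call, so it is evaluated by exactly the same pipeline as the baselines.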
Experimental Results
Selective Evaluation
The authors first replicated the selective setups typical of prior work, which showcased LLM strategies on a handful of hand-picked symbols such as TSLA and MSFT. Extending the evaluation period in these setups exposed unstable performance: LLM investors showed some strengths in narrow windows but generally fell short under more rigorous assessment.
Broad and Long-term Evaluation
In the Composite setup, strategies are evaluated on a diverse, bias-aware stock universe rather than a few hand-picked symbols. Under this setup, LLM-based methods fail to sustain alpha generation over prolonged periods, while traditional methods, often considered obsolete, consistently outperform them on both raw returns and risk-adjusted metrics.
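A bias-aware universe avoids survivorship bias by selecting from the symbols that were actually listed at each decision date, including those later delisted. The summary does not describe how the Composite universe is built, so the sketch below only illustrates the general point-in-time idea; the `Membership` record and both functions are hypothetical.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Membership(NamedTuple):
    symbol: str
    added: date
    removed: Optional[date]  # None means the symbol is still a member today


def universe_as_of(records: List[Membership], as_of: date) -> List[str]:
    """Point-in-time universe: every symbol that was a member on `as_of`,
    including ones that were later delisted or dropped."""
    return sorted(r.symbol for r in records
                  if r.added <= as_of and (r.removed is None or r.removed > as_of))


def survivors_only(records: List[Membership]) -> List[str]:
    """Biased universe: only symbols still present today (the pattern to avoid)."""
    return sorted(r.symbol for r in records if r.removed is None)
```

Backtesting 2005 on `survivors_only` would silently drop a symbol that failed in 2008 and include one that was listed only in 2015, both of which inflate historical returns.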
Figure 2: Average Sharpe ratio by regime for all benchmarked strategies (green = strong).
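The Sharpe ratio summarized in Figure 2 is the mean excess return divided by its standard deviation, conventionally annualized as SR = sqrt(252) * mean(r - r_f) / std(r - r_f) for daily returns. The sketch below assumes daily returns, a constant per-period risk-free rate, and the sample standard deviation; the paper's exact conventions are not stated here, and `average_sharpe_by_regime` is a hypothetical helper showing how per-regime averages like those in Figure 2 could be aggregated.

```python
import math
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Tuple


def sharpe_ratio(returns: List[float], risk_free: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of per-period returns (sample standard deviation)."""
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    excess = [r - risk_free for r in returns]
    sd = stdev(excess)
    if sd == 0:
        raise ValueError("zero volatility: Sharpe ratio is undefined")
    return math.sqrt(periods_per_year) * mean(excess) / sd


def average_sharpe_by_regime(runs: List[Tuple[str, List[float]]]) -> Dict[str, float]:
    """Average the per-run Sharpe ratios within each regime label."""
    by_regime: Dict[str, List[float]] = defaultdict(list)
    for regime, returns in runs:
        by_regime[regime].append(sharpe_ratio(returns))
    return {regime: mean(values) for regime, values in by_regime.items()}
```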
Market Regime Analysis
The analysis segmented market conditions into bull, bear, and sideways regimes, highlighting that LLM strategies are prone to excessive conservatism during bull markets and inadequately controlled aggression during downturns, undermining their practicality across cycles.
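One simple way to segment periods into bull, bear, and sideways regimes is to threshold each period's total market return. The summary does not give the paper's actual rule, so the +/-20% thresholds and the per-year grouping below are assumptions chosen for illustration, and both function names are hypothetical.

```python
from typing import Dict, List


def classify_regime(closes: List[float], bull: float = 0.20, bear: float = -0.20) -> str:
    """Label a period by its total return: at or above `bull` is bull,
    at or below `bear` is bear, anything in between is sideways."""
    if len(closes) < 2 or closes[0] <= 0:
        raise ValueError("need at least two positive prices")
    total = closes[-1] / closes[0] - 1
    if total >= bull:
        return "bull"
    if total <= bear:
        return "bear"
    return "sideways"


def label_years(closes_by_year: Dict[int, List[float]]) -> Dict[int, str]:
    """Apply `classify_regime` to each year's market closes, in year order."""
    return {year: classify_regime(closes) for year, closes in sorted(closes_by_year.items())}
```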
Strategy Implications
The findings point to two priorities for future work: improving the trend detection of LLM strategies so that they capture at least the passive upward drift that buy-and-hold earns in favorable markets, and embedding sensitivity to market regimes in their decision-making to improve risk management.
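One hypothetical way to embed regime sensitivity is to wrap whatever position a strategy proposes in regime-dependent bounds: a positive floor in bull markets counters excessive conservatism, and tight bounds in bear markets limit uncontrolled aggression. This is not a method from the paper; the bounds, names, and fallback rule are illustrative assumptions.

```python
from typing import Dict, Tuple

# Hypothetical (min, max) exposure per regime, as a fraction of capital.
# Values are illustrative only, not tuned or taken from the paper.
REGIME_BOUNDS: Dict[str, Tuple[float, float]] = {
    "bull": (0.5, 1.0),       # floor forces participation in the upward drift
    "sideways": (-0.5, 0.5),
    "bear": (-0.25, 0.25),    # tight bounds limit aggressive bets in downturns
}


def regime_aware_position(proposed: float, regime: str,
                          bounds: Dict[str, Tuple[float, float]] = REGIME_BOUNDS) -> float:
    """Clip a strategy's proposed position into the bounds for the current regime.
    Unknown regimes fall back to the 'bear' bounds, the most conservative ones here."""
    low, high = bounds.get(regime, bounds["bear"])
    return max(low, min(high, proposed))
```

Keeping the constraint outside the model means the same guardrail applies regardless of which strategy proposes the trade.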
Conclusion
FINSABER offers a comprehensive framework for evaluating LLM-based financial strategies and reveals consistent shortfalls in existing LLM approaches when they are scrutinized over broad, bias-aware settings. The inability of current large-scale LLMs to outperform simpler, risk-aware methods challenges the notion that model complexity correlates with practical investing competence. The study advocates a shift towards domain-aware, adaptive LLM-based financial strategies that remain robust across diverse market environments. Future research should also focus on more cost-efficient models to reduce the high computational costs of current large-scale LLM implementations.