Can LLM-based Financial Investing Strategies Outperform the Market in Long Run? (2505.07078v3)

Published 11 May 2025 in q-fin.TR, cs.AI, and cs.CE

Abstract: LLMs have recently been leveraged for asset pricing tasks and stock trading applications, enabling AI agents to generate investment decisions from unstructured financial data. However, most evaluations of LLM timing-based investing strategies are conducted on narrow timeframes and limited stock universes, overstating effectiveness due to survivorship and data-snooping biases. We critically assess their generalizability and robustness by proposing FINSABER, a backtesting framework evaluating timing-based strategies across longer periods and a larger universe of symbols. Systematic backtests over two decades and 100+ symbols reveal that previously reported LLM advantages deteriorate significantly under broader cross-section and over a longer-term evaluation. Our market regime analysis further demonstrates that LLM strategies are overly conservative in bull markets, underperforming passive benchmarks, and overly aggressive in bear markets, incurring heavy losses. These findings highlight the need to develop LLM strategies that are able to prioritise trend detection and regime-aware risk controls over mere scaling of framework complexity.


Summary

The paper presents FINSABER, a backtesting framework that evaluates LLM-based strategies and reveals overstated performance due to survivorship and data-snooping biases.
It uses a modular approach integrating rule-based, machine learning, deep learning, and reinforcement learning methods to assess both timing and selection strategies.
Experimental results show traditional risk-aware approaches outperform LLM methods across diverse market regimes, challenging the assumed superiority of complex models.

Can LLM-based Financial Investing Strategies Outperform the Market in the Long Run?

Introduction

LLMs appear promising for financial decision-making and investment strategy design, particularly in generating buy, hold, or sell trading actions based on financial data and sentiment analysis. However, the narrow evaluation timeframes used in recent studies may give an overly optimistic view of their effectiveness. This paper presents FINSABER, a robust backtesting framework designed to evaluate the efficacy of LLM-based timing strategies across an extensive evaluation period and a broader symbol universe. It reveals that the perceived advantages of LLMs are often overstated due to survivorship and data-snooping biases.

Framework and Methodology

FINSABER Framework

FINSABER (Financial INvestment Strategy Assessment with Bias mitigation, Expanded time, and Range of symbols) addresses several evaluation biases associated with LLM-based investing research. It incorporates a three-module structure: multi-source data ingestion, a modular strategies base, and a bias-aware two-step backtesting pipeline.

Figure 1: Overview of the FINSABER Backtest Framework. The central pipeline illustrates the backtesting process. The framework includes a Strategies Base Module (green), which covers both selection-based and timing-based strategies, and a Multi-source Data Module (yellow), integrating diverse financial data inputs.
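One way to see how a backtest can mitigate survivorship bias is to build the symbol universe "as of" each evaluation date, so that stocks delisted later still count while they were trading. The sketch below is illustrative only: the Listing record, the universe_on helper, and the symbols in the usage note are hypothetical and are not taken from the FINSABER code base.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Listing:
    """Hypothetical record of when a symbol traded."""
    symbol: str
    listed: date
    delisted: Optional[date] = None  # None means the symbol is still trading


def universe_on(listings: Iterable[Listing], as_of: date) -> List[str]:
    """Return the symbols that were tradable on `as_of`.

    A symbol is included if it was already listed and not yet delisted,
    regardless of whether it survives to the present day.
    """
    return sorted(
        item.symbol
        for item in listings
        if item.listed <= as_of
        and (item.delisted is None or as_of < item.delisted)
    )
```

For example, a made-up symbol delisted in 2010 would appear in the universe for a 2005 evaluation date but not for a 2020 one. A backtest that only uses today's constituents would silently drop it from both.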

Strategies Module

The framework accommodates both timing and selection-based strategies, integrating conventional rule-based, ML, deep learning (DL), and reinforcement learning (RL) methodologies. This modularity ensures comprehensive evaluation against various benchmarks, facilitating the adoption of custom strategies within the system.
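A modular strategies base of this kind can be pictured as a shared interface that every timing strategy implements, so that a passive benchmark, a rule-based method, and an LLM agent can all be run through the same loop. The following is a minimal sketch under stated assumptions: the class names, the signal method, and the backtest function are hypothetical, and transaction costs are ignored for brevity.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class TimingStrategy(ABC):
    """Hypothetical interface: maps the price history so far to a position."""

    @abstractmethod
    def signal(self, prices: Sequence[float]) -> int:
        """Return 1 to hold the asset for the next period, 0 to hold cash."""


class BuyAndHold(TimingStrategy):
    """Passive benchmark: always invested."""

    def signal(self, prices: Sequence[float]) -> int:
        return 1


class SmaCrossover(TimingStrategy):
    """Rule-based example: invested when the fast average is above the slow one."""

    def __init__(self, fast: int = 3, slow: int = 5) -> None:
        self.fast, self.slow = fast, slow

    def signal(self, prices: Sequence[float]) -> int:
        if len(prices) < self.slow:
            return 0  # not enough history yet, stay in cash
        fast_avg = sum(prices[-self.fast:]) / self.fast
        slow_avg = sum(prices[-self.slow:]) / self.slow
        return 1 if fast_avg > slow_avg else 0


def backtest(strategy: TimingStrategy, prices: Sequence[float]) -> float:
    """Return final equity from a start of 1.0.

    The decision at step t sees prices up to and including t and is applied
    to the return from t to t + 1, so no future data leaks into the signal.
    """
    equity = 1.0
    for t in range(len(prices) - 1):
        position = strategy.signal(prices[: t + 1])
        equity *= 1.0 + position * (prices[t + 1] / prices[t] - 1.0)
    return equity
```

Under this design, adding a custom strategy only requires a new subclass with a signal method; the evaluation loop stays unchanged.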

Experimental Results

Selective Evaluation

The study first replicates the selective setups typical of prior work, which showcased LLM strategy effectiveness on a few symbols such as TSLA and MSFT, and finds that the reported results become unstable once the evaluation period is extended. While LLM investors displayed certain strengths in narrow windows, they generally fell short under more rigorous assessments.

Broad and Long-term Evaluation

Under the Composite setup, in which symbols are drawn from a varied, bias-aware stock universe, the study finds that LLM-based methods fail to sustain alpha generation over prolonged periods. Traditional methodologies, often considered obsolete, consistently outperformed LLM strategies in both return and risk-adjusted metrics.
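For readers unfamiliar with risk-adjusted metrics, the sketch below computes two common ones from a series of periodic returns: the annualised Sharpe ratio, which the summary's Figure 2 reports, and maximum drawdown, which is included here as an assumption about what a typical comparison would use. The function names, the 252-period annualisation, and the zero default risk-free rate are illustrative choices, not the paper's stated configuration.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def sharpe_ratio(
    returns: Sequence[float],
    risk_free: float = 0.0,        # annual risk-free rate (assumed 0 by default)
    periods_per_year: int = 252,   # assumed daily data
) -> float:
    """Annualised Sharpe ratio: mean excess return over its sample std dev."""
    excess = [r - risk_free / periods_per_year for r in returns]
    spread = stdev(excess)  # requires at least two observations
    if spread == 0:
        return 0.0  # convention chosen here for a constant return series
    return mean(excess) / spread * math.sqrt(periods_per_year)


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the equity curve, as a negative fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst
```

Two strategies with the same total return can differ sharply on these measures, which is why a comparison on raw return alone can flatter a strategy that takes large interim losses.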

Figure 2: Average Sharpe ratio by regime for all benchmarking strategies (green = strong).

Market Regime Analysis

The analysis segmented market conditions into bull, bear, and sideways regimes, highlighting that LLM strategies are prone to excessive conservatism during bull markets and inadequately controlled aggression during downturns, undermining their practicality across cycles.
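A regime breakdown of this kind can be sketched by labelling each year from the market's annual return and then averaging a strategy's per-year metric within each label. The thresholds below (plus or minus 20%) and the function names are assumptions made for illustration; the summary does not state how the paper defines its regimes.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Hashable


def label_regime(
    annual_market_return: float,
    bull: float = 0.20,    # assumed threshold, not taken from the paper
    bear: float = -0.20,   # assumed threshold, not taken from the paper
) -> str:
    """Classify one year as 'bull', 'bear', or 'sideways'."""
    if annual_market_return >= bull:
        return "bull"
    if annual_market_return <= bear:
        return "bear"
    return "sideways"


def average_by_regime(
    market_returns: Dict[Hashable, float],
    strategy_metric: Dict[Hashable, float],
) -> Dict[str, float]:
    """Average a per-year strategy metric (e.g. Sharpe ratio) within each regime.

    Years missing from `strategy_metric` are skipped rather than imputed.
    """
    buckets = defaultdict(list)
    for year, market_return in market_returns.items():
        if year in strategy_metric:
            buckets[label_regime(market_return)].append(strategy_metric[year])
    return {regime: mean(values) for regime, values in buckets.items()}
```

Comparing the resulting averages for an LLM strategy against those of a passive benchmark is one way to surface the pattern described above: too little exposure in bull years and too much in bear years.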

Strategy Implications

The findings suggest two main focal points for future improvements: enhancing the trend detection capabilities of LLM strategies to match or surpass passive drifts during favorable market conditions, and embedding sensitivity to market regimes within their decision-making frameworks to improve risk management.

Conclusion

FINSABER offers a comprehensive framework for evaluating LLM-based financial strategies, illuminating consistent shortfalls in existing LLM approaches when scrutinized in broad, unbiased contexts. The inability of current large-scale LLMs to outperform simpler, risk-aware methodologies challenges the notion that model complexity correlates with practical investing competence. This study advocates for a strategic shift towards developing domain-aware, adaptive LLM-based financial strategies robust to diverse market environments. Future research should focus on financially efficient models to reduce the high computational costs associated with current large-scale LLM implementations.