- The paper presents an innovative AutoML framework using evolutionary algorithms to discover novel trading alphas that combine the strengths of formulaic and ML methods.
- It employs a sequence-based representation with specialized operators and pruning techniques to efficiently generate weakly correlated, data-driven trading signals.
- Experimental results on NASDAQ data show that AlphaEvolve achieves higher Sharpe ratios and Information Coefficients compared to traditional genetic and complex ML models.
This paper, "AlphaEvolve: A Learning Framework to Discover Novel Alphas in Quantitative Investment" (2103.16196), introduces a novel AutoML-based framework called AlphaEvolve designed to discover new, effective trading signals, known as "alphas," for quantitative investment. The goal is to find alphas that not only predict stock returns accurately but can also be combined into a portfolio with weakly correlated returns, a key requirement for risk diversification in hedge funds.
The Problem: The paper highlights the challenge of mining effective alphas in quantitative investment. Traditional approaches fall into two main categories:
- Formulaic Alphas: These are simple algebraic expressions of scalar features. They generalize well and are easy to combine into weakly correlated sets, but their predictive power is limited as they only use simple, short-term features.
- Machine Learning (ML) Alphas: These are complex, data-driven models (like neural networks) that utilize high-dimensional vector and matrix features, often incorporating long-term data. They are more predictive but are too complex to easily analyze for correlations and difficult to combine into a weakly correlated set. Existing ML alpha methods often rely on strong structural assumptions, like sector-based correlations, which may not hold in volatile markets.
Furthermore, applying general AutoML frameworks like AutoML-Zero (2003.03384) to alpha mining is computationally expensive due to the large search space and the complexity of discovering deep learning architectures from scratch. Standard AutoML also treats tasks (stocks) independently, failing to leverage relationships between stocks.
The Proposed Solution: AlphaEvolve
AlphaEvolve proposes a new class of alphas and an evolutionary algorithm to discover them. The new alphas aim to combine the strengths of formulaic and ML alphas: they can model scalar, vector, and matrix features, are data-driven using long-term features, yet are structured in a way that facilitates mining into weakly correlated sets.
An alpha in AlphaEvolve is represented as a sequence of operations, each consisting of an operator (OP), input operand(s), and an output operand. Operands can be scalars (s), vectors (v), or matrices (m). Special operands include the input feature matrix (m0), output label (s0), and prediction (s1). Each alpha has three components:
- Setup(): initializes operands.
- Predict(): generates the prediction (s1) from the sequence of operations.
- Update(): updates parameters that are learned during training and used during inference (allowing the use of long-term data).
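The sequence-of-operations representation above can be sketched as a tiny interpreter over an operand memory. This is a minimal illustration with hypothetical operator names, not the paper's actual operator set; only the structure (OP, input operands, output operand) and the special operands m0/s1 follow the text:

```python
import numpy as np

# Hypothetical operator table: each OP maps input operands to an output value.
OPS = {
    "mean": lambda m: m.mean(),      # matrix -> scalar
    "mul":  lambda a, b: a * b,      # scalar/elementwise product
    "sub":  lambda a, b: a - b,
}

def run_alpha(program, memory):
    """Execute one alpha component (Setup/Predict/Update): a list of
    (OP, inputs, output) operations over a shared operand memory."""
    for op, inputs, output in program:
        memory[output] = OPS[op](*(memory[k] for k in inputs))
    return memory

# m0: input feature matrix; s2: a parameter (learned by Update()); s1: prediction.
memory = {"m0": np.arange(6.0).reshape(2, 3), "s2": 0.5}
predict = [("mean", ["m0"], "s3"), ("mul", ["s3", "s2"], "s1")]
run_alpha(predict, memory)
print(memory["s1"])  # 1.25
```

In this scheme Setup(), Predict(), and Update() are simply three such programs sharing one memory, with Update() overwriting parameter operands between training steps.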
The AlphaEvolve framework uses an evolutionary algorithm:
- Initialization: Starts with a parent alpha (potentially a domain-expert-designed one). A population of candidate alphas is generated by mutating the parent. Mutations involve randomizing operands/operators, inserting operations, or removing operations.
- Evaluation: Each candidate alpha is evaluated on a set of tasks (stocks) using a validation set (Sv). The primary fitness score is the Information Coefficient (IC), which measures the correlation between predicted and actual stock returns across all tasks at each time step, averaged over time.
- Selection: A tournament selection process chooses a new parent alpha from a random subset of the population based on the highest fitness score.
- Evolution: A new population is generated by mutating the new parent and replacing the oldest alpha in the previous population. This process iterates for a fixed time budget.
- Weak Correlation Mining: During the evolutionary process, candidate alphas are pruned if their predicted portfolio returns on the validation set are highly correlated (above a threshold, e.g., 15% Pearson correlation) with alphas already found in the set of best alphas. This ensures the discovery of a set of weakly correlated signals.
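The evolve-and-prune loop above can be sketched as follows. This is a minimal sketch: `fitness`, `mutate`, and `returns_of` are hypothetical callables standing in for IC evaluation, the mutation operators, and validation-set portfolio returns respectively:

```python
import random
import numpy as np

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def mine_weakly_correlated(population, fitness, mutate, returns_of,
                           cutoff=0.15, tournament_size=10, budget=100):
    """Sketch of the loop: tournament selection on fitness (IC), mutation,
    and pruning of candidates whose validation portfolio returns correlate
    above `cutoff` with alphas already accepted into the best set."""
    best, accepted_returns = [], []
    for _ in range(budget):
        # Tournament selection: best fitness within a random subset.
        contenders = random.sample(population, min(tournament_size, len(population)))
        parent = max(contenders, key=fitness)
        child = mutate(parent)            # randomize operands/operators, insert/remove ops
        # Weak-correlation pruning against the accepted set.
        r = returns_of(child)
        if all(abs(pearson(r, r2)) <= cutoff for r2 in accepted_returns):
            best.append(child)
            accepted_returns.append(r)
        population.pop(0)                 # replace the oldest alpha
        population.append(child)
    return best
```

In practice the budget is a wall-clock limit rather than an iteration count, and fitness evaluation dominates the cost, which is what motivates the pruning and fingerprinting described below.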
Novel Operators and Optimizations:
To enhance the framework, AlphaEvolve introduces specific operators and an optimization technique:
- ExtractionOps: These operators extract scalar or vector features from the input feature matrix (m0). This helps guide the evolutionary process towards the new alpha class by augmenting initial alphas with potentially predictive scalar inputs, making them more likely to be selected and refined.
- RelationOps: These operators model relationships between stocks by allowing operations to use inputs calculated from related tasks (stocks in the same sector or industry) at the same time step.
  - RankOp: ranks the input operand among all stocks in the current task set (FK).
  - RelationRankOp: ranks the input operand among stocks in the same sector/industry (FI).
  - RelationDemeanOp: calculates the difference between the input operand and the mean of those in the same sector/industry (FI).
These operators inject relational domain knowledge selectively without requiring strong structural assumptions like graph neural networks.
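A plausible minimal implementation of the three relational operators on a cross-section of per-stock values follows. The function names and ranking conventions are assumptions; the paper does not specify tie-breaking:

```python
import numpy as np

def rank_op(x):
    """RankOp: rank of each stock's value among all stocks (0 = smallest)."""
    return np.argsort(np.argsort(x))

def relation_rank_op(x, group):
    """RelationRankOp: rank within each stock's own sector/industry group."""
    out = np.empty_like(x, dtype=int)
    for g in np.unique(group):
        idx = np.where(group == g)[0]
        out[idx] = np.argsort(np.argsort(x[idx]))
    return out

def relation_demean_op(x, group):
    """RelationDemeanOp: value minus its sector/industry mean."""
    means = {g: x[group == g].mean() for g in np.unique(group)}
    return x - np.array([means[g] for g in group])

x = np.array([3.0, 1.0, 2.0, 5.0])
sector = np.array([0, 0, 1, 1])
print(rank_op(x))                    # [2 0 1 3]
print(relation_rank_op(x, sector))   # [1 0 0 1]
print(relation_demean_op(x, sector)) # [ 1.  -1.  -1.5  1.5]
```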
- Pruning Technique: To improve efficiency, AlphaEvolve prunes redundant operations and alphas before evaluation.
- Redundant Operations: Operations whose output operand does not contribute to the final prediction (s1) are removed. This is done by representing the alpha as a graph and tracing dependencies backward from s1.
- Redundant Alphas: Entire alphas are considered redundant and pruned if the input feature matrix (m0) is not used in the calculation chain leading to the prediction (s1).
- Fingerprinting: A fingerprint is generated from the pruned alpha structure. This fingerprint is used to check a cache for pre-computed fitness scores, avoiding redundant evaluations of structurally identical or equivalent alphas. This is more efficient than fingerprinting based on predictions after evaluation, especially with a large number of tasks.
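The two pruning rules can be sketched as a single backward pass over a straight-line program. This is a simplification assuming each operand is written at most once; the paper's graph-based tracing is more general. A structural fingerprint of the pruned program then serves as the cache key:

```python
import hashlib

def prune(program, target="s1"):
    """Keep only operations that contribute to the prediction `target`,
    by tracing operand dependencies backward from it."""
    needed, kept = {target}, []
    for op, inputs, output in reversed(program):
        if output in needed:               # this operation feeds the prediction
            kept.append((op, inputs, output))
            needed.update(inputs)
    kept.reverse()
    # The whole alpha is redundant if m0 never reaches the prediction.
    uses_m0 = "m0" in needed
    return kept, uses_m0

def fingerprint(program):
    """Structural fingerprint of a pruned program, used as a fitness-cache key."""
    return hashlib.sha256(repr(program).encode()).hexdigest()

program = [
    ("mean", ["m0"], "s3"),
    ("mul",  ["s4", "s4"], "s5"),   # dead: s5 never reaches s1
    ("mul",  ["s3", "s2"], "s1"),
]
pruned, uses_m0 = prune(program)
print(len(pruned), uses_m0)  # 2 True
```

Because the fingerprint is computed from the pruned structure before evaluation, structurally equivalent candidates hit the cache without ever being run on the (many) stock tasks.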
Experimental Evaluation and Findings:
The framework was evaluated on 5 years of NASDAQ stock price data (2013-2017), using 1026 stocks after filtering. The data was split into training, validation, and test sets.
- Metrics: Performance was measured using Information Coefficient (IC) and Sharpe Ratio (SR) on a long-short portfolio strategy. Correlation between the portfolio returns of different alphas was also tracked.
- Baselines: Compared against a genetic algorithm (alpha_G) and complex ML models (Rank_LSTM, RSR). Four initializations of AlphaEvolve were tested: alpha_AE_D (initialized with a domain-expert alpha), alpha_AE_NOOP (no initialization), alpha_AE_R (random initialization), and alpha_AE_NN (neural-network initialization).
- Weak Correlation Mining: AlphaEvolve consistently outperformed the genetic algorithm in mining weakly correlated alphas over multiple rounds, maintaining higher Sharpe ratios and ICs while adhering to correlation cutoffs. The genetic algorithm's performance deteriorated significantly with increasing correlation constraints.
- Effectiveness of Initializations: Initializing AlphaEvolve with a well-designed domain-expert alpha (alpha_AE_D) generally led to better performance, demonstrating the framework's ability to leverage existing knowledge.
- Comparison with ML Alphas: AlphaEvolve's generated alphas (alpha_AE_D_0, alpha_AE_NN_1) achieved significantly higher Sharpe ratios and ICs than complex ML models like Rank_LSTM and RSR on the NASDAQ dataset. This was attributed to AlphaEvolve's ability to find alphas better suited to the noisy nature of NASDAQ data without imposing rigid relational structures that may not hold.
- Study of Evolved Alphas: Analysis of the discovered alphas (alpha_AE_D_0, alpha_AE_NN_1, alpha_AE_R_2, alpha_AE_D_3, alpha_AE_B0_4) showed they employ combinations of temporal features, historical data stored as parameters, and relational information (via RelationOp in alpha_AE_NN_1). Some alphas contain conditional logic that reduces them to formulaic alphas under certain conditions.
- Ablation Study of the Update Function: Removing the parameter-updating function (the _P variants) generally decreased IC, confirming the effectiveness of storing long-term historical data as parameters for improving predictive power. The effect on Sharpe ratio was mixed, highlighting the difference between overall ranking quality (IC) and top/bottom stock selection (SR).
- Efficiency of Pruning: The pruning technique drastically increased the number of unique alphas searched within the time budget compared to a baseline using prediction-based fingerprinting, demonstrating its effectiveness and efficiency, particularly for problems with many tasks like stock prediction.
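The two evaluation metrics can be sketched as follows. This is a minimal version; the paper's exact long-short portfolio construction and annualization conventions may differ:

```python
import numpy as np

def information_coefficient(preds, rets):
    """IC: Pearson correlation between predicted and realized returns across
    stocks at each time step, averaged over time. preds, rets: (T, N) arrays."""
    ics = [np.corrcoef(p, r)[0, 1] for p, r in zip(preds, rets)]
    return float(np.mean(ics))

def sharpe_ratio(portfolio_returns, periods_per_year=252):
    """Annualized Sharpe ratio of a (long-short) portfolio return series."""
    r = np.asarray(portfolio_returns)
    return float(r.mean() / r.std() * np.sqrt(periods_per_year))
```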
Practical Implications:
AlphaEvolve provides a data-driven framework for quantitative investment practitioners to automatically discover novel trading signals that are potentially more effective and diversifiable than traditional methods.
- It automates the process of finding alphas that combine the benefits of simple formulaic alphas and complex ML models.
- It specifically addresses the crucial requirement of generating sets of weakly correlated alphas for risk management, a major challenge in quantitative finance.
- The selective injection of relational domain knowledge allows leveraging market structure when useful but avoids detrimental assumptions in volatile markets.
- The efficiency improvements from the pruning technique make the alpha search process more feasible within practical time budgets.
In conclusion, AlphaEvolve demonstrates the potential of AutoML and evolutionary algorithms to discover a new class of high-performing, weakly correlated alphas, offering an automated route to diversified portfolios with strong risk-adjusted returns.