
FactorMiner: Alpha Discovery & Boolean Factorization

Updated 22 February 2026
  • FactorMiner is a dual-purpose framework that enables scalable alpha discovery in quantitative finance and minimal Boolean factorization in databases.
  • It leverages a modular architecture with a feature engineering operator, robust evaluation pipeline, and experience memory to ensure high-IC factors and reduced redundancy.
  • Empirical evaluations demonstrate improved statistical performance and efficiency, with strategies like the Ralph Loop enabling continuous self-evolution and optimal factor admissions.

FactorMiner is a term that refers both to a concrete self-evolving agent for financial alpha discovery and, in database theory, to minimal factorization algorithms for Boolean provenance formulas. Its most prominent instantiations are (1) a lightweight agent-based framework for scalable, memory-guided discovery of interpretable formulaic alpha factors under high redundancy constraints in quantitative finance, and (2) a system for optimally factorizing the provenance of self-join-free conjunctive queries in relational databases. Both leverage modular architectures, structured memory or plan enumeration, and algorithmic pipelines with rigorous performance guarantees and experimental validation (Wang et al., 16 Feb 2026, Makhija et al., 2021).

1. Modular Skill Architectures for Financial Alpha Mining

FactorMiner, as introduced for formulaic alpha search, encapsulates its principal functionality in a single "factor-mining" skill, internally orchestrated via three tightly coupled submodules:

  • Operator Layer (Feature Engineering): Applies symbolic formulaic expressions $\{\alpha\}$ to market tensors $\mathcal{D}\in\mathbb{R}^{M\times T\times F}$ using $60+$ statically typed operators, including basic arithmetic, rolling-window statistics, cross-sectional ranks, regression diagnostics, and compositional logic.
  • Financial Evaluation Pipeline: Implements a multi-stage filter — fast information coefficient (IC) screening, redundancy checks via pairwise historical Spearman correlation, intra-batch deduplication, and full validation — using the following metrics:

$\mathrm{IC}_t(\alpha) = \mathrm{Corr}_{\mathrm{rank}}(s_t^{(\cdot)}(\alpha), r_{t+1}^{(\cdot)})$

$\rho(\alpha, g) = \frac{1}{|\mathcal{T}|} \sum_{t\in\mathcal{T}} \mathrm{Corr}_{\mathrm{rank}}(s_t(\alpha), s_t(g))$

$\mathrm{ICIR}(\alpha) = \frac{\mathbb{E}_t[\mathrm{IC}_t(\alpha)]}{\mathrm{Std}_t[\mathrm{IC}_t(\alpha)]}$

  • Factor Admission: Enforces rigorous thresholds: $\mathrm{IC} \geq 0.04$ and $\max_{g\in\mathcal{L}} |\rho(\alpha, g)| < 0.5$ (defaults for A-shares).

This separation of high-precision batch computation from LLM agent reasoning ensures both deterministic arithmetic correctness and scalable throughput (Wang et al., 16 Feb 2026).
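The screening metrics above can be sketched in a few lines of numpy. This is a minimal illustration of the IC, pairwise-correlation, and ICIR definitions, not FactorMiner's actual implementation; the data shapes and function names are assumptions.

```python
import numpy as np

def rank(x):
    # Ordinal ranks (ties are negligible for continuous scores).
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(a, b):
    # Rank correlation = Pearson correlation of the ranks.
    return np.corrcoef(rank(a), rank(b))[0, 1]

def daily_ic(scores, fwd_returns):
    # scores, fwd_returns: (T, M) arrays of cross-sectional factor
    # scores s_t and next-period returns r_{t+1}.
    return np.array([spearman(s, r) for s, r in zip(scores, fwd_returns)])

def icir(ic_series):
    return ic_series.mean() / ic_series.std()

# Synthetic demo: a factor with a weak but genuine predictive relation.
rng = np.random.default_rng(0)
T, M = 50, 100
signal = rng.normal(size=(T, M))
returns = 0.3 * signal + rng.normal(size=(T, M))
ic = daily_ic(signal, returns)
print(ic.mean(), icir(ic))
```

The redundancy metric $\rho(\alpha, g)$ is the same `spearman` averaged over dates, applied to two factors' score series instead of scores versus returns.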

2. Experience Memory and Knowledge Distillation

FactorMiner’s experience memory $\mathcal{M} = (\mathcal{P}_{\mathrm{succ}}, \mathcal{P}_{\mathrm{fail}}, \mathcal{S}, \mathcal{I})$ captures actionable knowledge obtained during prior mining episodes:

  • $\mathcal{P}_{\mathrm{succ}}$: Templates yielding previously admitted (high-IC, low-correlation) factors.
  • $\mathcal{P}_{\mathrm{fail}}$: Families systematically linked to redundancy or low-diversity rejections.
  • $\mathcal{S}$: Global mining state, library size, saturation indicators.
  • $\mathcal{I}$: Strategic insights distilled from exploratory trajectories.

Memory operators include formation ($F$), evolution ($E$), and memory-conditioned retrieval ($R$), which collectively guide generation by emphasizing high-yield patterns and steering clear of saturated or forbidden design spaces:

$m_t = R(\mathcal{M}_t, \mathcal{L}_t)$

By leveraging structural similarity metrics (rank correlation, tree edit distance), the experience memory eases navigation of the “Correlation Red Sea” — the rapidly shrinking set of orthogonal alpha candidates as the library grows (Wang et al., 16 Feb 2026).
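A hypothetical sketch of the memory tuple and the retrieval operator $R$; the class layout, field names, and string templates are illustrative assumptions, not the paper's data structures.

```python
from dataclasses import dataclass, field

@dataclass
class ExperienceMemory:
    # M = (P_succ, P_fail, S, I)
    p_succ: list = field(default_factory=list)    # templates behind admitted factors
    p_fail: list = field(default_factory=list)    # families linked to rejections
    state: dict = field(default_factory=dict)     # library size, saturation flags
    insights: list = field(default_factory=list)  # distilled strategy notes

    def retrieve(self, library_size, k=3):
        # R(M_t, L_t): surface recent successful templates to emphasize
        # and failed families to avoid when prompting the generator.
        self.state["library_size"] = library_size
        return {
            "emphasize": self.p_succ[-k:],
            "avoid": self.p_fail[-k:],
            "insights": self.insights[-k:],
        }

mem = ExperienceMemory()
mem.p_succ.append("ts_rank(close / ts_mean(close, 20), 10)")
mem.p_fail.append("plain momentum family (saturated, high library correlation)")
ctx = mem.retrieve(library_size=12)
print(ctx["emphasize"])
```

In the full system, the retrieved context `ctx` would be rendered into the LLM prompt that conditions candidate generation.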

3. Ralph Loop: Retrieve–Generate–Evaluate–Distill Paradigm

FactorMiner realizes a continuous self-evolutionary process through the Ralph Loop:

while |L| < K and budget remains:
    m ← R(M, L)
    C ← sample π(·|m) via LLM
    τ ← evaluate C via skill
    L ← L ∪ admitted(τ)
    M ← E(M, F(τ))

  • Retrieve: Constructs memory-guided prompts to bias candidate generation distributions $\pi(\alpha \mid m)$ toward promising or underexplored templates.
  • Generate: The LLM samples candidate formulas, interpreted as expression trees over $\Omega$.
  • Evaluate: The modular skill pipeline filters, screens, and records statistical metrics for all candidates.
  • Distill: Logging all trial outcomes, the system updates $\mathcal{M}$ to accelerate convergence and improve library diversity.

This cycle enables efficient, diversity-maintaining exploration of high-dimensional formula space, as confirmed by ablation: memory yields a 60% acceptance rate for high-IC factors versus 20% without memory, and rejects a higher proportion for redundancy (Wang et al., 16 Feb 2026).
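The loop above can be rendered as a runnable skeleton. The stub generator and evaluator below stand in for the LLM and the factor-mining skill; everything about their interfaces is an assumption for illustration.

```python
import random

def ralph_loop(generate, evaluate, retrieve, distill, K=5, budget=50):
    # Retrieve–Generate–Evaluate–Distill cycle: grow library L toward
    # size K while the evaluation budget lasts, updating memory M each round.
    library, memory = [], {"succ": [], "fail": []}
    while len(library) < K and budget > 0:
        m = retrieve(memory, library)              # m ← R(M, L)
        candidates = generate(m)                   # C ← sample π(·|m)
        outcomes = [(c, evaluate(c, library)) for c in candidates]  # τ
        library += [c for c, ok in outcomes if ok]  # L ← L ∪ admitted(τ)
        memory = distill(memory, outcomes)          # M ← E(M, F(τ))
        budget -= len(candidates)
    return library, memory

# Stubs: candidates are random scalars; "admit" if strong and non-redundant.
random.seed(0)
gen = lambda m: [random.random() for _ in range(4)]
ev = lambda c, lib: c > 0.7 and all(abs(c - g) > 0.05 for g in lib)
ret = lambda mem, lib: mem
dis = lambda mem, outs: {"succ": mem["succ"] + [c for c, ok in outs if ok],
                         "fail": mem["fail"] + [c for c, ok in outs if not ok]}

lib, mem = ralph_loop(gen, ev, ret, dis)
print(len(lib), len(mem["fail"]))
```

The separation mirrors the paper's design: `evaluate` is deterministic and auditable, while only `generate` involves stochastic LLM sampling.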

4. Factor Generation Space and Statistical Evaluation

FactorMiner defines its search space as all finite-depth ($\leq 5$–$7$) expression trees over the operator set $\Omega$ and canonical financial features (open, high, low, close, etc.). The combinatorial explosion is contained via:

  • Deterministic type-checking and operator arity constraints.
  • Statistical admissions criteria: minimum IC, ICIR, and strict diversity constraints.
  • Replacement heuristics allowing highly predictive, mildly redundant formulas to replace older library members if $\mathrm{IC}(\alpha) \geq \mathrm{IC}(g^*) + \Delta$.

Average pairwise correlation among admitted factors stabilizes around $0.30$, demonstrating that library expansion proceeds without degenerating into duplication (Wang et al., 16 Feb 2026).
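The admission and replacement logic can be summarized as a small decision function. Thresholds follow the defaults quoted above; the margin `delta` and the three-way return convention are illustrative assumptions.

```python
def admit(ic_new, rho_max, ic_gstar, ic_thresh=0.04, rho_thresh=0.5, delta=0.01):
    """Decide the fate of a candidate factor.

    ic_new:   candidate's IC
    ic_gstar: IC of the most-correlated library factor g*
    rho_max:  max |rho(alpha, g)| over the current library
    Returns 'admit', 'replace' (candidate supplants g*), or 'reject'.
    """
    if ic_new < ic_thresh:
        return "reject"                 # fails minimum-IC screen
    if rho_max < rho_thresh:
        return "admit"                  # strong and sufficiently orthogonal
    if ic_new >= ic_gstar + delta:
        return "replace"                # mildly redundant but clearly stronger
    return "reject"

print(admit(0.08, 0.3, 0.05))  # admit
print(admit(0.08, 0.6, 0.05))  # replace
print(admit(0.03, 0.2, 0.05))  # reject
```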

5. Comparative Performance and Empirical Evaluation

Across major equity (CSI 500, CSI 1000, HS 300) and crypto universes, FactorMiner outperforms or matches baseline frameworks:

Dataset  Method            IC     ICIR  Avg |ρ|
CSI500   Random RF         2.68%  0.25  0.13
CSI500   Alpha101 Adapted  5.06%  0.43  0.21
CSI500   GPLearn           6.04%  0.43  0.44
CSI500   AlphaAgent        5.90%  0.46  0.32
CSI500   FactorMiner       8.25%  0.77  0.31

This improvement generalizes: IC gains typically range from 1 to 3 percentage points, with stable cross-factor correlation. Out-of-sample tests on frozen top-40 libraries further corroborate robustness to overfitting and domain shift (Wang et al., 16 Feb 2026).

6. FactorMiner for Boolean Provenance: Minimal Factorization Algorithm

In database provenance, FactorMiner denotes a tool and algorithmic approach for computing minimal-size factorizations of provenance for self-join-free conjunctive queries (Makhija et al., 2021).

  • Problem: Given the monotone DNF provenance $\varphi_p(Q, D)$, factorize it into an equivalent formula with minimal literal count (variable repetitions).
  • Complexity: For queries with an "active triad," FACT(Q) is NP-complete; read-once and hierarchical cases are tractable.
  • Connections to Query Planning: Every factorization tree (FT) corresponds to assignments of minimal variable elimination orders (VEOs) across witnesses; the minimal factorization cost can be represented as an ILP:

$\min \sum_{w} \sum_{v^r} c(v^r)\, p_{w,v^r}$

subject to allocation constraints on assignments $q_{w,v}$ and usage $p_{w,v^r}$.

  • Exact Algorithm: Integer linear programming (ILP) yields the optimal solution for arbitrary queries; practical for moderate $|W|$ and $|\mathrm{mveo}(Q)|$.
  • Approximate Algorithm: Max-flow/min-cut (MFMC) on a flow-graph parameterized by "running-prefix" order yields polynomial-time, constant-factor approximations with optimality for all known PTIME classes.

Pipeline:

  1. Parse $Q$ and compute $\mathrm{mveo}(Q)$.
  2. Enumerate the witnesses $W$ in $D$.
  3. Build ILP or MFMC instance.
  4. Solve with an ILP solver (e.g., Gurobi) or a max-flow routine.
  5. Reconstruct factorized formula (FT or QEP).

Both algorithms can recover all known tractable and read-once cases in PTIME (Makhija et al., 2021).
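A toy illustration of what "minimal literal count" means, using a brute-force single-variable factoring rather than the paper's ILP or flow-based algorithms (the helper functions are invented for this sketch):

```python
def dnf_literals(dnf):
    # Literal count of a flat monotone DNF, given as a list of tuples.
    return sum(len(term) for term in dnf)

def factor_on(var, dnf):
    # Cost of φ rewritten as var·(Σ terms containing var, var removed)
    # + (Σ terms without var): var appears once instead of repeatedly.
    with_v = [tuple(l for l in t if l != var) for t in dnf if var in t]
    without = [t for t in dnf if var not in t]
    return 1 + sum(len(t) for t in with_v) + sum(len(t) for t in without)

# Provenance of a 2-chain query over witnesses {x1y1, x1y2, x2y1}.
phi = [("x1", "y1"), ("x1", "y2"), ("x2", "y1")]
flat = dnf_literals(phi)                                   # 6 literals
best = min(factor_on(v, phi) for v in {"x1", "x2", "y1", "y2"})
print(flat, best)  # factoring on x1 gives x1(y1 + y2) + x2·y1: 5 literals
```

The real algorithms search over all variable elimination orders jointly across witnesses, not just a single shared variable, which is where the ILP/min-cut machinery earns its keep.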

7. Strengths, Limitations, and Prospects

Strengths:

  • Self-evolution via experience memory enables efficient library growth under redundancy constraints.
  • Modular skill abstraction ensures deterministic, verifiable evaluation disjoint from LLM-driven exploration.
  • Empirical evidence for superior IC/ICIR performance and library diversity over both GP and RL-based baselines.

Limitations:

  • Current memory update protocol is batch-oriented; online updates for non-stationary scenarios are not implemented.
  • Transaction cost and market impact modeling are not included.
  • In database context, NP-completeness for cases with active triads implies scalability limits for factorization on complex non-hierarchical queries.

Potential Extensions:

  • Online/continual adaptation of memory for dynamic markets.
  • Multi-asset and multi-frequency mining.
  • Integration of transaction-cost and capacity-aware objectives.
  • Generalizing Boolean factorization beyond self-join-free queries.

FactorMiner frameworks provide a principled, interpretable, and empirically validated solution for formulaic alpha discovery in quantitative finance and for minimal factorization in provenance analysis, each advancing the automation and scalability of discovery under combinatorial redundancy and structural complexity constraints (Wang et al., 16 Feb 2026, Makhija et al., 2021).
