Signal or Noise in Multi-Agent LLM-based Stock Recommendations?

Published 19 Apr 2026 in q-fin.PM, cs.AI, and q-fin.ST | (2604.17327v1)

Abstract: We present the first portfolio-level validation of MarketSenseAI, a deployed multi-agent LLM equity system. All signals are generated live at each observation date, eliminating look-ahead bias. The system routes four specialist agents (News, Fundamentals, Dynamics, and Macro) through a synthesis agent that issues a monthly equity thesis and recommendation for each stock in its coverage universe, and we ask two questions: do its buy recommendations add value over both passive benchmarks and random selection, and what does the internal agent structure reveal about the source of the edge? On the S&P 500 cohort (19 months) the strong-buy equal-weight portfolio earns +2.18%/month against a passive equal-weight benchmark of +1.15% (approximating RSP), a +25.2% compound excess, and ranks at the 99.7th percentile of 10,000 Monte Carlo portfolios (p=0.003). The S&P 100 cohort (35 months) delivers a +30.5% compound excess over EQWL with consistent direction but formal significance not reached, limited by the small average selection of ~10 stocks per month. Non-negative least-squares projection of thesis embeddings onto agent embeddings reveals an adaptive-integration mechanism. Agent contributions rotate with market regime (Fundamentals leads on S&P 500, Macro on S&P 100, Dynamics acts as an episodic momentum signal) and this agent rotation moves in lockstep with both the sector composition of strong-buy selections and identifiable macro-calendar events, three independent views of the same underlying adaptation. The recommendation's cross-sectional Information Coefficient is statistically significant on S&P 500 (ICIR=+0.489, p=0.024). These results suggest that multi-agent LLM equity systems can identify sources of alpha beyond what classical factor models capture, and that the buy signal functions as an effective universe-filter that can sit upstream of any portfolio-construction process.


Authors (2)

Summary

On the S&P 500 cohort, MarketSenseAI’s strong-buy portfolio earns a +25.2pp compounded excess over a passive equal-weight benchmark and ranks at the 99.7th percentile of 10,000 Monte Carlo portfolios (p=0.003).
The system runs live: four specialist LLM agents (News, Fundamentals, Dynamics, Macro) feed a synthesis agent that issues a monthly thesis and recommendation for each stock.
NNLS decomposition of thesis embeddings attributes each thesis to the individual agents and shows agent contributions rotating with market regime.

Signal or Noise? Portfolio-Level Evidence from Multi-Agent LLM-Based Stock Recommendations

Introduction

The paper "Signal or Noise in Multi-Agent LLM-based Stock Recommendations?" (2604.17327) presents a rigorous portfolio-level evaluation of MarketSenseAI, a deployed multi-agent LLM equity signal generation platform. The core inquiry is whether the system’s strong-buy stock recommendations, synthesized from four specialist LLM agents (News, Fundamentals, Dynamics, Macro), yield statistically significant outperformance versus both passive benchmarks and randomly constructed portfolios. Notably, the empirical framework is structured to control for look-ahead bias and survivorship, with all signals generated live at each observation date and statistical inference anchored in direct Monte Carlo randomization rather than retrospective backtesting.

System Architecture and Live Signal Generation

MarketSenseAI comprises four domain-specialized LLM agents, each producing focused analysis for every stock at each monthly observation date: News (company-specific news flows), Fundamentals (financial statements, transcripts), Dynamics (technical/price-action signals), and Macro (sector/macro regime). These analyses are synthesized by a higher-level agent, which generates a unified thesis and a 5-class ordinal recommendation (“strong sell” through “strong buy”). All textual outputs are embedded using OpenAI’s text-embedding-3-small (dimension 1536), and both the agent and thesis embeddings are available for downstream analysis and attribution.

The design is fully live: at every date, MarketSenseAI draws only on the data available at that date, so subsequent news and earnings information is genuinely unknown when the signal is generated. This eliminates knowledge leakage and classic look-ahead bias, and it underpins the integrity of the performance results and the attribution diagnostics.

Statistical Evaluation: Monte Carlo Design and Portfolio Results

Performance is benchmarked in two fixed, non-overlapping cohorts. At each monthly date, the set of strong-buy recommendations is extracted and used to form an equal-weighted portfolio. The principal test is against a Monte Carlo (MC) null: for each date, 10,000 random portfolios are drawn from the same eligible set, matching the actual portfolio in number of holdings, universe, and timing. Any measured outperformance relative to this null therefore reflects stock selection rather than universe or timing effects.
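The randomization test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `mc_null_pvalue`, the list-of-arrays data layout, the one-sided p-value with a +1 correction, and the synthetic returns are all hypothetical choices made for the example.

```python
import numpy as np

def mc_null_pvalue(returns_by_date, picks_by_date, n_sims=10_000, seed=0):
    """Compare the mean monthly equal-weight return of the actual picks
    with random portfolios of the same size drawn from the same universe.

    returns_by_date: list of 1-D arrays, forward returns of every eligible stock per date
    picks_by_date:   list of integer index arrays, the selected stocks per date (non-empty)
    """
    rng = np.random.default_rng(seed)
    n_dates = len(returns_by_date)
    actual = np.mean([np.asarray(r)[p].mean()
                      for r, p in zip(returns_by_date, picks_by_date)])

    null_means = np.zeros(n_sims)
    for r, p in zip(returns_by_date, picks_by_date):
        r = np.asarray(r, dtype=float)
        k = len(p)
        # The k smallest random keys in each row form a uniformly random
        # k-stock subset, i.e. sampling without replacement.
        keys = rng.random((n_sims, r.size))
        idx = np.argpartition(keys, k - 1, axis=1)[:, :k]
        null_means += r[idx].mean(axis=1)
    null_means /= n_dates

    # One-sided p-value; the +1 terms keep it strictly positive (an assumption here).
    p_value = (1 + np.sum(null_means >= actual)) / (1 + n_sims)
    return actual, null_means, p_value

# Synthetic illustration only: an "oracle" that always holds the 10 best stocks.
rng = np.random.default_rng(42)
returns = [rng.normal(0.01, 0.08, size=100) for _ in range(12)]
picks = [np.argsort(r)[-10:] for r in returns]
actual, null_means, p = mc_null_pvalue(returns, picks, n_sims=2000)
print(f"actual mean={actual:.4f}, null mean={null_means.mean():.4f}, p={p:.4f}")
```

Because each simulated portfolio uses the same dates, the same eligible set, and the same number of holdings as the actual one, the only thing that differs between the actual result and the null is which stocks were chosen.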

Figure 2: Monte Carlo null distributions of mean monthly equal-weight portfolio returns, illustrating the strong-buy result at the far right tail (S&P 500 cohort, 19 months).

On the S&P 500 cohort (Sep 2024–Mar 2026, 467 stocks, 19 dates), the strong-buy portfolio earns +2.18% per month versus a passive benchmark of +1.15% (approximating RSP). The compounded excess over the period reaches +25.2 percentage points. The MC-null p-value is 0.003, positioning the actual result at the 99.7th percentile, which is strong evidence for genuine selection skill.
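To make the link between monthly returns and the compounded excess concrete, here is a small sketch. The helper names are hypothetical, and the constant-return series is an assumption used only because the per-month returns are not listed here; compounding the two reported means for 19 months gives a figure near, but not equal to, the reported +25.2pp, which is computed from the actual monthly path.

```python
def compound_growth(monthly_returns):
    """Cumulative return from a sequence of simple monthly returns."""
    growth = 1.0
    for r in monthly_returns:
        growth *= 1.0 + r
    return growth - 1.0

def compound_excess_pp(portfolio, benchmark):
    """Difference in cumulative return, in percentage points."""
    return 100.0 * (compound_growth(portfolio) - compound_growth(benchmark))

# Approximation: every month earns exactly the reported mean (an assumption).
approx = compound_excess_pp([0.0218] * 19, [0.0115] * 19)
print(f"approximate compounded excess: {approx:.1f} pp")
```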

Figure 4: Compound growth of the strong-buy portfolio vs the EW-universe benchmark and MC null; strong-buy path lies above null and benchmark over most of the sample.

The robustness check on the narrower S&P 100 cohort (94 stocks, 35 months) shows directional consistency (+30.5pp compounded excess over EQWL), though the p-value is 0.17: statistical power is attenuated by the small average portfolio size (about 10 stocks per month).

Crucially, the MC design neutralizes systematic factor exposures and market timing, so the excess cannot be ascribed to generic momentum or beta exposures. Further, regression of portfolio returns on the market delivers $\beta=0.865$ (below one), and the system preserves more excess return in down-market months than up, ruling out a naïve risk-loading explanation.
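The market-beta check above can be sketched as an ordinary least-squares regression with an intercept. This is an illustrative sketch only: the function names, the definition of a down-market month as one with a negative market return, and the synthetic return series are assumptions, since the exact regression specification is not given here.

```python
import numpy as np

def market_beta(portfolio_returns, market_returns):
    """OLS fit of portfolio = alpha + beta * market; returns (alpha, beta)."""
    x = np.asarray(market_returns, dtype=float)
    y = np.asarray(portfolio_returns, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    (alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
    return alpha, beta

def excess_by_market_state(portfolio_returns, market_returns):
    """Mean excess return over the market in down months and in up months."""
    x = np.asarray(market_returns, dtype=float)
    excess = np.asarray(portfolio_returns, dtype=float) - x
    return excess[x < 0].mean(), excess[x >= 0].mean()

# Synthetic series built to have beta = 0.865; not the paper's data.
mkt = np.array([-0.04, 0.02, 0.03, -0.01, 0.05, 0.01])
port = 0.004 + 0.865 * mkt
alpha, beta = market_beta(port, mkt)
down, up = excess_by_market_state(port, mkt)
print(f"alpha={alpha:.4f}, beta={beta:.3f}, down excess={down:.4f}, up excess={up:.4f}")
```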

Attribution: NNLS Decomposition of Synthesis Theses

A core methodological contribution is the application of NNLS (non-negative least squares) to decompose the thesis embedding for each stock-date onto the four agent embeddings. This yields a vector of attribution weights reflecting the influence of each underlying LLM agent on the ultimate recommendation.

The agent embedding subspace reconstructs the thesis with high fidelity ($C_\mathrm{TR}=0.944$). Pairwise agent embedding cosines are 0.46–0.79, which justifies joint NNLS regression over univariate cosine similarity: with agents this collinear, single-agent attribution would confound collinearity with causal influence.
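A minimal sketch of the decomposition, using SciPy's `scipy.optimize.nnls`. Several things here are assumptions rather than details confirmed by the text: the function name `attribute_thesis`, the choice not to normalize embeddings or weights, and the reading of the fidelity score as the cosine between the thesis embedding and its NNLS reconstruction. The vectors are random stand-ins, not real embeddings.

```python
import numpy as np
from scipy.optimize import nnls

def attribute_thesis(thesis_emb, agent_embs):
    """Solve min ||A w - thesis||_2 subject to w >= 0, with one column of A per agent.

    Returns the per-agent weights and the cosine similarity between the
    thesis embedding and its reconstruction A @ w.
    """
    names = list(agent_embs)
    A = np.column_stack([agent_embs[n] for n in names])  # shape (dim, n_agents)
    weights, _residual_norm = nnls(A, thesis_emb)
    recon = A @ weights
    denom = np.linalg.norm(recon) * np.linalg.norm(thesis_emb)
    cosine = float(recon @ thesis_emb / denom) if denom > 0 else 0.0
    return dict(zip(names, weights)), cosine

# Synthetic example: a thesis built from Fundamentals and Macro only.
rng = np.random.default_rng(0)
agents = {n: rng.normal(size=64) for n in ["News", "Fundamentals", "Dynamics", "Macro"]}
thesis = 0.6 * agents["Fundamentals"] + 0.3 * agents["Macro"]
weights, fidelity = attribute_thesis(thesis, agents)
print({k: round(float(v), 3) for k, v in weights.items()}, round(fidelity, 3))
```

Fitting all four agents jointly lets the solver assign shared directions to one agent instead of crediting every correlated agent separately, which is the concern raised about single-agent cosine attribution.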

Figure 5: Thesis–agent cosine similarity heatmap, indicating high mutual similarity and requiring multivariate attribution for interpretability.

Spearman correlation between each agent’s cosine similarity and its NNLS weight is high ($\rho_s$ up to 0.83), validating the attribution structure.

Cross-Sectional Performance and Information Coefficient

To quantify ranking skill, the study computes the Information Coefficient (IC), the cross-sectional Spearman $\rho$ between signal and one-month forward returns, both pooled and at the date level. On actionable long signals (buy or strong-buy), the ordinal recommendation score demonstrates a statistically significant IC ($+0.051$, $p=0.024$, $t=2.13$ on the S&P 500 cohort; the threshold for $p<0.05$ is $|t|>2.10$ for 19 dates).
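The date-level IC and its significance test can be sketched as below. The reported $t=2.13$ is consistent with the abstract's ICIR of $+0.489$ through the standard relation $t = \mathrm{ICIR}\cdot\sqrt{T}$ with $T=19$ dates, and $p=0.024$ is consistent with a one-sided Student-t test on 18 degrees of freedom. Whether the paper computes the statistics exactly this way is an assumption, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr, t as student_t

def date_level_ic(signal, forward_returns):
    """Cross-sectional Spearman rank correlation for one observation date."""
    rho, _pvalue = spearmanr(signal, forward_returns)
    return float(rho)

def icir_t_stat(ic_series):
    """ICIR = mean(IC) / std(IC); t = ICIR * sqrt(T); one-sided p on T-1 dof."""
    ic = np.asarray(ic_series, dtype=float)
    n_dates = ic.size
    icir = ic.mean() / ic.std(ddof=1)
    t_stat = icir * np.sqrt(n_dates)
    p_one_sided = float(student_t.sf(t_stat, df=n_dates - 1))
    return icir, t_stat, p_one_sided

# Consistency check against the reported figures (ICIR = 0.489, 19 dates).
t_reported = 0.489 * np.sqrt(19)
p_reported = float(student_t.sf(t_reported, df=18))
print(f"t = {t_reported:.2f}, one-sided p = {p_reported:.3f}")
```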

Figure 1: Date-level cross-sectional IC for ordinal recommendation, with significance line and positive-IC dates highlighted (S&P 500 cohort).

Among agents, Fundamentals and Macro weights exhibit persistent positive pooled IC, with Macro leading in concentrated universes. Notably, no single agent dominates in all regimes: Fundamentals leads in stable periods, Macro during macro-driven events, and Dynamics emerges episodically as a leading signal under discrete momentum regimes. The best-agent timeline illustrates rotation in influence, consistent with economic regime changes.

Figure 3: Timeline of dominant agent by date, demonstrating adaptive switching among agents as economic regime evolves.

Additionally, NNLS agent weights encode finer-grained cross-sectional return information than the ordinal label, but the synthesis agent effectively compresses this information for actionable universe filtering.

Sector Heterogeneity, Adaptive Integration, and Return Distribution

Strong-buy selections rotate non-trivially among sectors over time, with Financials overweight early in the sample and Information Technology and Health Care increasing later—mirroring observable macro-calendar events and economic regime shifts.

Figure 8: Sector composition of strong-buy basket by month, visualizing sector rotation and non-static selection bias.

Sector-wise attribution shows that, e.g., Info Tech stocks are driven more by Macro-weighted theses, while Energy/Utilities weight Dynamics more due to sector-specific signal structure.

Return distribution analysis shows that strong-buy signals have not only a higher mean forward return but also reduced downside risk compared to “hold” signals. This is reflected in the left tail of the empirical CDF.
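The left-tail comparison can be sketched with an empirical CDF. The helper names, the -10% threshold, and the sample returns are hypothetical and chosen only to illustrate the calculation; they are not taken from the paper.

```python
import numpy as np

def ecdf(sample):
    """Sorted values and the fraction of observations at or below each value."""
    x = np.sort(np.asarray(sample, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def tail_probability(sample, threshold):
    """Empirical probability that a forward return is at or below the threshold."""
    return float(np.mean(np.asarray(sample, dtype=float) <= threshold))

# Made-up forward returns for two signal classes.
strong_buy = [-0.05, 0.01, 0.03, 0.06, 0.08]
hold = [-0.20, -0.12, 0.00, 0.02, 0.04]
print(tail_probability(strong_buy, -0.10), tail_probability(hold, -0.10))
```

A lower tail probability for strong-buy than for hold at the same threshold is what a strong-buy CDF lying below the hold CDF in the left tail would show.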

Figure 6: Empirical CDFs of strong-buy versus hold signal forward returns, demonstrating lower downside event probability for strong-buys.

Practical Implications and Theoretical Impacts

The finding that the strong-buy recommendation outperforms passive and random benchmarks, with robust controls against look-ahead and survivorship bias, provides rare and transparent evidence that LLM-based multi-agent signal generation can produce actionable investment signals not trivially reduced to classical factor exposures.

The adaptive integration observed, where agent influence rotates endogenously with market regime and sectoral leadership, is indicative of context-sensitive mixture-of-experts behavior—reminiscent of adaptive ensemble techniques but operationalized through interpretably structured LLM agent teams. This reinforces the potential for multi-agent LLMs to act as dynamic context delegators, discover new information orthogonal to traditional factors, and serve as robust universe filters for downstream portfolio construction.

Conclusion

The MarketSenseAI empirical study demonstrates, under formal MC-inference and robust attribution diagnostics, that LLM-based multi-agent recommendations contain genuine selection signal not explained by passive benchmark exposures or naïve factor loading. Agent-driven adaptation to economic regime and sector context is directly interpretable via NNLS embedding analysis.

These results provide foundational evidence that multi-agent LLM systems can yield interpretable, contextually aware signals that add measurable value at the portfolio level. The findings motivate future research into longer time horizons, more diverse universes (including non-US equity, mid/small caps), alternative attribution methodologies, and direct integration with downstream optimization or risk overlays. They also highlight a new direction for research in agent prioritization, dynamic agent weighting, and the potential for more context-aware synthesis architectures to further amplify alpha generation in noisy financial environments.