- The paper demonstrates that MarketSenseAI’s strong-buy recommendations yield statistically significant excess returns (+25.2pp compounded) compared to passive benchmarks.
- It deploys a live, multi-agent framework where four LLM agents synthesize diversified inputs from news, fundamentals, dynamics, and macro data.
- NNLS decomposition confirms the attribution of signals to individual agents, highlighting adaptive regime rotation and actionable equity selection.
Signal or Noise? Portfolio-Level Evidence from Multi-Agent LLM-Based Stock Recommendations
Introduction
The paper "Signal or Noise in Multi-Agent LLM-based Stock Recommendations?" (2604.17327) presents a rigorous portfolio-level evaluation of MarketSenseAI, a deployed multi-agent LLM equity signal generation platform. The core inquiry is whether the system’s strong-buy stock recommendations, synthesized from four specialist LLM agents (News, Fundamentals, Dynamics, Macro), yield statistically significant outperformance versus both passive benchmarks and randomly constructed portfolios. Notably, the empirical framework is structured to control for look-ahead bias and survivorship, with all signals generated live at each observation date and statistical inference anchored in direct Monte Carlo randomization rather than retrospective backtesting.
System Architecture and Live Signal Generation
MarketSenseAI comprises four domain-specialized LLM agents, each producing focused analysis for every stock at each monthly observation date: News (company-specific news flows), Fundamentals (financial statements, transcripts), Dynamics (technical/price-action signals), and Macro (sector/macro regime). These analyses are synthesized by a higher-level agent, which generates a unified thesis and a 5-class ordinal recommendation (“strong sell” through “strong buy”). All textual outputs are embedded using OpenAI’s text-embedding-3-small (dimension 1536), and both the agent and thesis embeddings are available for downstream analysis and attribution.
The design is fully live: at every date, MarketSenseAI draws only on the data available at that time, so subsequent news and earnings are genuinely unknown when the signal is produced. This removes knowledge leakage and classic look-ahead bias, and it underpins the integrity of both the performance results and the attribution diagnostics.
Statistical Evaluation: Monte Carlo Design and Portfolio Results
Performance is benchmarked in two fixed, non-overlapping cohorts. At each monthly date, the set of strong-buy recommendations is extracted and used to form an equal-weighted portfolio. The principal test is against a Monte Carlo (MC) null: for each date, 10,000 random portfolios are drawn from the same eligible set, matching the actual portfolio in number of holdings, universe, and timing. Any measured outperformance relative to this null therefore reflects stock selection rather than universe or timing effects.
Figure 2: Monte Carlo null distributions of mean monthly equal-weight portfolio returns, illustrating the strong-buy result at the far right tail (S cohort, 19 months).
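The randomization test described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function name `mc_null_pvalue`, the input layout (one dict of forward returns per date), and the plain "share of draws at least as good" p-value are all assumptions made for the example. It also assumes at least one strong-buy pick on every date.

```python
import random
from statistics import mean

def mc_null_pvalue(returns_by_date, picks_by_date, n_draws=10_000, seed=0):
    """One-sided Monte Carlo p-value for the mean monthly portfolio return.

    returns_by_date: list of dicts {ticker: one-month forward return}, one per date
                     (hypothetical layout; each dict is that date's eligible universe)
    picks_by_date:   list of non-empty ticker lists actually selected on each date
    """
    rng = random.Random(seed)
    # Observed statistic: average over dates of the equal-weight portfolio return.
    observed = mean(
        mean(rets[t] for t in picks)
        for rets, picks in zip(returns_by_date, picks_by_date)
    )
    null_means = []
    for _ in range(n_draws):
        monthly = []
        for rets, picks in zip(returns_by_date, picks_by_date):
            # Same date, same eligible universe, same number of holdings.
            draw = rng.sample(sorted(rets), len(picks))
            monthly.append(mean(rets[t] for t in draw))
        null_means.append(mean(monthly))
    # Share of random portfolios that perform at least as well as the real one.
    p_value = sum(m >= observed for m in null_means) / n_draws
    return observed, p_value
```

Because each random portfolio shares the date, universe, and size of the real one, the only thing that differs between the observed statistic and the null draws is which names were chosen.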
On the broader S cohort (Sep 2024–Mar 2026, 467 stocks, 19 dates), the strong-buy portfolio earns +2.18% per month versus +1.15% for a passive benchmark that approximates RSP. The compounded excess over the period reaches +25.2 percentage points. The MC-null p-value is 0.003, placing the observed result at the 99.7th percentile of the null distribution, which is strong evidence for genuine selection skill.
Figure 4: Compound growth of the strong-buy portfolio vs the EW-universe benchmark and MC null; strong-buy path lies above null and benchmark over most of the sample.
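To make the link between the monthly means and a compounded excess concrete, the sketch below compounds two return series and reports the gap in percentage points. The helper `compounded_excess_pp` is hypothetical, and defining the excess as a difference of cumulative growth factors is an assumption. The illustration feeds in constant returns equal to the reported monthly means, since the actual monthly paths are not given here.

```python
from math import prod

def compounded_excess_pp(portfolio_returns, benchmark_returns):
    """Difference in cumulative compounded return, in percentage points."""
    port_growth = prod(1.0 + r for r in portfolio_returns)
    bench_growth = prod(1.0 + r for r in benchmark_returns)
    return 100.0 * (port_growth - bench_growth)

# Illustration only: 19 months of constant returns at the reported means.
approx = compounded_excess_pp([0.0218] * 19, [0.0115] * 19)
print(f"{approx:.1f} pp")
```

Under these constant-return assumptions the gap comes out at roughly 26 pp, close to the reported +25.2 pp. The two figures need not match, because compounding a varying monthly path generally gives a different result from compounding its arithmetic mean.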
The robustness check on the narrower S cohort (94 stocks, 35 months) shows directional consistency (+30.5pp compounded excess), though the p-value is 0.17; statistical power is attenuated by the small average portfolio size.
Crucially, the MC design neutralizes systematic factor exposures and market timing, so the excess cannot be ascribed to generic momentum or beta exposures. Further, regression of portfolio returns on the market delivers β=0.865 (below one), and the system preserves more excess return in down-market months than up, ruling out a naïve risk-loading explanation.
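The two diagnostics in this paragraph, the market beta and the up-market versus down-market split of excess return, can be sketched as follows. The function names are hypothetical, the choice of market proxy is left to the caller, and classifying months by the sign of the market return is an assumption. The sample is assumed to contain at least one up month and one down month.

```python
from statistics import mean

def market_beta(portfolio, market):
    """OLS slope from regressing portfolio returns on market returns (with intercept)."""
    mp, mm = mean(portfolio), mean(market)
    cov = sum((p - mp) * (m - mm) for p, m in zip(portfolio, market))
    var = sum((m - mm) ** 2 for m in market)
    return cov / var

def excess_by_market_direction(portfolio, market):
    """Mean excess return (portfolio minus market) in up months and in down months."""
    up = [p - m for p, m in zip(portfolio, market) if m >= 0]
    down = [p - m for p, m in zip(portfolio, market) if m < 0]
    return mean(up), mean(down)
```

A beta below one combined with a larger excess in down months is the pattern the text describes: the portfolio is not simply carrying more market risk than the benchmark.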
Attribution: NNLS Decomposition of Synthesis Theses
A core methodological contribution is the application of NNLS (non-negative least squares) to decompose the thesis embedding for each stock-date onto the four agent embeddings. This yields a vector of attribution weights reflecting the influence of each underlying LLM agent on the ultimate recommendation.
The agent embedding subspace reconstructs the thesis with high fidelity (CTR=0.944). Pairwise agent embedding cosines range from 0.46 to 0.79, which justifies joint NNLS regression over univariate cosine similarity: single-agent attribution would confound collinearity with causal influence.
Figure 5: Thesis–agent cosine similarity heatmap, indicating high mutual similarity and requiring multivariate attribution for interpretability.
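The decomposition step can be sketched with SciPy's `scipy.optimize.nnls`, which solves min ||A w - b|| subject to w >= 0. This is an illustrative sketch under stated assumptions: the function name `attribute_thesis` is hypothetical, the text does not say whether embeddings or weights are normalized, and the cosine-based fit measure returned here is a stand-in that is not claimed to be the CTR statistic quoted above.

```python
import numpy as np
from scipy.optimize import nnls

def attribute_thesis(agent_embeddings, thesis_embedding):
    """Non-negative weights w minimizing ||A w - thesis||_2.

    agent_embeddings: array of shape (n_agents, dim), one row per agent
    thesis_embedding: array of shape (dim,)
    """
    A = np.asarray(agent_embeddings, dtype=float).T  # columns are agents
    b = np.asarray(thesis_embedding, dtype=float)
    weights, residual_norm = nnls(A, b)
    # Cosine between the thesis and its reconstruction (illustrative fit measure).
    recon = A @ weights
    denom = np.linalg.norm(recon) * np.linalg.norm(b)
    fit_cosine = float(recon @ b / denom) if denom > 0 else 0.0
    return weights, residual_norm, fit_cosine
```

Fitting all four agents jointly is what lets the weights separate overlapping agents: when two agent embeddings are highly similar, each one's individual cosine with the thesis is inflated by the other, whereas the joint fit assigns weight only to the part each agent explains beyond the rest.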
Spearman correlation between each agent’s cosine similarity and its NNLS weight is high (ρs up to 0.83), validating the attribution structure.
To quantify ranking skill, the study computes the Information Coefficient (IC), defined as the cross-sectional Spearman ρ between signal and one-month forward returns, both pooled and at the date level. On actionable long signals (buy or strong-buy), the ordinal recommendation score demonstrates a statistically significant IC (+0.051, p=0.024, t=2.13 on S; threshold for p<0.05 is |t|>2.10 for 19 dates).
Figure 1: Date-level cross-sectional IC for ordinal recommendation, with significance line and positive-IC dates highlighted (S cohort).
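A date-level IC calculation can be sketched as below. The function names are hypothetical, and computing the t-statistic as the mean of the date-level ICs divided by its standard error is an assumption, although it is consistent with the 19-date threshold quoted above. Ties are given average ranks, which matters for a five-class ordinal signal. The sketch assumes that neither signals nor returns are constant within a date.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def date_level_ic(signals_by_date, returns_by_date):
    """Per-date cross-sectional IC, its mean, and the t-statistic of that mean."""
    ics = [spearman(s, r) for s, r in zip(signals_by_date, returns_by_date)]
    n = len(ics)
    m = mean(ics)
    sd = (sum((ic - m) ** 2 for ic in ics) / (n - 1)) ** 0.5
    t_stat = m / (sd / n ** 0.5)
    return ics, m, t_stat
```

The pooled variant would instead call `spearman` once on all stock-date pairs stacked together.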
Among agents, Fundamentals and Macro weights exhibit persistent positive pooled IC, with Macro leading in concentrated universes. Notably, no single agent dominates in all regimes: Fundamentals leads in stable periods, Macro during macro-driven events, and Dynamics emerges episodically as a leading signal under discrete momentum regimes. The best-agent timeline illustrates rotation in influence, consistent with economic regime changes.
Figure 3: Timeline of dominant agent by date, demonstrating adaptive switching among agents as economic regime evolves.
Additionally, NNLS agent weights encode finer-grained cross-sectional return information than the ordinal label, but the synthesis agent effectively compresses this information for actionable universe filtering.
Sector Heterogeneity, Adaptive Integration, and Return Distribution
Strong-buy selections rotate non-trivially among sectors over time, with Financials overweight early in the sample and Information Technology and Health Care increasing later—mirroring observable macro-calendar events and economic regime shifts.
Figure 8: Sector composition of strong-buy basket by month, visualizing sector rotation and non-static selection bias.
Sector-wise attribution shows that, e.g., Info Tech stocks are driven more by Macro-weighted theses, while Energy/Utilities weight Dynamics more due to sector-specific signal structure.
Return distribution analysis of strong-buy signals shows not only a higher mean but also reduced downside risk compared to “hold” signals. This is reflected in the left tail of the empirical CDF.
Figure 6: Empirical CDFs of strong-buy versus hold signal forward returns, demonstrating lower downside event probability for strong-buys.
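The left-tail comparison can be read directly off the empirical CDF at a chosen loss level. The sketch below is illustrative only: the helper names are hypothetical, and the -10% threshold is an arbitrary assumption and not a value taken from the study.

```python
def ecdf_at(returns, threshold):
    """Empirical CDF: share of observations less than or equal to threshold."""
    return sum(r <= threshold for r in returns) / len(returns)

def left_tail_gap(strong_buy_returns, hold_returns, threshold=-0.10):
    """Hold minus strong-buy probability of a forward return at or below threshold.

    A positive value means strong-buys had fewer downside events at that level.
    """
    return ecdf_at(hold_returns, threshold) - ecdf_at(strong_buy_returns, threshold)
```

Evaluating `left_tail_gap` over a grid of thresholds traces the vertical distance between the two curves across the left tail.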
Practical Implications and Theoretical Impacts
The finding that the strong-buy recommendation outperforms passive and random benchmarks, with robust controls against look-ahead and survivorship bias, provides rare and transparent evidence that LLM-based multi-agent signal generation can produce actionable investment signals not trivially reduced to classical factor exposures.
The adaptive integration observed, where agent influence rotates endogenously with market regime and sectoral leadership, is indicative of context-sensitive mixture-of-experts behavior—reminiscent of adaptive ensemble techniques but operationalized through interpretably structured LLM agent teams. This reinforces the potential for multi-agent LLMs to act as dynamic context delegators, discover new information orthogonal to traditional factors, and serve as robust universe filters for downstream portfolio construction.
Conclusion
The MarketSenseAI empirical study demonstrates, under formal MC-inference and robust attribution diagnostics, that LLM-based multi-agent recommendations contain genuine selection signal not explained by passive benchmark exposures or naïve factor loading. Agent-driven adaptation to economic regime and sector context is directly interpretable via NNLS embedding analysis.
These results provide foundational evidence that multi-agent LLM systems can yield interpretable, contextually aware signals that add measurable value at the portfolio level. The findings motivate future research into longer time horizons, more diverse universes (including non-US equity, mid/small caps), alternative attribution methodologies, and direct integration with downstream optimization or risk overlays. They also highlight a new direction for research in agent prioritization, dynamic agent weighting, and the potential for more context-aware synthesis architectures to further amplify alpha generation in noisy financial environments.