- The paper introduces a diagnostic framework to quantify heuristic collapse in LLMs by analyzing input sensitivity and feature concentration.
- It finds that the tested LLMs rely primarily on self-reported risk tolerance; for GPT-4o, this single feature accounts for 57–88% of surrogate-model predictive weight in allocation decisions.
- Web search augmentation has mixed effects, reducing asset concentration in some cases while often homogenizing personalized advice across clients.
Heuristic Collapse in LLM Investment Advice: A Technical Evaluation
Introduction
The paper "One Size Fits None: Heuristic Collapse in LLM Investment Advice" (2604.23837) investigates the tendency of LLMs to reduce multi-factor investment-advice problems to one-dimensional heuristics, a process the authors term heuristic collapse. Rather than synthesizing recommendations from the full client profile, as fiduciary legal standards require, LLMs often anchor on a single salient input, predominantly self-reported risk tolerance, and sideline critical context such as age, income, and liquidity needs. The study builds an empirical pipeline to diagnose this failure mode systematically and evaluates web search augmentation as a mitigation strategy.
Methodological Contributions
The authors implement a controlled, diagnostic framework to elicit and quantify heuristic collapse. Key methodological components include:
- Synthetic Population Generation: 1,000 synthetic client profiles are constructed using Latin hypercube sampling, which stratifies each input dimension so that the profiles cover the input domain broadly with low correlation between features. This supports robust sensitivity analyses of LLM input dependence.
- LLM Auditing via Surrogate Models: For each model and condition, the input-output mapping is reconstructed with interpretable surrogate models (Random Forest and Ridge Regression). Feature concentration (FC), derived via an adaptation of the Herfindahl-Hirschman Index (HHI), is proposed as an operational metric for heuristic collapse.
- Portfolio Recommendation Analysis: LLM outputs are assessed along two axes: diversification (using HHI) and personalization (using pairwise Jaccard similarity). This dual assessment isolates both the heterogeneity of recommendations and their dependence on specific client features.
- Web Search Augmentation: By requiring models to use web search tools for relevant, real-time financial data, the authors probe whether access to current information facilitates deeper, more individualized reasoning.
This framework distinguishes itself from prior LLM evaluation paradigms, which focus predominantly on output quality or factual accuracy, by directly interrogating the sensitivity of outputs to the full range of relevant inputs.
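The synthetic-population step can be illustrated with a minimal, standard-library sketch of Latin hypercube sampling. The feature names and ranges in `FEATURE_RANGES` and the helper `sample_profiles` are hypothetical placeholders, not the paper's actual profile schema, and the authors' implementation may differ (for example, in how categorical fields are handled).

```python
import random

def latin_hypercube(n_samples, n_dims, seed=0):
    """Return n_samples points in [0, 1)^n_dims with one point per stratum per dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # Draw one value uniformly inside each of n_samples equal-width strata.
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so stratum order is independent across dimensions.
        rng.shuffle(column)
        columns.append(column)
    # Transpose columns into rows: one row per sampled point.
    return [list(row) for row in zip(*columns)]

# ASSUMPTION: illustrative features and ranges only; not taken from the paper.
FEATURE_RANGES = {
    "age": (22, 75),
    "income": (30_000, 300_000),
    "risk_tolerance": (1, 10),
    "horizon_years": (1, 40),
    "liquidity_need": (0.0, 1.0),
}

def sample_profiles(n_profiles, seed=0):
    """Scale unit-cube samples to the hypothetical feature ranges."""
    names = list(FEATURE_RANGES)
    profiles = []
    for row in latin_hypercube(n_profiles, len(names), seed):
        profile = {}
        for name, u in zip(names, row):
            low, high = FEATURE_RANGES[name]
            profile[name] = low + u * (high - low)
        profiles.append(profile)
    return profiles

profiles = sample_profiles(1000)
```

Because every dimension is split into as many strata as there are samples, no region of any single feature's range is left unsampled, which is the property that makes a per-feature sensitivity analysis meaningful.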
Empirical Results
The results point to systematic heuristic collapse:
- Across all tested GPT-family models, investment recommendations, particularly allocations to equities and tax-advantaged accounts, are dominated by self-reported risk tolerance.
- For GPT-4o, surrogate model analysis reveals that risk tolerance alone accounts for 57–88% of predictive weight in allocation decisions, with negligible marginal influence from age, income, investment horizon, or liquidity needs.
- More capable models such as GPT-5.4 spread feature importance somewhat more widely, but the overall pattern remains primary reliance on a single heuristic feature.
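To give a sense of scale for the 57–88% range, the sketch below computes an HHI-style concentration score over surrogate feature importances. It assumes FC is the sum of squared normalized importance shares; the paper's exact adaptation is not specified here and may differ. The even split of the remaining weight across four other features, and the helper `with_dominant_share`, are hypothetical and used for illustration only.

```python
def feature_concentration(importances):
    """HHI-style score: sum of squared importance shares (ASSUMED form of FC)."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    return sum((value / total) ** 2 for value in importances.values())

def with_dominant_share(dominant, share, others):
    """Hypothetical helper: give `dominant` the stated share, split the rest evenly."""
    rest = (1.0 - share) / len(others)
    weights = {name: rest for name in others}
    weights[dominant] = share
    return weights

others = ["age", "income", "horizon", "liquidity"]

# Risk tolerance at the low and high ends of the reported range.
low = feature_concentration(with_dominant_share("risk_tolerance", 0.57, others))
high = feature_concentration(with_dominant_share("risk_tolerance", 0.88, others))

# Reference point: five features weighted equally.
uniform = feature_concentration({name: 1.0 for name in others + ["risk_tolerance"]})

print(f"uniform={uniform:.3f} low={low:.3f} high={high:.3f}")
```

Under these assumptions, equal weighting of five features gives a score of 0.2, while a dominant share of 57% or 88% gives roughly 0.37 or 0.78. A score approaching 1 means the surrogate is close to a single-feature rule.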
Effects of Web Search Augmentation
Web search yields partially mitigating but inconsistent effects:
- Diversification: Web search reduces portfolio concentration (HHI) across models, suggesting improved asset-class spread when LLMs have access to up-to-date market data.
- Personalization: Notably, portfolio recommendations become more homogeneous across clients following web search, as measured by increased average Jaccard similarity, indicating decreased client-specificity.
- Heuristic Collapse: Feature concentration of surrogate models is sometimes attenuated (notably for cash and savings, as well as tax-advantaged accounts for GPT-4o), reflecting more distributed input utilization. In other cases, especially for equities and fixed income, little change or even increased collapse is observed.
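The first two metrics above can pull in opposite directions, as the following sketch shows. It assumes HHI is computed over portfolio weights and Jaccard similarity over the sets of recommended holdings; the paper's exact definitions may differ. The portfolios and fund names are invented toy data, not results from the study.

```python
from itertools import combinations

def portfolio_hhi(weights):
    """Concentration of one portfolio: sum of squared weight shares."""
    total = sum(weights.values())
    return sum((w / total) ** 2 for w in weights.values())

def jaccard(a, b):
    """Jaccard similarity between two collections of holdings."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(portfolios):
    """Average Jaccard similarity over all pairs of client portfolios."""
    pairs = list(combinations(portfolios, 2))
    if not pairs:
        raise ValueError("need at least two portfolios")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def mean_hhi(portfolios):
    return sum(portfolio_hhi(p) for p in portfolios) / len(portfolios)

# TOY DATA (hypothetical): three clients without and with web search.
baseline = [
    {"fund_a": 0.7, "fund_b": 0.3},
    {"fund_c": 0.6, "fund_d": 0.4},
    {"fund_a": 0.5, "fund_e": 0.5},
]
searched = [
    {"fund_a": 0.4, "fund_b": 0.3, "fund_c": 0.3},
    {"fund_a": 0.3, "fund_b": 0.3, "fund_c": 0.4},
    {"fund_a": 0.35, "fund_b": 0.35, "fund_d": 0.3},
]

for label, group in (("baseline", baseline), ("searched", searched)):
    print(label, round(mean_hhi(group), 3), round(mean_pairwise_jaccard(group), 3))
```

In this toy case each searched portfolio is more diversified (lower HHI), yet the three clients receive nearly the same holdings (higher mean Jaccard), which is the combination the bullets above describe.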
Qualitative Rationale Evaluation
An evaluation of the LLMs' rationales, scored by independent LLM judges, shows:
- Web search does not uniformly improve rationale quality. For GPT-4o, rationale specificity, market grounding, and depth all decline after web search integration.
- By contrast, GPT-5.4 shows consistent gains in rationale quality post web search, producing more concrete, market-referenced justifications.
- However, in no model-tool configuration do rationales consistently satisfy all dimensions required for fiduciary suitability.
Implications and Theoretical Significance
The paper's findings have substantive implications for both practical deployment and foundational LLM research:
- Regulatory and Fiduciary Compliance: Current LLMs, even with tool augmentation, cannot be presumed compliant with legal standards mandating holistic, individualized investment advice. Reliance on self-reported risk tolerance is particularly problematic given its well-known volatility and unreliability as a proxy for true risk preference.
- Behavioral Anchoring and Shortcut Learning: The behavior observed aligns with broader findings on shortcut learning in deep neural networks: LLMs systematically favor heuristics over full-context reasoning, mirroring cognitive biases such as anchoring and framing documented elsewhere in the literature (e.g., Suri et al., 2024; Malberg et al., 2025).
- Limitations of Model Scaling and Tool Use: Increases in parameter count or augmentation with retrieval tools do not guarantee deeper integration of contextual information or more suitable personalization. Output quality metrics alone are insufficient for regulatory or practical assurance; population-level input sensitivity auditing is necessary.
- Evaluation Paradigms: The diagnostic pipeline presented advances the LLM evaluation paradigm by introducing a reproducible method for surfacing and quantifying insensitivity to input diversity—a criterion often invisible to standard benchmarks.
Future Research Directions
Several directions for future work are identified:
- Model Family Comparisons: Systematic auditing across open-source and closed-source LLMs, as well as non-GPT architectures, to ascertain the generality of heuristic collapse.
- Intervention Strategies: Fine-tuning, multi-turn conversational prompting, and structured information elicitation should be examined for their efficacy in mitigating heuristic collapse.
- Realistic Data Validation: Extending analysis from synthetic profiles to real-world investor data will be vital for external validation.
- Generalization to Other Domains: Heuristic collapse in LLMs likely extends to medical, legal, and educational recommendation systems; domain-specific adaptations of the auditing pipeline will be necessary.
Conclusion
The study demonstrates that LLMs, when tasked with investment advice, exhibit systematic heuristic collapse: outputs are driven primarily by a single feature, so most of the user's context is functionally ignored. Augmentation with web search partially reduces, but does not eliminate, this failure mode and often homogenizes advice across clients. These findings indicate that merely improving output quality or enabling access to external data is insufficient for fiduciary, individualized decision-making. Instead, explicit auditing of input sensitivity and context integration is essential before LLMs can be reliably deployed in high-stakes advisory domains.