One Size Fits None: Heuristic Collapse in LLM Investment Advice

Published 26 Apr 2026 in cs.CL and cs.LG | (2604.23837v1)

Abstract: LLMs are increasingly deployed as advisors in high-stakes domains -- answering medical questions, interpreting legal documents, recommending financial products -- where good advice requires integrating a user's full context rather than responding to salient surface features. We investigate whether frontier LLMs actually do this, or whether they instead exhibit heuristic collapse: a systematic reduction of complex, multi-factor decisions to a small number of dominant inputs. We study the phenomenon in investment advice, where legal standards explicitly require individualized reasoning over a client's full circumstances. Applying interpretable surrogate models to LLM outputs, we find systematic heuristic collapse: investment allocation decisions are largely determined by self-reported risk tolerance, while other relevant factors contribute minimally. We further find that web search partially attenuates heuristic collapse but does not resolve it. These findings suggest that heuristic collapse is not resolved by web search augmentation or model scale alone, and that deploying LLMs as advisors requires auditing input sensitivity, not just output quality.


Authors (2)

Summary

The paper introduces a diagnostic framework to quantify heuristic collapse in LLMs by analyzing input sensitivity and feature concentration.
It finds that LLMs rely primarily on self-reported risk tolerance, which accounts for 57–88% of surrogate-model predictive weight in GPT-4o's allocation decisions.
Web search augmentation has mixed effects, reducing asset concentration in some cases while often homogenizing personalized advice across clients.

Heuristic Collapse in LLM Investment Advice: A Technical Evaluation

Introduction

The paper "One Size Fits None: Heuristic Collapse in LLM Investment Advice" (2604.23837) investigates the tendency of LLMs to reduce multi-factor decision problems in investment advice to one-dimensional heuristics, a failure mode the authors term heuristic collapse. Rather than synthesizing recommendations from the full context of a user profile, as fiduciary legal standards require, LLMs often anchor on a single salient input, predominantly self-reported risk tolerance, and sideline critical context such as age, income, and liquidity needs. The study provides an empirical pipeline for diagnosing this failure mode systematically and evaluates web search augmentation as a mitigation strategy.

Methodological Contributions

The authors implement a controlled, diagnostic framework to elicit and quantify heuristic collapse. Key methodological components include:

Synthetic Population Generation: 1,000 synthetic client profiles are constructed using Latin hypercube sampling, which stratifies each input feature so that the profiles cover the input domain broadly with low correlation between features. This supports robust sensitivity analysis of how LLM outputs depend on each input.
LLM Auditing via Surrogate Models: For each model and condition, the input-output mapping is approximated with interpretable surrogate models (Random Forest and Ridge Regression). Feature concentration (FC), derived via an adaptation of the Herfindahl-Hirschman Index (HHI), is proposed as an operational metric for heuristic collapse.
Portfolio Recommendation Analysis: LLM outputs are assessed along two axes: diversification (using HHI) and personalization (using pairwise Jaccard similarity). Together, the two measures separate how concentrated each individual portfolio is from how much recommendations differ across clients.
Web Search Augmentation: By requiring models to use web search tools for relevant, real-time financial data, the authors probe whether access to current information facilitates deeper, more individualized reasoning.
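The sampling step can be illustrated with a minimal, standard-library sketch of Latin hypercube sampling. The feature names and ranges below are hypothetical placeholders, since this summary does not list the paper's actual profile fields or bounds, and the authors' own implementation may differ (for example, by using a library sampler).

```python
import random

# Hypothetical feature ranges, chosen only for illustration.
FEATURE_RANGES = {
    "age": (22.0, 75.0),
    "annual_income": (20_000.0, 300_000.0),
    "horizon_years": (1.0, 40.0),
    "risk_tolerance": (1.0, 10.0),
    "liquidity_need": (0.0, 1.0),
}


def latin_hypercube(n_samples, ranges, seed=0):
    """Draw n_samples profiles so that each feature falls in every one of
    n_samples equal-width strata exactly once."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        strata = list(range(n_samples))
        # An independent permutation per feature pairs strata at random,
        # which keeps correlations between features low on average.
        rng.shuffle(strata)
        columns[name] = [
            low + (high - low) * (s + rng.random()) / n_samples
            for s in strata
        ]
    # Transpose the per-feature columns into one dict per profile.
    return [
        {name: columns[name][i] for name in ranges}
        for i in range(n_samples)
    ]


profiles = latin_hypercube(1000, FEATURE_RANGES)
```

Each profile would then be rendered into a prompt and sent to the model under audit. The stratification is what distinguishes this from plain uniform sampling: no region of any single feature's range is left unsampled.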

This framework distinguishes itself from prior LLM evaluation paradigms, which focus predominantly on output quality or factual accuracy, by directly interrogating the sensitivity of outputs to the full range of relevant inputs.

Empirical Results

Heuristic Collapse and Input Sensitivity

The results show systematic heuristic collapse:

Across all tested GPT-family models, investment recommendations, particularly allocations to equities and tax-advantaged accounts, are dominated by self-reported risk tolerance.
For GPT-4o, surrogate model analysis reveals that risk tolerance alone accounts for 57–88% of predictive weight in allocation decisions, with negligible marginal influence from age, income, investment horizon, or liquidity needs.
More capable models such as GPT-5.4 spread importance across somewhat more features, but the overall pattern is still primary reliance on a single feature.
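To make the feature concentration metric concrete, the sketch below computes an HHI-style score as the sum of squared importance shares. This is an assumed form: the summary states only that FC adapts the Herfindahl-Hirschman Index, so the paper's exact normalization may differ. The importance values are illustrative and are not results from the paper. For a Ridge surrogate, one would likely pass absolute standardized coefficients, which is also an assumption.

```python
def feature_concentration(importances):
    """HHI-style concentration of surrogate-model feature importances.

    Returns the sum of squared importance shares, which ranges from 1/k
    (importance spread evenly over k features) to 1.0 (one feature
    carries all of the importance).
    """
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    return sum((value / total) ** 2 for value in importances.values())


# Illustrative importances only: one collapsed profile and one balanced profile.
collapsed = {
    "risk_tolerance": 0.88,
    "age": 0.03,
    "income": 0.03,
    "horizon": 0.03,
    "liquidity": 0.03,
}
balanced = {name: 0.2 for name in collapsed}

fc_collapsed = feature_concentration(collapsed)
fc_balanced = feature_concentration(balanced)
```

With these inputs the collapsed case scores about 0.778 (0.88 squared plus four times 0.03 squared), while the balanced case scores 0.2, the minimum for five features. Under this assumed definition, a surrogate that puts 88% of its weight on risk tolerance sits close to the maximum of 1.0.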

Effects of Web Search Augmentation

Web search mitigates the problem only partially and inconsistently:

Diversification: Web search reduces portfolio concentration (HHI) across models, suggesting improved asset-class spread when LLMs have access to up-to-date market data.
Personalization: Notably, portfolio recommendations become more homogeneous across clients following web search, as measured by increased average Jaccard similarity, indicating decreased client-specificity.
Heuristic Collapse: Feature concentration of surrogate models is sometimes attenuated (notably for cash and savings, as well as tax-advantaged accounts for GPT-4o), reflecting more distributed input utilization. In other cases, especially for equities and fixed income, little change or even increased collapse is observed.
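The two portfolio-level measures above can be sketched as follows. This is a minimal illustration under stated assumptions: HHI is taken as the sum of squared allocation shares within one portfolio, and Jaccard similarity is computed over sets of recommended holdings. The summary does not say which items the paper compares (asset classes, funds, or tickers), so the holding names here are hypothetical.

```python
from itertools import combinations


def portfolio_hhi(weights):
    """Herfindahl-Hirschman Index of one portfolio: sum of squared shares.
    Lower values indicate a more diversified allocation."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum((w / total) ** 2 for w in weights.values())


def jaccard(a, b):
    """Jaccard similarity of two sets of holdings: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty recommendations are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(portfolios):
    """Average Jaccard similarity over all client pairs.
    Higher values indicate less personalized advice."""
    pairs = list(combinations(portfolios, 2))
    if not pairs:
        raise ValueError("need at least two portfolios")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical recommendations for three clients.
client_a = {"us_equity": 0.6, "bonds": 0.3, "cash": 0.1}
client_b = {"us_equity": 0.5, "bonds": 0.3, "intl_equity": 0.2}
client_c = {"us_equity": 0.7, "reit": 0.3}

hhi_a = portfolio_hhi(client_a)
similarity = mean_pairwise_jaccard([client_a, client_b, client_c])
```

Keeping the two measures separate matters for reading the web search results: a tool can lower each portfolio's HHI while raising the mean pairwise Jaccard similarity, which is the pattern of better diversification but weaker personalization described above.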

Qualitative Rationale Evaluation

An evaluation of the models' rationales by independent LLM judges shows:

Web search does not uniformly improve rationale quality. For GPT-4o, rationale specificity, market grounding, and depth all decline after web search integration.
By contrast, GPT-5.4 shows consistent gains in rationale quality with web search, producing more concrete, market-referenced justifications.
However, in no model-tool configuration do rationales consistently satisfy all dimensions required for fiduciary suitability.

Implications and Theoretical Significance

The paper's findings have substantive implications for both practical deployment and foundational LLM research:

Regulatory and Fiduciary Compliance: Current LLMs, even with tool augmentation, cannot be presumed compliant with legal standards mandating holistic, individualized investment advice. Reliance on self-reported risk tolerance is particularly problematic given its well-known volatility and unreliability as a proxy for true risk preference.
Behavioral Anchoring and Shortcut Learning: The behavior observed aligns with broader findings on shortcut learning in deep neural networks: LLMs systematically favor heuristics over full-context reasoning, mirroring cognitive biases such as anchoring and framing documented elsewhere in the literature (e.g., Suri et al., 2024; Malberg et al., 2025).
Limitations of Model Scaling and Tool Use: Neither greater model scale nor augmentation with retrieval tools guarantees deeper integration of contextual information or more suitable personalization. Output quality metrics alone are insufficient for regulatory or practical assurance; population-level auditing of input sensitivity is necessary.
Evaluation Paradigms: The diagnostic pipeline contributes a reproducible method for surfacing and quantifying insensitivity to input diversity, a property that standard benchmarks often fail to expose.

Future Research Directions

Several directions for future work are identified:

Model Family Comparisons: Systematic auditing across open-source and closed-source LLMs, as well as non-GPT architectures, to ascertain the generality of heuristic collapse.
Intervention Strategies: Fine-tuning, multi-turn conversational prompting, and structured information elicitation should be examined for their efficacy in mitigating heuristic collapse.
Realistic Data Validation: Extending analysis from synthetic profiles to real-world investor data will be vital for external validation.
Generalization to Other Domains: Heuristic collapse in LLMs likely extends to medical, legal, and educational recommendation systems; domain-specific adaptations of the auditing pipeline will be necessary.

Conclusion

The study demonstrates that LLMs, when tasked with investment advice, exhibit systematic heuristic collapse: outputs are driven primarily by a single feature, so most of the user's context is effectively ignored. Augmentation with web search partially reduces, but does not eliminate, this failure mode and often homogenizes advice across clients. These findings indicate that improving output quality or enabling access to external data is not sufficient for fiduciary, individualized decision-making. Explicit auditing of input sensitivity and context integration is needed before LLMs can be reliably deployed in high-stakes advisory domains.
