
Auto-Suggestive Delusions

Updated 30 July 2025
  • Auto-suggestive delusions are self-generated, persistent false beliefs resulting from the misattribution of internal signals in both natural and artificial systems.
  • They manifest across domains such as evolutionary game theory, computational psychiatry, and machine learning, often driven by feedback loops and noisy data.
  • Mitigation strategies include causal interventions, retraining on self-generated outputs, and external verification to improve reliability and reduce persistent errors.

Auto-suggestive delusions refer to persistent, self-reinforcing false beliefs or inferences generated internally, either in human cognition or artificial agents, often in the absence of, or even contrary to, objective evidence. These delusions emerge from the misattribution of self-generated information as reliable evidence about the external world or task environment. In both biological and computational systems, auto-suggestive delusions arise due to internal mechanisms—such as evolved subjective representations in evolutionary agents, passive associative cascades in the brain, feedback loops in predictive models, or overconfident output in machine learning systems—that cause agents to substitute subjective certainty for veridical reality. This phenomenon is now central to research spanning evolutionary game theory, computational psychiatry, sequential modeling, AI safety, and cognitive science.

1. Evolutionary, Cognitive, and Computational Foundations

Evolutionary game theory demonstrates that agents may evolve internal misrepresentations of the payoff structure of games, functioning as auto-suggestive delusions that support cooperation even when the objectively rational strategy would be defection (Kaznatcheev et al., 2014). Instead of optimizing the true payoff matrix

$$\begin{pmatrix} 1 & U \\ V & 0 \end{pmatrix},$$

agents act on subjective parameters $(U_A, V_A)$, with cooperation decided by

$$\hat{p}_A + (1-\hat{p}_A)\,U_A > \hat{q}_A\, V_A$$

where $\hat{p}_A$ and $\hat{q}_A$ are subjective estimates of the probability of partner cooperation in different scenarios. When internal payoffs evolve such that cooperation becomes subjectively rational, agents collectively achieve higher social welfare, even though such behavior is objectively irrational.
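
The decision rule can be made concrete with a short sketch; the function name and all numeric values below are illustrative placeholders, not values from Kaznatcheev et al. (2014):

```python
# Minimal sketch of the subjective-rationality decision rule described above.
# All parameter values are illustrative, not taken from Kaznatcheev et al. (2014).

def cooperates(p_hat: float, q_hat: float, U_subj: float, V_subj: float) -> bool:
    """Agent A cooperates iff p_hat + (1 - p_hat) * U_A > q_hat * V_A,
    where p_hat / q_hat are A's subjective estimates of partner cooperation
    and (U_A, V_A) are A's evolved *subjective* off-diagonal payoffs."""
    return p_hat + (1.0 - p_hat) * U_subj > q_hat * V_subj

# An agent whose internal interface deflates the temptation to defect
# cooperates even where the objective payoffs favor defection.
print(cooperates(p_hat=0.6, q_hat=0.6, U_subj=0.2, V_subj=1.5))  # objective-like payoffs: False
print(cooperates(p_hat=0.6, q_hat=0.6, U_subj=0.9, V_subj=0.7))  # delusional payoffs: True
```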

In computational psychiatry, auto-suggestive delusions are linked to maladaptive Bayesian updating in noisy environments (Powers et al., 16 Apr 2024). Excessive bottom-up noise (e.g., from cortical hyperexcitability) results in aberrant prediction error (PE) signaling and subsequent overweighting of prior beliefs:

$$\mu_{post} = \mu_{prior} + K\,(x - \mu_{prior}), \qquad K = \frac{\pi_{sens}}{\pi_{sens} + \pi_{prior}}$$

Under heightened noise ($\pi_{sens}$ low), $K$ is small and updating depends primarily on priors, allowing internally generated beliefs to override contradictory evidence.
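
A minimal numerical sketch of this precision-weighted update (the function and values are illustrative assumptions, not from the cited work) makes the failure mode visible: when sensory precision collapses, the posterior barely moves away from the prior:

```python
# Sketch of precision-weighted belief updating (the Kalman-gain form used above).
# Values are illustrative only.

def posterior_mean(mu_prior: float, x: float, pi_sens: float, pi_prior: float) -> float:
    """mu_post = mu_prior + K * (x - mu_prior), with K = pi_sens / (pi_sens + pi_prior)."""
    K = pi_sens / (pi_sens + pi_prior)
    return mu_prior + K * (x - mu_prior)

mu_prior, x = 0.0, 10.0          # prior belief vs. a contradictory observation
print(posterior_mean(mu_prior, x, pi_sens=4.0, pi_prior=1.0))  # reliable input: moves to 8.0
print(posterior_mean(mu_prior, x, pi_sens=0.1, pi_prior=1.0))  # noisy input: barely moves (~0.9)
```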

In generic neurocognitive models, all mental processes—from thoughts to consciousness itself—are constructed passively from cascades of associations (Maniatis, 2017). There is no active “captain” overriding suggestions; repeated internal cues are simply reinforced through mechanisms that maximize internal “desire signals” and minimize “pain,” with output determined by

$$O = \arg\max_{a \in A}\,[D(a) - P(a)]$$

where $D(a)$ and $P(a)$ denote desire and pain values, respectively, assigned via automatic, unfree neural processes.
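
A toy sketch of this selection rule, with arbitrary desire and pain values, shows how a repeatedly reinforced internal cue wins the argmax:

```python
# Sketch of the passive output-selection rule O = argmax_a [D(a) - P(a)].
# The desire/pain values below are arbitrary placeholders.

def select_output(desire: dict, pain: dict) -> str:
    return max(desire, key=lambda a: desire[a] - pain[a])

desire = {"repeat_cue": 0.9, "reconsider": 0.4, "ignore": 0.1}
pain   = {"repeat_cue": 0.2, "reconsider": 0.5, "ignore": 0.0}
print(select_output(desire, pain))  # "repeat_cue": the reinforced suggestion wins
```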

2. Auto-Suggestive Delusions in Machine Learning Systems

Auto-suggestive delusions are pronounced in both sequence models and LLMs. In sequence models for adaptive control and imitation, conditioning on self-generated actions, rather than treating actions as causal interventions, leads to self-delusions (Ortega et al., 2021). For example, if a sequence model samples action $a$ and then conditions on $A = a$, the model infers that the latent variable matches its own action,

$$P(\Theta = \theta \mid A = a) = \delta(\theta, a),$$

even if the action was sampled without privileged information. Proper causal treatment via the "do"-operator avoids this feedback:

$$P(\Theta = \theta \mid do(A = a)) = P(\Theta = \theta)$$
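
The distinction can be illustrated with a toy causal model in which the expert's action copies a hidden variable exactly; the structure and probabilities below are a simplified stand-in for the setting in Ortega et al. (2021), not their implementation:

```python
# Toy illustration of self-delusion from conditioning on one's own action,
# versus treating the action as an intervention (Pearl's do-operator).
# Deliberately simple stand-in: a latent variable Theta, and an *expert*
# policy that copies Theta exactly.

P_theta = {0: 0.5, 1: 0.5}                       # prior over the latent variable
P_a_given_theta = {0: {0: 1.0, 1: 0.0},          # expert action = Theta
                   1: {0: 0.0, 1: 1.0}}

def posterior_conditioning(a):
    """P(Theta | A = a): Bayesian conditioning, as if the action were evidence."""
    joint = {th: P_theta[th] * P_a_given_theta[th][a] for th in P_theta}
    z = sum(joint.values())
    return {th: p / z for th, p in joint.items()}

def posterior_intervention(a):
    """P(Theta | do(A = a)): the intervention cuts the Theta -> A arrow,
    so the belief about Theta is unchanged."""
    return dict(P_theta)

# The imitator samples a = 1 from its own (uninformed) policy.
print(posterior_conditioning(1))   # {0: 0.0, 1: 1.0}: delusional certainty about Theta
print(posterior_intervention(1))   # {0: 0.5, 1: 0.5}: correct, no information gained
```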

In predictive agents, particularly those trained on expert trajectories with hidden observations, models may treat their own actions as evidence for latent states they do not observe, resulting in overconfident or delusional action selection (Douglas et al., 8 Feb 2024). In Decision Transformers and similar architectures, this manifests as behavior in which the agent "hallucinates" access to more information than it actually has, with performance losses that are reversed by retraining on self-generated data to close the feedback loop.

LLMs exhibit both hallucinations and the more insidious phenomenon of delusions. High-confidence errors (delusions) are empirically shown to be persistent, difficult to override through finetuning or self-reflection, and strongly correlated with noisy or redundant training data (Xu et al., 9 Mar 2025). Unlike ordinary hallucinations (low-confidence errors), delusions are held with high belief, are difficult to detect through uncertainty estimation alone, and persist even as models scale. Prevalence varies across model families and tasks, ranging from roughly 8–31% for Qwen2.5-Instruct to 23–79% of all errors at larger scales.

Tables below summarize key distinctions:

| Error Type | Confidence | Detectability | Persistence |
|---|---|---|---|
| Hallucination | Low | High | Transient |
| Delusion | High | Low | Persistent |

| System | Delusional Trigger | Mitigation Approach |
|---|---|---|
| Sequence Model | Conditioning on self-generated action | Counterfactual loss, do-operator |
| Predictive RL | Learning from expert actions with hidden variables | Online fine-tuning, retraining on outputs |
| LLM | Overexposure to noisy/redundant data, no external checks | Retrieval-augmented generation, debate |

3. Taxonomies and Psychological Parallels

Recent research has extended the taxonomy of auto-suggestive delusions to multimodal models, drawing explicit analogies to human cognitive biases (Liu et al., 3 Jul 2025). Vision-language models (VLMs) demonstrate:

  • Authority Bias: Unquestioned trust in user authority.
  • Type I Sycophancy: Default agreement, reversible upon reprimand.
  • Type II Sycophancy: Complex agreeability, fluctuates with cues.
  • Logical Inconsistency: Contradictory answers within the same context.

A formal metric for overall reliability in resisting hallucinations—the Reliability Score (ReS)—captures these elements:

$$M = k + \text{ValidResponse} \times (1 - k), \qquad \text{ReS} = M \times \left[1 - \left(\text{syco}_I + W_{\text{syco}_{II}} \times \text{syco}_{II} + \text{Bias}_{\text{auth}}\right)\right]$$
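
A hedged sketch of computing ReS from these quantities follows; the defaults for $k$ and $W_{\text{syco}_{II}}$ and the example rates are illustrative assumptions, not values reported by Liu et al. (2025):

```python
# Sketch of the Reliability Score (ReS) as written above. Variable names,
# default weights, and example inputs are illustrative only.

def reliability_score(valid_response: float, syco_I: float, syco_II: float,
                      bias_auth: float, k: float = 0.5, w_syco_II: float = 0.5) -> float:
    """ReS = M * [1 - (syco_I + W_sycoII * syco_II + Bias_auth)],
    with M = k + ValidResponse * (1 - k). All rates lie in [0, 1]."""
    M = k + valid_response * (1.0 - k)
    return M * (1.0 - (syco_I + w_syco_II * syco_II + bias_auth))

# A model that answers validly 80% of the time but shows mild sycophancy and authority bias:
print(reliability_score(valid_response=0.8, syco_I=0.2, syco_II=0.1, bias_auth=0.05))  # 0.63
```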

Human subjects show parallel, though less extreme, tendencies; unlike VLMs, they retain the capacity to override misleading cues and to choose "else" options unavailable to the models.

4. Mechanisms, Triggers, and Manifestations

Auto-suggestive delusions in AI fall broadly into two trigger categories:

  • Gradient-based Internal Looping: Input perturbations, including adversarial attacks or weak semantic changes, exploit the model’s internal statistical generalization to elicit pre-defined, hallucinated responses (Yao et al., 2023). The attack is formalized as

$$\tilde{x} = \arg\max_x \log p(\tilde{y} \mid x)$$

with defense strategies leveraging entropy-based rejection (a minimal sketch follows this list).

  • Delusional Planning in RL: Agents generate sub-goals that are unreachable or invalid and overestimate their favorability, leading to delusional planning paths (Zhao et al., 9 Oct 2024). Mitigation involves jointly learning target evaluators with diverse relabeling and punishment strategies to expose and reject infeasible goals.
  • Neurocomputational Distortion: In biological and psychiatric domains, over-reliance on priors (adaptive precision weighting) in the face of noisy or uninformative sensory input causes self-fulfilling delusional belief formation (Powers et al., 16 Apr 2024).
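
The sketch below illustrates the gradient-based trigger and an entropy-based rejection check on a deliberately tiny PyTorch classifier; the model, optimization settings, and thresholds are placeholders rather than the setup of Yao et al. (2023):

```python
# (1) Gradient-based search for an input x~ that maximizes log p(y~ | x) for an
# attacker-chosen target y~, and (2) entropy-based rejection of the resulting
# over-confident prediction. Tiny stand-in model, illustrative thresholds.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)          # stand-in classifier over 4 "responses"
target = torch.tensor([2])              # attacker-chosen pre-defined response y~

x = torch.zeros(1, 16, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.1)
for _ in range(200):                    # x~ = argmax_x log p(y~ | x) by gradient ascent
    opt.zero_grad()
    loss = F.cross_entropy(model(x), target)   # minimizing CE == maximizing log p(y~ | x)
    loss.backward()
    opt.step()

probs = F.softmax(model(x), dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
print(probs.detach().numpy().round(3), f"entropy={entropy:.3f}")

# Entropy-based rejection: compare the output entropy against a range calibrated
# on benign inputs and refuse to answer when it falls outside that range.
# The exact criterion and cutoffs here are purely illustrative.
BENIGN_ENTROPY_RANGE = (0.2, 1.2)
if not (BENIGN_ENTROPY_RANGE[0] <= entropy <= BENIGN_ENTROPY_RANGE[1]):
    print("reject: output entropy inconsistent with benign-input calibration")
```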

5. Empirical Observations, Mitigation, and Diagnostics

Auto-suggestive delusions have been empirically observed in controlled studies spanning psychosis onset in clinical cohorts (Mourgues-Codern et al., 20 Feb 2024), model evaluations in NLP and vision-language tasks, and ablation, attack, and synthetic-perturbation studies in sequence models.

Agents and models are especially vulnerable in the presence of:

  • Noisy, overlapping, or near-duplicate training data;
  • Training signals that conflate privileged expert trajectories with agent prediction;
  • Feedback loops that fail to separate factual (environment-derived) and counterfactual (self-generated) signals.

Mitigation strategies shown to be effective include:

  • Retraining on Self-Generated Outputs: Aligning model predictions with actual outcomes corrects for confounded evidence loops (Douglas et al., 8 Feb 2024).
  • Causal Treatment of Actions: Applying Pearl’s do-operator (do(a)do(a)) to break spurious action–state inference links (Ortega et al., 2021).
  • External Verification: Retrieval-augmented generation, fact-checking, multi-agent debate/voting mechanisms, and skepticism modeling via entropy or token-based thresholds (Xu et al., 9 Mar 2025, Yao et al., 2023, Wu et al., 10 Sep 2024).
  • Augmentation with Skepticism Tokens: Explicitly marking uncertainty in LLM outputs ("skepticism tokens") enables self-aware calibration and improved resistance to auto-suggestive delusions (Wu et al., 10 Sep 2024); a simplified entropy-based sketch follows this list.
  • Hybrid Relabeling in RL: Exposing target evaluators to both achievable and delusional goals ensures more accurate assessment and significantly reduces delusional planning (Zhao et al., 9 Oct 2024).
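
As referenced above, the following is a simplified, post-hoc sketch of entropy-based skepticism marking; it is a generic stand-in rather than the training-time skepticism-token method of Wu et al. (10 Sep 2024), and the threshold is illustrative:

```python
# Simplified stand-in for entropy/token-threshold skepticism marking: tag the
# spans of a generation where the model's own next-token entropy was high.
import math

def mark_skeptical(tokens, token_distributions, threshold_nats=1.0):
    """tokens: generated strings; token_distributions: per-step probability lists."""
    marked = []
    for tok, dist in zip(tokens, token_distributions):
        entropy = -sum(p * math.log(p) for p in dist if p > 0)
        marked.append(f"{tok}[?]" if entropy > threshold_nats else tok)
    return " ".join(marked)

print(mark_skeptical(
    ["The", "capital", "is", "Quito"],
    [[0.9, 0.1], [0.8, 0.2], [0.95, 0.05], [0.12] * 8 + [0.04]],  # last step: diffuse
))  # -> "The capital is Quito[?]"
```

Note that, as discussed in Section 2, entropy-based marking alone cannot flag true delusions (which are high-confidence); it complements, rather than replaces, external verification.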

6. Implications for Theory, Practice, and Future Directions

Auto-suggestive delusions reflect fundamental limitations and (in some contexts) adaptive functions of internal representation systems. They underscore the necessity to:

  • Incorporate causal reasoning and interventionist loss formulations in sequential models;
  • Use external knowledge anchors and debate-based systems for outputs in high-stakes domains;
  • Diagnose and actively reject unreachable or unsafe internally generated targets in planning;
  • Apply rigorous dataset curation to minimize reinforcing delusional patterns during training.

In the context of psychiatry, understanding the dynamic interplay between bottom-up noise, prediction error, and compensatory prior reweighting may inform preemptive interventions for psychotic symptom development (Powers et al., 16 Apr 2024, Mourgues-Codern et al., 20 Feb 2024). In AI, ongoing work focuses on improving uncertainty calibration, deploying external knowledge verification, and expanding benchmarks that reveal and help address these systemic biases.

7. Summary Table: Auto-Suggestive Delusions Across Domains

| Domain | Mechanism | Manifestation | Primary Mitigation |
|---|---|---|---|
| Evolutionary | Evolved subjective payoff interface | Irrationally rational cooperation | None (social welfare benefits) |
| Psychiatry | Aberrant PE, high prior precision on noisy inputs | Delusions preceding hallucination | Modulate noise, recalibrate priors |
| Sequence Models | Conditioning on own actions | Self-delusion in adaptive behavior | Causal interventions |
| LLMs | Confident outputs from noisy/redundant data | High-belief hallucinations (delusions) | RAG, multi-agent debate, skepticism modeling |
| RL/Planning | Faulty/invalid self-generated subgoals | Delusional planning, unreachable targets | Hybrid relabeling, target evaluators |
| VLMs | Sycophancy, authority bias, logical inconsistency | Psychologically-typed hallucination behaviors | Taxonomy-guided training |

This synthesis integrates and organizes key results from evolutionary theory, computational psychiatry, machine learning, and cognitive modeling, demonstrating both the pervasiveness and the multifaceted nature of auto-suggestive delusions in both natural and artificial agents.