
Auto-Suggestive Delusions

Updated 30 July 2025
  • Auto-suggestive delusions are self-generated, persistent false beliefs resulting from the misattribution of internal signals in both natural and artificial systems.
  • They manifest across domains such as evolutionary game theory, computational psychiatry, and machine learning, often driven by feedback loops and noisy data.
  • Mitigation strategies include causal interventions, retraining on self-generated outputs, and external verification to improve reliability and reduce persistent errors.

Auto-suggestive delusions refer to persistent, self-reinforcing false beliefs or inferences generated internally, either in human cognition or artificial agents, often in the absence of, or even contrary to, objective evidence. These delusions emerge from the misattribution of self-generated information as reliable evidence about the external world or task environment. In both biological and computational systems, auto-suggestive delusions arise due to internal mechanisms—such as evolved subjective representations in evolutionary agents, passive associative cascades in the brain, feedback loops in predictive models, or overconfident output in machine learning systems—that cause agents to substitute subjective certainty for veridical reality. This phenomenon is now central to research spanning evolutionary game theory, computational psychiatry, sequential modeling, AI safety, and cognitive science.

1. Evolutionary, Cognitive, and Computational Foundations

Evolutionary game theory demonstrates that agents may evolve internal misrepresentations of the payoff structure of games, functioning as auto-suggestive delusions that support cooperation even when the objectively rational strategy would be defection (Kaznatcheev et al., 2014). Instead of optimizing the true payoff matrix

$$\begin{pmatrix} 1 & U \\ V & 0 \end{pmatrix},$$

agents act on subjective parameters $(U_A, V_A)$, with cooperation decided by

$$\hat{p}_A + (1-\hat{p}_A)\,U_A > \hat{q}_A\, V_A$$

where $\hat{p}_A$ and $\hat{q}_A$ are subjective estimates of the probability of partner cooperation in different scenarios. When internal payoffs evolve such that cooperation becomes subjectively rational, agents collectively achieve higher social welfare, even though such behavior is objectively irrational.
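
The decision rule can be made concrete with a short sketch; the function name and all numeric values below are illustrative placeholders, not values from Kaznatcheev et al. (2014):

```python
# Minimal sketch of the subjective-rationality decision rule described above.
# All parameter values are illustrative, not taken from Kaznatcheev et al. (2014).

def cooperates(p_hat: float, q_hat: float, U_subj: float, V_subj: float) -> bool:
    """Agent A cooperates iff p_hat + (1 - p_hat) * U_A > q_hat * V_A,
    where p_hat / q_hat are A's subjective estimates of partner cooperation
    and (U_A, V_A) are A's evolved *subjective* off-diagonal payoffs."""
    return p_hat + (1.0 - p_hat) * U_subj > q_hat * V_subj

# An agent whose internal interface deflates the temptation to defect
# cooperates even where the objective payoffs favor defection.
print(cooperates(p_hat=0.6, q_hat=0.6, U_subj=0.2, V_subj=1.5))  # objective-like payoffs: False
print(cooperates(p_hat=0.6, q_hat=0.6, U_subj=0.9, V_subj=0.7))  # delusional payoffs: True
```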

In computational psychiatry, auto-suggestive delusions are linked to maladaptive Bayesian updating in noisy environments (Powers et al., 16 Apr 2024). Excessive bottom-up noise (e.g., from cortical hyperexcitability) results in aberrant prediction error (PE) signaling and subsequent overweighting of prior beliefs:

$$\mu_{post} = \mu_{prior} + K\,(x - \mu_{prior}), \qquad K = \frac{\pi_{sens}}{\pi_{sens} + \pi_{prior}}$$

Under heightened noise ($\pi_{sens}$ low), $K$ is small and updating depends primarily on priors, allowing internally generated beliefs to override contradictory evidence.
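
A minimal numerical sketch of this precision-weighted update (the function and values are illustrative assumptions, not from the cited work) makes the failure mode visible: when sensory precision collapses, the posterior barely moves away from the prior:

```python
# Sketch of precision-weighted belief updating (the Kalman-gain form used above).
# Values are illustrative only.

def posterior_mean(mu_prior: float, x: float, pi_sens: float, pi_prior: float) -> float:
    """mu_post = mu_prior + K * (x - mu_prior), with K = pi_sens / (pi_sens + pi_prior)."""
    K = pi_sens / (pi_sens + pi_prior)
    return mu_prior + K * (x - mu_prior)

mu_prior, x = 0.0, 10.0          # prior belief vs. a contradictory observation
print(posterior_mean(mu_prior, x, pi_sens=4.0, pi_prior=1.0))  # reliable input: moves to 8.0
print(posterior_mean(mu_prior, x, pi_sens=0.1, pi_prior=1.0))  # noisy input: barely moves (~0.9)
```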

In generic neurocognitive models, all mental processes—from thoughts to consciousness itself—are constructed passively from cascades of associations (Maniatis, 2017). There is no active “captain” overriding suggestions; repeated internal cues are simply reinforced through mechanisms that maximize internal “desire signals” and minimize “pain,” with output determined by

$$O = \arg\max_{a \in A}\,[D(a) - P(a)]$$

where $D(a)$ and $P(a)$ denote desire and pain values, respectively, assigned via automatic, unfree neural processes.
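
A toy sketch of this selection rule, with arbitrary desire and pain values, shows how a repeatedly reinforced internal cue wins the argmax:

```python
# Sketch of the passive output-selection rule O = argmax_a [D(a) - P(a)].
# The desire/pain values below are arbitrary placeholders.

def select_output(desire: dict, pain: dict) -> str:
    return max(desire, key=lambda a: desire[a] - pain[a])

desire = {"repeat_cue": 0.9, "reconsider": 0.4, "ignore": 0.1}
pain   = {"repeat_cue": 0.2, "reconsider": 0.5, "ignore": 0.0}
print(select_output(desire, pain))  # "repeat_cue": the reinforced suggestion wins
```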

2. Auto-Suggestive Delusions in Machine Learning Systems

Auto-suggestive delusions are pronounced in both sequence models and LLMs. In sequence models for adaptive control and imitation, conditioning on self-generated actions, rather than treating actions as causal interventions, leads to self-delusions (Ortega et al., 2021). For example, if a sequence model samples action $a$ and then conditions on $A = a$, the model infers that the latent variable matches its own action,

$$P(\Theta = \theta \mid A = a) = \delta(\theta, a),$$

even if the action was sampled without privileged information. Proper causal treatment via the "do"-operator avoids this feedback:

$$P(\Theta = \theta \mid do(A = a)) = P(\Theta = \theta)$$
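
The distinction can be illustrated with a toy causal model in which the expert's action copies a hidden variable exactly; the structure and probabilities below are a simplified stand-in for the setting in Ortega et al. (2021), not their implementation:

```python
# Toy illustration of self-delusion from conditioning on one's own action,
# versus treating the action as an intervention (Pearl's do-operator).
# Deliberately simple stand-in: a latent variable Theta, and an *expert*
# policy that copies Theta exactly.

P_theta = {0: 0.5, 1: 0.5}                       # prior over the latent variable
P_a_given_theta = {0: {0: 1.0, 1: 0.0},          # expert action = Theta
                   1: {0: 0.0, 1: 1.0}}

def posterior_conditioning(a):
    """P(Theta | A = a): Bayesian conditioning, as if the action were evidence."""
    joint = {th: P_theta[th] * P_a_given_theta[th][a] for th in P_theta}
    z = sum(joint.values())
    return {th: p / z for th, p in joint.items()}

def posterior_intervention(a):
    """P(Theta | do(A = a)): the intervention cuts the Theta -> A arrow,
    so the belief about Theta is unchanged."""
    return dict(P_theta)

# The imitator samples a = 1 from its own (uninformed) policy.
print(posterior_conditioning(1))   # {0: 0.0, 1: 1.0}: delusional certainty about Theta
print(posterior_intervention(1))   # {0: 0.5, 1: 0.5}: correct, no information gained
```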

In predictive agents, particularly those trained on expert trajectories with hidden observations, models may treat their own actions as evidence for latent states they do not observe, resulting in overconfident or delusional action selection (Douglas et al., 8 Feb 2024). In Decision Transformers and similar architectures, this manifests as behavior in which the agent "hallucinates" access to more information than it actually has, with performance losses that are reversed by retraining on self-generated data to close the feedback loop.

LLMs exhibit both hallucinations and the more insidious phenomenon of delusions. High-confidence errors (delusions) are empirically shown to be persistent, difficult to override through finetuning or self-reflection, and strongly correlated with noisy or redundant training data (Xu et al., 9 Mar 2025). Unlike ordinary hallucinations (low-confidence errors), delusions are held with high belief, are difficult to detect through uncertainty estimation alone, and persist even as models scale. Prevalence varies across model families and tasks, ranging from roughly 8–31% for Qwen2.5-Instruct to 23–79% of all errors at larger scales.

Tables below summarize key distinctions:

| Error Type | Confidence | Detectability | Persistence |
|---|---|---|---|
| Hallucination | Low | High | Transient |
| Delusion | High | Low | Persistent |

| System | Delusional Trigger | Mitigation Approach |
|---|---|---|
| Sequence Model | Conditioning on self-generated action | Counterfactual loss, do-operator |
| Predictive RL | Learning from expert actions with hidden variables | Online fine-tuning, retraining on outputs |
| LLM | Overexposure to noisy/redundant data, no external checks | Retrieval-augmented generation, debate |

3. Taxonomies and Psychological Parallels

Recent research has extended the taxonomy of auto-suggestive delusions to multimodal models, drawing explicit analogies to human cognitive biases (Liu et al., 3 Jul 2025). Vision-language models (VLMs) demonstrate:

  • Authority Bias: Unquestioned trust in user authority.
  • Type I Sycophancy: Default agreement, reversible upon reprimand.
  • Type II Sycophancy: Complex agreeability, fluctuates with cues.
  • Logical Inconsistency: Contradictory answers within the same context.

A formal metric for overall reliability in resisting hallucinations—the Reliability Score (ReS)—captures these elements:

$$M = k + \text{ValidResponse} \times (1 - k), \qquad \text{ReS} = M \times \left[1 - \left(\text{syco}_I + W_{\text{syco}_{II}} \times \text{syco}_{II} + \text{Bias}_{\text{auth}}\right)\right]$$
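
A hedged sketch of computing ReS from these quantities follows; the defaults for $k$ and $W_{\text{syco}_{II}}$ and the example rates are illustrative assumptions, not values reported by Liu et al. (2025):

```python
# Sketch of the Reliability Score (ReS) as written above. Variable names,
# default weights, and example inputs are illustrative only.

def reliability_score(valid_response: float, syco_I: float, syco_II: float,
                      bias_auth: float, k: float = 0.5, w_syco_II: float = 0.5) -> float:
    """ReS = M * [1 - (syco_I + W_sycoII * syco_II + Bias_auth)],
    with M = k + ValidResponse * (1 - k). All rates lie in [0, 1]."""
    M = k + valid_response * (1.0 - k)
    return M * (1.0 - (syco_I + w_syco_II * syco_II + bias_auth))

# A model that answers validly 80% of the time but shows mild sycophancy and authority bias:
print(reliability_score(valid_response=0.8, syco_I=0.2, syco_II=0.1, bias_auth=0.05))  # 0.63
```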

Human subjects show parallel, though less extreme, tendencies; unlike VLMs, they retain the capacity to override misleading cues and to choose "else" options unavailable to the models.

4. Mechanisms, Triggers, and Manifestations

Auto-suggestive delusions in AI fall broadly into two trigger categories:

  • Gradient-based Internal Looping: Input perturbations, including adversarial attacks or weak semantic changes, exploit the model’s internal statistical generalization to elicit pre-defined, hallucinated responses (Yao et al., 2023). The attack is formalized as

$$\tilde{x} = \arg\max_x \log p(\tilde{y} \mid x)$$

with defense strategies leveraging entropy-based rejection (a minimal sketch follows this list).

  • Delusional Planning in RL: Agents generate sub-goals that are unreachable or invalid and overestimate their favorability, leading to delusional planning paths (Zhao et al., 9 Oct 2024). Mitigation involves jointly learning target evaluators with diverse relabeling and punishment strategies to expose and reject infeasible goals.
  • Neurocomputational Distortion: In biological and psychiatric domains, over-reliance on priors (adaptive precision weighting) in the face of noisy or uninformative sensory input causes self-fulfilling delusional belief formation (Powers et al., 16 Apr 2024).
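
The sketch below illustrates the gradient-based trigger and an entropy-based rejection check on a deliberately tiny PyTorch classifier; the model, optimization settings, and thresholds are placeholders rather than the setup of Yao et al. (2023):

```python
# (1) Gradient-based search for an input x~ that maximizes log p(y~ | x) for an
# attacker-chosen target y~, and (2) entropy-based rejection of the resulting
# over-confident prediction. Tiny stand-in model, illustrative thresholds.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)          # stand-in classifier over 4 "responses"
target = torch.tensor([2])              # attacker-chosen pre-defined response y~

x = torch.zeros(1, 16, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.1)
for _ in range(200):                    # x~ = argmax_x log p(y~ | x) by gradient ascent
    opt.zero_grad()
    loss = F.cross_entropy(model(x), target)   # minimizing CE == maximizing log p(y~ | x)
    loss.backward()
    opt.step()

probs = F.softmax(model(x), dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
print(probs.detach().numpy().round(3), f"entropy={entropy:.3f}")

# Entropy-based rejection: compare the output entropy against a range calibrated
# on benign inputs and refuse to answer when it falls outside that range.
# The exact criterion and cutoffs here are purely illustrative.
BENIGN_ENTROPY_RANGE = (0.2, 1.2)
if not (BENIGN_ENTROPY_RANGE[0] <= entropy <= BENIGN_ENTROPY_RANGE[1]):
    print("reject: output entropy inconsistent with benign-input calibration")
```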

5. Empirical Observations, Mitigation, and Diagnostics

Auto-suggestive delusions have been empirically observed in controlled studies spanning psychosis onset in clinical cohorts (Mourgues-Codern et al., 20 Feb 2024), model evaluations in NLP and vision-language tasks, and ablation, attack, and synthetic-perturbation studies in sequence models.

Agents and models are especially vulnerable in the presence of:

  • Noisy, overlapping, or near-duplicate training data;
  • Training signals that conflate privileged expert trajectories with agent prediction;
  • Feedback loops that fail to separate factual (environment-derived) and counterfactual (self-generated) signals.

Mitigation strategies shown to be effective include:

  • Retraining on Self-Generated Outputs: Aligning model predictions with actual outcomes corrects for confounded evidence loops (Douglas et al., 8 Feb 2024).
  • Causal Treatment of Actions: Applying Pearl’s do-operator (do(a)do(a)) to break spurious action–state inference links (Ortega et al., 2021).
  • External Verification: Retrieval-augmented generation, fact-checking, multi-agent debate/voting mechanisms, and skepticism modeling via entropy or token-based thresholds (Xu et al., 9 Mar 2025, Yao et al., 2023, Wu et al., 10 Sep 2024).
  • Augmentation with Skepticism Tokens: Explicitly marking uncertainty in LLM outputs ("skepticism tokens") enables self-aware calibration and improved resistance to auto-suggestive delusions (Wu et al., 10 Sep 2024); a simplified entropy-based sketch follows this list.
  • Hybrid Relabeling in RL: Exposing target evaluators to both achievable and delusional goals ensures more accurate assessment and significantly reduces delusional planning (Zhao et al., 9 Oct 2024).
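
As referenced above, the following is a simplified, post-hoc sketch of entropy-based skepticism marking; it is a generic stand-in rather than the training-time skepticism-token method of Wu et al. (10 Sep 2024), and the threshold is illustrative:

```python
# Simplified stand-in for entropy/token-threshold skepticism marking: tag the
# spans of a generation where the model's own next-token entropy was high.
import math

def mark_skeptical(tokens, token_distributions, threshold_nats=1.0):
    """tokens: generated strings; token_distributions: per-step probability lists."""
    marked = []
    for tok, dist in zip(tokens, token_distributions):
        entropy = -sum(p * math.log(p) for p in dist if p > 0)
        marked.append(f"{tok}[?]" if entropy > threshold_nats else tok)
    return " ".join(marked)

print(mark_skeptical(
    ["The", "capital", "is", "Quito"],
    [[0.9, 0.1], [0.8, 0.2], [0.95, 0.05], [0.12] * 8 + [0.04]],  # last step: diffuse
))  # -> "The capital is Quito[?]"
```

Note that, as discussed in Section 2, entropy-based marking alone cannot flag true delusions (which are high-confidence); it complements, rather than replaces, external verification.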

6. Implications for Theory, Practice, and Future Directions

Auto-suggestive delusions reflect fundamental limitations and (in some contexts) adaptive functions of internal representation systems. They underscore the necessity to:

  • Incorporate causal reasoning and interventionist loss formulations in sequential models;
  • Use external knowledge anchors and debate-based systems for outputs in high-stakes domains;
  • Diagnose and actively reject unreachable or unsafe internally generated targets in planning;
  • Apply rigorous dataset curation to minimize reinforcing delusional patterns during training.

In the context of psychiatry, understanding the dynamic interplay between bottom-up noise, prediction error, and compensatory prior reweighting may inform preemptive interventions for psychotic symptom development (Powers et al., 16 Apr 2024, Mourgues-Codern et al., 20 Feb 2024). In AI, ongoing work focuses on improving uncertainty calibration, deploying external knowledge verification, and expanding benchmarks that reveal and help address these systemic biases.

7. Summary Table: Auto-Suggestive Delusions Across Domains

| Domain | Mechanism | Manifestation | Primary Mitigation |
|---|---|---|---|
| Evolutionary | Evolved subjective payoff interface | Irrationally rational cooperation | None (social welfare benefits) |
| Psychiatry | Aberrant PE, high prior precision on noisy inputs | Delusions preceding hallucination | Modulate noise, recalibrate priors |
| Sequence Models | Conditioning on own actions | Self-delusion in adaptive behavior | Causal interventions |
| LLMs | Confident outputs from noisy/redundant data | High-belief hallucinations (delusions) | RAG, multi-agent debate, skepticism modeling |
| RL/Planning | Faulty/invalid self-generated subgoals | Delusional planning, unreachable targets | Hybrid relabeling, target evaluators |
| VLMs | Sycophancy, authority bias, logical inconsistency | Psychologically-typed hallucination behaviors | Taxonomy-guided training |

This synthesis integrates and organizes key results from evolutionary theory, computational psychiatry, machine learning, and cognitive modeling, demonstrating both the pervasiveness and the multifaceted nature of auto-suggestive delusions in both natural and artificial agents.