Adversarial HIPs: Probing AI Hallucinations

Updated 26 February 2026
  • Adversarial HIPs are intentionally designed prompts that trigger hallucinations in language and vision models by exploiting vulnerabilities in token and embedding spaces.
  • They employ methods such as random token sequences, gradient-based token swaps, semantic fusion, and image embedding manipulation to achieve high success rates in inducing false outputs.
  • These techniques are key for benchmarking model robustness and have spurred the development of defenses like entropy thresholding, attention head ablation, and anomaly detection.

Adversarial Hallucination-Inducing Prompts (Adversarial HIPs) are inputs deliberately constructed, through structural, semantic, or representation-level manipulation, to systematically provoke hallucinations in LLMs and multimodal models (VLMs, MLLMs). This paradigm extends traditional adversarial prompting beyond benign misdirection: it seeks to expose failure modes, probe the robustness of safety mitigations, and analyze the internal mechanisms that govern model compliance under misleading or out-of-distribution (OoD) pressure. Adversarial HIPs are now a central tool in benchmarking, mechanistic analysis, and security evaluation of next-generation AI systems.

1. Formal Definitions and Mathematical Frameworks

Multiple lines of work formally characterize Adversarial HIPs in both the text and vision-language domains. In LLMs, adversarial HIPs are defined as prompts $x$ (possibly semantically meaningless or syntactically corrupted) that induce a model $f$ to output a predefined hallucinated response $y^*$ outside the set of factual ground-truth responses $\mathcal{T}$:

$$x_{\mathrm{adv}} = \arg\max_{x'} \log p(y^* \mid x')$$

subject to $\|x' - x\|_0 \leq \delta$ (token-level perturbation budget), or, in "OoD" mode, without any semantic constraint (Yao et al., 2023). This construction leverages the first-order Taylor expansion of the softmax logits with respect to input embeddings, enabling a gradient-based search over token substitutions.
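The first-order search described above can be illustrated on a toy model. The sketch below is not the implementation from Yao et al.; it assumes a hypothetical classifier whose output is a softmax over `W` applied to the mean of the prompt's token embeddings `E`, and the names `log_prob`, `embedding_gradient`, and `gradient_swap_attack` are invented for illustration. Candidate swaps are ranked by the first-order gain $(e_b - e_a)^\top \nabla_e \log p(y^* \mid x)$ and the top few are confirmed with an exact forward pass, under an $\ell_0$ budget on edited positions.

```python
import math
import random

def log_prob(prompt, target, E, W):
    """log p(target | prompt) for a toy model: softmax(W @ mean(E[prompt]))."""
    d = len(E[0])
    h = [sum(E[t][j] for t in prompt) / len(prompt) for j in range(d)]
    logits = [sum(w[j] * h[j] for j in range(d)) for w in W]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    probs = [math.exp(v - log_z) for v in logits]
    return logits[target] - log_z, probs

def embedding_gradient(prompt, target, E, W):
    """Gradient of log p(target | prompt) w.r.t. one token's embedding.

    With mean pooling every position shares the same gradient:
    (W[target] - sum_k p_k W[k]) / n.
    """
    _, probs = log_prob(prompt, target, E, W)
    n = len(prompt)
    return [(W[target][j] - sum(p * w[j] for p, w in zip(probs, W))) / n
            for j in range(len(E[0]))]

def gradient_swap_attack(prompt, target, E, W, budget, top_k=5):
    """Greedy swaps: rank by first-order gain, confirm with a forward pass."""
    adv = list(prompt)
    changed = set()  # edited positions; its size is bounded by the l0 budget
    while True:
        grad = embedding_gradient(adv, target, E, W)
        candidates = []
        for i, a in enumerate(adv):
            if len(changed) >= budget and i not in changed:
                continue  # budget spent: only already-edited positions may move
            for b in range(len(E)):
                if b != a:
                    gain = sum((E[b][j] - E[a][j]) * grad[j]
                               for j in range(len(grad)))
                    candidates.append((gain, i, b))
        candidates.sort(reverse=True)
        current, _ = log_prob(adv, target, E, W)
        best = None
        for _, i, b in candidates[:top_k]:
            trial = adv[:i] + [b] + adv[i + 1:]
            score, _ = log_prob(trial, target, E, W)
            if score > current + 1e-12 and (best is None or score > best[0]):
                best = (score, i, b)
        if best is None:
            return adv  # no confirmed improvement left
        adv[best[1]] = best[2]
        changed.add(best[1])

# Toy setup: all sizes and the random weights are illustrative assumptions.
rng = random.Random(0)
VOCAB, DIM, CLASSES = 20, 8, 5
E = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
W = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CLASSES)]
prompt = [rng.randrange(VOCAB) for _ in range(6)]
target = 3  # index standing in for the hallucinated response y*
adv_prompt = gradient_swap_attack(prompt, target, E, W, budget=2)
base_lp, _ = log_prob(prompt, target, E, W)
adv_lp, _ = log_prob(adv_prompt, target, E, W)
print(f"log p(y*|x): {base_lp:.3f} -> {adv_lp:.3f}")
```

The forward-pass confirmation matters because the log-softmax is concave in the logits, so the linear estimate can overstate the true gain of a swap; keeping only confirmed improvements guarantees the objective never decreases.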

For vision-LLMs, adversarial HIPs generalize to image–prompt pairs where either the prompt exerts misleading pressure on the model or the image is adversarially perturbed in representation space. The “DeepSeek on a Trip” methodology (Islam et al., 11 Feb 2025) formalizes this as an embedding-manipulation attack, optimizing for adversarial image $x_{\mathrm{a}}$ such that the mean-pooled vision encoder embedding $z_{\mathrm{a}}$ approaches that of a semantic target $z_{\mathrm{t}}$, subject to visual similarity ($\text{SSIM}(x_{\mathrm{o}}, x_{\mathrm{a}}) > \tau$) and hard pixel constraints:

$$L_{\text{halluc}}(x_{\mathrm{a}}, x_{\mathrm{t}}) = \|g(f_v(x_{\mathrm{a}})) - g(f_v(x_{\mathrm{t}}))\|_2^2$$

These adversarial images, combined with textual prompts, force the VLM to hallucinate the existence of target objects or content, even if absent in the original.
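A minimal sketch of this objective, under strong simplifying assumptions, is given below. It is not the cited attack: the vision encoder is replaced by a hypothetical linear patch encoder `A`, pooling is a plain mean over patches, and the SSIM constraint is swapped for a simpler $\ell_\infty$ budget `eps` around the source image, since SSIM is too long to implement here. All names (`mean_pooled_embedding`, `halluc_loss`, `embedding_attack`) are illustrative.

```python
import random

def mean_pooled_embedding(image, A):
    """Toy g(f_v(x)): linear patch encoder A, then mean pooling over patches."""
    feats = [[sum(A[r][c] * patch[c] for c in range(len(patch)))
              for r in range(len(A))] for patch in image]
    return [sum(f[r] for f in feats) / len(feats) for r in range(len(A))]

def halluc_loss(image, z_target, A):
    """Squared L2 distance between pooled embeddings."""
    z = mean_pooled_embedding(image, A)
    return sum((a - t) ** 2 for a, t in zip(z, z_target))

def embedding_attack(x_orig, x_target, A, eps=0.1, step=0.05, iters=200):
    """Projected gradient descent on the embedding-matching loss."""
    z_t = mean_pooled_embedding(x_target, A)
    x_adv = [list(p) for p in x_orig]
    n_patches = len(x_adv)
    for _ in range(iters):
        z_a = mean_pooled_embedding(x_adv, A)
        diff = [a - t for a, t in zip(z_a, z_t)]
        # dL/dx_p = (2 / P) * A^T (z_a - z_t), identical for every patch p
        grad = [2.0 / n_patches * sum(A[r][c] * diff[r] for r in range(len(A)))
                for c in range(len(A[0]))]
        for p in range(n_patches):
            for c in range(len(grad)):
                v = x_adv[p][c] - step * grad[c]
                # Stay within eps of the source pixel (stand-in for SSIM).
                v = min(max(v, x_orig[p][c] - eps), x_orig[p][c] + eps)
                # Hard pixel constraint: valid intensity range.
                x_adv[p][c] = min(max(v, 0.0), 1.0)
    return x_adv

# Toy data: 4 patches of 6 pixels each, 4-dimensional embeddings (assumed sizes).
rng = random.Random(0)
A = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(4)]
x_orig = [[rng.uniform(0.2, 0.8) for _ in range(6)] for _ in range(4)]
x_target = [[rng.uniform(0.0, 1.0) for _ in range(6)] for _ in range(4)]

z_target = mean_pooled_embedding(x_target, A)
loss_before = halluc_loss(x_orig, z_target, A)
x_adv = embedding_attack(x_orig, x_target, A)
loss_after = halluc_loss(x_adv, z_target, A)
max_change = max(abs(a - o) for pa, po in zip(x_adv, x_orig)
                 for a, o in zip(pa, po))
print(f"loss {loss_before:.4f} -> {loss_after:.4f}, max pixel change {max_change:.3f}")
```

The step size is chosen small relative to the curvature of this toy quadratic loss so that each projected step lowers the loss; a real encoder is nonlinear and would need automatic differentiation and a tuned schedule.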

2. Taxonomy and Construction of Adversarial HIPs

Adversarial HIPs span multiple construction methodologies:

  • Random/OoD token sequences: Randomized input triggers shown to elicit model hallucinations far above chance due to transformer embedding dynamics, even without semantic coherence (Yao et al., 2023). Success rates of 80.77% (Vicuna-7B) and 30.77% (LLaMA2-7B-chat) for OoD attacks underscore the fundamental susceptibility.
  • Token-level gradient attack: Systematic replacement of prompt tokens with those that maximize the target hallucination log-probability, under an $\ell_0$-budget. Large batch searches across the vocabulary enable high success, with human-readable or barely-modified prompts (Yao et al., 2023).
  • Semantic fusion and pressure: Forcing semantically distant concept fusion (e.g., "periodic table of elements and tarot divination") reliably induces hallucinated reasoning patterns (Sato, 1 May 2025), operationalized via the condition $d(A,B) \geq \tau$ in concept embedding space.
  • Structural coercion in VLMs: Prompts that linguistically or pragmatically "pressure" a model (through intensity or format rigidity) cause vision-LLMs to over-copy prompt wording, especially for object counting and attribute identification tasks (Rudman et al., 8 Jan 2026). Object-count offset prompts (asking for more objects than present) reliably trigger hallucination.
  • Embedding manipulation in vision: Pixel-space optimization of images to elicit a hallucinated response, with the attack objective maximizing mean-pool embedding proximity and regularizing visual similarity to the source (Islam et al., 11 Feb 2025).

| Attack Type | Domain | Construction Mechanism |
| --- | --- | --- |
| Weak semantic / gradient swap | LLM | Token-level embedding optimization |
| OoD random prompt | LLM | Nonsemantic, random tokens |
| Semantic fusion | LLM | Fused distant concepts |
| Structural prompt coercion | VLM | Prompt format/tone hallucination |
| Embedding manipulation | VLM/MLLM | Visual representation attack |
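The semantic-fusion condition $d(A,B) \geq \tau$ can be checked mechanically once concept embeddings are available. The sketch below assumes cosine distance as $d$, which the source does not specify, and uses hand-made three-dimensional placeholder vectors rather than embeddings from any real model; the threshold `TAU` is likewise an arbitrary illustrative value.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def is_fusion_candidate(emb_a, emb_b, tau):
    """True when two concepts are distant enough to satisfy d(A, B) >= tau."""
    return cosine_distance(emb_a, emb_b) >= tau

# Placeholder embeddings (assumptions, not taken from a real encoder).
concepts = {
    "periodic table": [1.0, 0.0, 0.0],
    "chemistry": [1.0, 0.1, 0.0],
    "tarot": [0.0, 1.0, 0.0],
}
TAU = 0.5  # illustrative threshold

for a, b in [("periodic table", "tarot"), ("periodic table", "chemistry")]:
    d = cosine_distance(concepts[a], concepts[b])
    print(f"{a} + {b}: d = {d:.3f}, candidate = {is_fusion_candidate(concepts[a], concepts[b], TAU)}")
```

In practice the distance measure and threshold would have to be calibrated against the embedding model in use.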

3. Empirical Evaluation and Quantitative Results

Adversarial HIP methodologies have demonstrated alarmingly high success rates in triggering hallucinations across multiple architectures.

  • Weak semantic attacks achieve up to 92.31% success on Vicuna-7B and 53.85% on LLaMA2-7B-chat, even with limited token substitutions (Yao et al., 2023).
  • OoD random inputs (prompt length 30) raise success to 65.38% in LLaMA2-7B-chat.
  • Embedding-based attacks on DeepSeek Janus (MLLM) boost targeted hallucination rates on COCO from 0.5% baseline to 99.0% post-attack (closed-form), with SSIM to source images remaining >0.88 (Islam et al., 11 Feb 2025).
  • For prompt-induced hallucination in vision-language counting, ablating as few as the top 3–10 "copy-heads" drops hallucinated prompt-match rates by 40–60 points (e.g., 56.5% to 3.2% in Qwen-VL), with correction rates rising comparably (Rudman et al., 8 Jan 2026).

These attacks hold up across domains and model sizes, and transfer across open-source, closed-source, and vision-language models.

4. Mechanistic Insights and Model Vulnerabilities

Adversarial HIPs reveal intrinsic vulnerabilities at multiple levels:

  • Token embedding geometry: Transformers respond linearly to single-token swaps in embedding space, enabling adversarial directionality via the first-order logit gradient (Yao et al., 2023).
  • Prompt-copying heads: In VLMs, prompt-induced hallucinations are often mediated by a small, early-layer subset of attention heads. These "PIH-heads" are causally responsible for over-reliance on prompt text at the expense of grounded evidence (Rudman et al., 8 Jan 2026).
  • Fusion vs. comprehension: Semantically incoherent prompt fusion, not simply the inclusion of unrelated concepts, is the principal driver of LLM hallucination. Coherent fusion or logical transitions stabilize generation (Sato, 1 May 2025).
  • Representation-level attacks: For MLLMs, pixel-level perturbations exploiting the image–embedding interface subvert the semantic bottleneck to induce false visual perception—revealing weak coupling between visual and language streams (Islam et al., 11 Feb 2025).
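The prompt-copying-head mechanism lends itself to a small illustration of zero-ablation. The sketch below is a toy, not the procedure from Rudman et al.: it assumes heads are ranked by the share of attention mass they place on prompt-token positions (a hypothetical selection criterion), and it models a layer's output as the plain sum of per-head output vectors. The attention rows and head outputs are hand-written placeholders.

```python
def prompt_attention_share(attn_row, prompt_positions):
    """Fraction of one head's attention mass that lands on prompt tokens."""
    return sum(attn_row[i] for i in prompt_positions)

def top_k_copy_heads(attn, prompt_positions, k):
    """Indices of the k heads attending most strongly to the prompt."""
    ranked = sorted(range(len(attn)),
                    key=lambda h: prompt_attention_share(attn[h], prompt_positions),
                    reverse=True)
    return sorted(ranked[:k])

def layer_output(head_outputs, ablated=()):
    """Sum of per-head outputs, with ablated heads contributing zero."""
    dim = len(head_outputs[0])
    return [sum(out[j] for h, out in enumerate(head_outputs) if h not in ablated)
            for j in range(dim)]

# Toy layer: 4 heads; key positions 0-2 are image tokens, 3-4 are prompt tokens.
attn = [
    [0.30, 0.30, 0.30, 0.05, 0.05],  # head 0: mostly looks at the image
    [0.05, 0.05, 0.10, 0.40, 0.40],  # head 1: mostly looks at the prompt
    [0.20, 0.20, 0.20, 0.20, 0.20],  # head 2: diffuse
    [0.00, 0.05, 0.05, 0.50, 0.40],  # head 3: mostly looks at the prompt
]
# Per-head output vectors; dimension 1 stands in for the "prompt-suggested answer" direction.
head_outputs = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5], [0.0, 3.0]]

copy_heads = top_k_copy_heads(attn, prompt_positions=[3, 4], k=2)
full = layer_output(head_outputs)
ablated = layer_output(head_outputs, ablated=copy_heads)
print("copy heads:", copy_heads)
print("full output:", full, "ablated output:", ablated)
```

Removing the two prompt-focused heads leaves the image-grounded component of the output untouched while sharply reducing the component that follows the prompt, which mirrors the causal role the text attributes to these heads.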

5. Defenses and Mitigation Strategies

Adversarial HIPs motivate several lightweight and model-agnostic defense strategies:

  • Entropy thresholding: Rejecting generation when the model’s first-token output distribution has entropy above a calibrated threshold blocks up to 61.5% of adversarially perturbed prompts while maintaining >99% recall for genuine queries (Yao et al., 2023).
  • Attention head knockout/steering: Selective ablation or control of prompt-copying heads in VLMs can halve hallucination rates without degrading base performance (Rudman et al., 8 Jan 2026).
  • Prompt regularization: Adding explicit "do not hallucinate" clauses, minimizing generation entropy (e.g., single-letter answers in VQA), and constraining prompt length (Wu et al., 2024).
  • Input anomaly detection: Embedding space anomaly detectors (e.g., SSIM dips, embedding-geodesic distances), randomized smoothing, and neuron-level defenses increase resistance to embedding-based attacks (Islam et al., 11 Feb 2025).
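Entropy thresholding, the first defense listed above, is simple enough to sketch directly. The code below assumes access to the model's first-token probability distribution and calibrates the threshold as the entropy value that keeps a chosen fraction of known-genuine queries; this calibration rule and the names `calibrate_threshold` and `should_refuse` are illustrative assumptions, not the procedure from the cited work. The distributions are toy four-token examples.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def calibrate_threshold(benign_entropies, keep=0.99):
    """Smallest observed entropy that accepts at least `keep` of benign queries."""
    ranked = sorted(benign_entropies)
    idx = max(0, math.ceil(keep * len(ranked)) - 1)
    return ranked[idx]

def should_refuse(first_token_probs, threshold):
    """Refuse generation when the first-token distribution is too uncertain."""
    return entropy(first_token_probs) > threshold

# Toy first-token distributions for genuine queries (assumed data).
benign = [
    [0.97, 0.01, 0.01, 0.01],
    [0.90, 0.05, 0.03, 0.02],
    [0.80, 0.10, 0.05, 0.05],
]
threshold = calibrate_threshold([entropy(p) for p in benign])

suspicious = [0.25, 0.25, 0.25, 0.25]  # near-uniform, as an OoD prompt might produce
print(f"threshold = {threshold:.3f} nats")
print("refuse suspicious prompt:", should_refuse(suspicious, threshold))
```

Because the check only reads one output distribution, it adds negligible cost per query; the trade-off lies in where the threshold sits relative to legitimately open-ended prompts.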

6. Broader Impact and Implications for Model Robustness

Adversarial HIPs blur the distinction between natural language prompts and formal adversarial examples. Their existence reveals structural fragilities in both generation and grounding, indicating that hallucination is not merely a failure of data supervision but a predictable outcome of model architecture and input–output training regimes. Systematic study has prompted new evaluation frameworks, certified-robustness proposals, and introspective self-refusal mechanisms. For safety-critical deployments and responsible AI, adversarial HIPs constitute a necessary testbed alongside conventional benchmarks, exposing latent defects and guiding mitigation research (Yao et al., 2023, Islam et al., 11 Feb 2025, Wu et al., 2024).
