AgentTypo-pro: Adaptive Prompt Injection

Updated 12 October 2025

AgentTypo-pro is an adaptive framework for typographic prompt injection in black-box LVLM multimodal agents, emphasizing stealth and multi-LLM collaboration.
The system integrates Attacker, Scorer, and Summarizer modules with retrieval-augmented generation in an iterative loop to refine adversarial prompts.
Utilizing the ATPI algorithm with a Tree-structured Parzen Estimator, it balances prompt effectiveness and visual stealth, markedly increasing attack success rates.

AgentTypo-pro is an adaptive framework for typographic prompt injection attacks targeting multimodal agents built on large vision-language models (LVLMs) in black-box settings. Developed as an extension of the AgentTypo system, AgentTypo-pro leverages multi-LLM collaboration, automated visual prompt insertion, continual learning, and strategy abstraction to maximize the attack success rate (ASR) while minimizing human detectability of adversarial manipulations. The results show that multimodal agents are considerably more vulnerable to visual prompt injection than earlier attacks indicated, and they expose critical technical and security challenges for current LVLM systems (Li et al., 5 Oct 2025).

1. System Architecture and Framework Components

AgentTypo-pro consists of several coordinated modules operating in an iterative refinement loop:

Attacker LLM Module: Generates candidate prompts for injection, utilizing the current attack objective, previous strategies, and historical logs.
Scorer LLM Module: Evaluates the effectiveness of the injected prompt by scoring agent outputs against adversarial goals.
Summarizer LLM Module: Abstracts successful attack strategies from recent trajectories, updating the strategy repository for future use.
Retrieval-Augmented Generation (RAG): Retrieves the most effective historical examples to guide the generator toward increasingly potent prompt designs.

The system workflow is cyclical: each run uses feedback from prior attempts, distilled strategies, and retrieved examples, feeding back successful attacks for continual self-improvement. This establishes an expert "red-teaming" loop in black-box conditions without requiring access to internal agent models.
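
The retrieval step can be illustrated with a minimal sketch. The source does not specify how examples are ranked, so everything here is an assumption made for illustration: the `AttackRecord` type, the token-overlap (Jaccard) similarity, the `min_score` cutoff, and the placeholder prompt strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AttackRecord:
    """One logged attempt (hypothetical schema, not from the source)."""
    goal: str
    prompt: str
    score: float  # Scorer output in [0, 1]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_examples(history: List[AttackRecord], goal: str,
                      k: int = 2, min_score: float = 0.5) -> List[AttackRecord]:
    """Keep sufficiently effective records, then rank by goal similarity
    and break ties by score (assumed ranking rule)."""
    candidates = [r for r in history if r.score >= min_score]
    ranked = sorted(candidates,
                    key=lambda r: (jaccard(r.goal, goal), r.score),
                    reverse=True)
    return ranked[:k]


history = [
    AttackRecord("change the listed email", "placeholder prompt A", 0.9),
    AttackRecord("change the listed price", "placeholder prompt B", 0.7),
    AttackRecord("change the listed email", "placeholder prompt C", 0.2),
    AttackRecord("recommend another product", "placeholder prompt D", 0.8),
]
top = retrieve_examples(history, "change the listed email address", k=2)
```

Record C is dropped by the score cutoff despite matching the goal, which reflects the idea that only effective historical examples should guide later rounds.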

2. Automatic Typographic Prompt Injection (ATPI) Algorithm

The ATPI algorithm is central to AgentTypo-pro, embedding adversarial textual prompts into images in a manner both highly recoverable by LVLM captioners and inconspicuous to human observers. Key technical details:

Prompt Rebuilt Loss:

$L_\text{prompt\_rebuilt} = -\frac{1}{n} \sum_{i=1}^{n} \text{Sim}(\mathcal{E}_\text{text}(P), \mathcal{E}_\text{text}(C_i))$

where $P$ is the injected prompt, $C_i$ is the caption from the $i$-th captioner, $n$ is the number of captioners, and $\mathcal{E}_\text{text}$ denotes the text embedding.

Stealthiness Loss:

$L_\text{stealthiness} = \text{LPIPS}(I_\text{original}, I_\text{altered})$

ATPI Objective:

$L_\text{ATPI} = L_\text{prompt\_rebuilt} + \lambda \cdot L_\text{stealthiness}$

A Tree-structured Parzen Estimator (TPE) performs black-box hyperparameter optimization over placement, font, size, color, line numbers, contrast, and transparency, balancing the trade-off between prompt effectiveness and stealth.
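
As a worked illustration of the objective, the sketch below evaluates the two loss terms on toy inputs. It assumes cosine similarity for $\text{Sim}$, uses hand-written two-dimensional vectors in place of real text embeddings, and takes the LPIPS value as a precomputed number rather than computing it from images; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, used here as an assumed choice for Sim."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def prompt_rebuilt_loss(prompt_emb: Sequence[float],
                        caption_embs: List[Sequence[float]]) -> float:
    """Negative mean similarity between the prompt and each caption."""
    return -sum(cosine(prompt_emb, c) for c in caption_embs) / len(caption_embs)


def atpi_loss(prompt_emb: Sequence[float],
              caption_embs: List[Sequence[float]],
              stealth_distance: float,
              lam: float = 1.0) -> float:
    """L_ATPI = L_prompt_rebuilt + lambda * L_stealthiness.
    stealth_distance is a precomputed perceptual distance (LPIPS in the source)."""
    return prompt_rebuilt_loss(prompt_emb, caption_embs) + lam * stealth_distance


# Toy values: one captioner recovers the prompt exactly, the other not at all.
prompt_emb = [1.0, 0.0]
caption_embs = [[1.0, 0.0], [0.0, 1.0]]
rebuilt = prompt_rebuilt_loss(prompt_emb, caption_embs)
total = atpi_loss(prompt_emb, caption_embs, stealth_distance=0.2, lam=0.5)
```

With these toy numbers the rebuilt term is the negative mean of similarities 1 and 0, and the stealth term adds $0.5 \times 0.2$. Lower totals are better, so $\lambda$ controls how much recoverability is traded for a smaller perceptual change.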

3. Multi-LLM Iterative Prompt Optimization

AgentTypo-pro enhances its attacks through an iterative multi-LLM process:

In each round, the Attacker LLM synthesizes a new prompt $p_t$ leveraging the adversarial goal $\phi$, previous chat/evaluation history $H_t$, RAG-retrieved examples $e_1, \ldots, e_k$, and a strategy sample $\psi_t$:

$p_t = f^\text{attack}_\Theta(\phi, H_t, \{e_1,...,e_k\}, \psi_t)$

The Scorer LLM reads the executed agent actions and outputs a score $\sigma_t \in [0,1]$.
Once the score meets a preset threshold, the Summarizer LLM abstracts the differences between failed and successful attempts into generalized strategies:

$\psi_\text{new} = f^\text{abstract}(\tau^-, \tau^+, \Psi)$

where $\tau^-$ and $\tau^+$ denote the negative (failed) and positive (successful) examples and $\Psi$ is the strategy library.

Newly abstracted strategies and retrieved instances are fed into subsequent prompt generations, facilitating progressive knowledge accumulation and reuse.
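
The control flow of this loop can be sketched as follows. This is a structural illustration only, not the authors' implementation: every LLM and the target agent are passed in as plain callables, the toy stand-ins at the bottom return fixed values, and the round limit, threshold, and round-robin strategy selection (in place of sampling) are assumptions.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, float]]  # H_t as (prompt, score) pairs


def refine_prompt(
    goal: str,
    attacker: Callable[[str, History, List[str], Optional[str]], str],
    run_agent: Callable[[str], str],
    scorer: Callable[[str, str], float],
    summarizer: Callable[[List[str], List[str], List[str]], str],
    retrieve: Callable[[str], List[str]],
    strategies: List[str],
    max_rounds: int = 5,
    threshold: float = 0.8,
):
    """Run attacker -> agent -> scorer rounds until the score meets the threshold."""
    history: History = []
    for t in range(max_rounds):
        examples = retrieve(goal)
        # Round-robin pick stands in for drawing a strategy sample.
        strategy = strategies[t % len(strategies)] if strategies else None
        prompt = attacker(goal, history, examples, strategy)
        actions = run_agent(prompt)
        score = scorer(goal, actions)
        history.append((prompt, score))
        if score >= threshold:
            # Abstract a new strategy from negative vs. positive examples.
            negatives = [p for p, s in history if s < threshold]
            strategies.append(summarizer(negatives, [prompt], strategies))
            return prompt, score, history
    best = max((s for _, s in history), default=0.0)
    return None, best, history


# Toy stand-ins so the loop runs end to end; none of these call a real model.
toy_scores = [0.2, 0.5, 0.9]
library = ["seed strategy"]
best, best_score, log = refine_prompt(
    goal="toy goal",
    attacker=lambda goal, hist, ex, strat: f"candidate-{len(hist)}",
    run_agent=lambda prompt: prompt,
    scorer=lambda goal, actions: toy_scores[int(actions.split("-")[1])],
    summarizer=lambda neg, pos, lib: f"strategy learned from {len(neg)} failures",
    retrieve=lambda goal: [],
    strategies=library,
)
```

In this toy run the third candidate crosses the threshold, so the summarizer is invoked once and the strategy library grows by one entry that later runs could reuse.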

4. Experimental Performance

Extensive experiments on the VWA-Adv benchmark, spanning domains such as Classifieds, Shopping, and Reddit, and across multiple LVLM platforms (GPT-4V, GPT-4o, GPT-4o-mini, Gemini-1.5 Pro, Claude-3 Opus), report the following attack success rates (ASR):

Agent/Setting	Previous SOTA	AgentTypo (Base)	AgentTypo-pro
GPT-4o (Image)	0.23 (AgentAttack)	0.45	~0.44–0.45
GPT-4o (Image+Text)	0.64 (SOTA)	0.68	up to ~0.75

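AgentTypo-pro achieves substantially higher ASR than the previous state of the art in both image-only and image+text scenarios, indicating generalization across agents and configurations. Relative to base AgentTypo, the reported gain appears in the image+text setting (up to ~0.75 versus 0.68), while image-only ASR is roughly unchanged (~0.44–0.45 versus 0.45). Because the attack is delivered through the image channel, it also evades defenses that inspect only text inputs.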

5. Threat Model and Security Implications

AgentTypo-pro executes practical attacks against real-world agents, causing misextraction of critical information (e.g., incorrect email, false product) or unwanted agent actions (e.g., unauthorized purchase). The stealthy nature of typographic prompt placement defeats most conventional text-based defense mechanisms, making detection challenging without specialized visual anomaly detection or captioning model audits.

This suggests that multimodal LVLM agents deployed in open-world settings must address new attack vectors through enhanced visual input validation, potentially at the cost of increased computational overhead. A plausible implication is a shift toward defense architectures that employ fast, robust visual anomaly detectors or ensemble captioner screening modules.

6. Technical Formulas and Optimization

AgentTypo-pro is governed by precise optimization objectives and iterative update rules:

ATPI loss combines prompt recovery and stealth objectives.
TPE selects parameter configurations that maximize $r(x|\mathcal{D}) = l(x)/g(x)$, where $l(x)$ and $g(x)$ are the densities fitted to the "good" and "bad" configurations observed so far in $\mathcal{D}$.
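
The ratio criterion can be made concrete with a deliberately simplified one-dimensional sketch. It assumes a single parameter in $[0,1]$ (for example, text opacity), a fixed-bandwidth Gaussian kernel density estimate, a fixed candidate grid, and a quantile `gamma` for the good/bad split; production TPE implementations such as Optuna's `TPESampler` instead sample candidates from $l(x)$ and adapt the bandwidth. The observation values are made up for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def kde(points: Sequence[float], bandwidth: float) -> Callable[[float], float]:
    """Fixed-bandwidth Gaussian kernel density estimate over 1-D points."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x: float) -> float:
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def tpe_suggest(observations: List[Tuple[float, float]],
                candidates: Sequence[float],
                gamma: float = 0.25,
                bandwidth: float = 0.1) -> float:
    """Pick the candidate maximizing l(x)/g(x); observations are (x, loss) pairs."""
    if len(observations) < 2:
        raise ValueError("need at least two observations to split good/bad")
    ranked = sorted(observations, key=lambda o: o[1])  # lower loss is better
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    # Small constant guards against division by zero far from any bad point.
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))


# Made-up (parameter, loss) pairs; the two lowest losses sit near x = 0.5-0.6.
observations = [(0.1, 0.9), (0.3, 0.4), (0.5, -0.2), (0.6, -0.3),
                (0.8, 0.1), (0.9, 0.5), (0.2, 0.7), (0.7, -0.1)]
suggestion = tpe_suggest(observations, [0.05, 0.25, 0.45, 0.55, 0.65, 0.85])
```

The suggested value lies between the two best observations, where the "good" density is high and the "bad" density is low, which is the behavior the ratio $l(x)/g(x)$ is meant to reward.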

Every attack step is grounded in measurable, model-agnostic similarity and perceptual distance metrics, which supports transferability across LVLM platforms.

7. Future Directions and Open Challenges

The framework offers several avenues for advancement:

Enhanced stealth: Refinements to the visual loss, transparency manipulation, and placement algorithms to further minimize detectability.
Expanded evaluation: More domains, agent architectures, and input configurations to probe real-world applicability.
Defense mechanisms: Development of fast and robust visual cue detection systems, possibly integrating smaller captioning models or anomaly detectors but balancing computational cost.
Continual learning: Strategy abstraction and retrieval can be further optimized to adapt to evolving defenses.

Reliable, scalable defenses against typographic prompt injection in multimodal models remain an open problem.

Summary

AgentTypo-pro is an advanced, adaptive typographic prompt injection attack framework for black-box multimodal agents, integrating automated visual prompt embedding, multi-LLM collaborative refinement, and continual strategic learning. Empirical results show markedly higher attack success rates versus previous baselines, revealing serious vulnerabilities in LVLM-powered agents and underlining the immediate need for robust, efficient defensive countermeasures (Li et al., 5 Oct 2025).

References (1)

Li et al. (2025). AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents.
