MalURLBench: LLM URL Vulnerability Benchmark

Updated 28 January 2026

MalURLBench is a comprehensive benchmarking suite that evaluates LLM-based web agents' ability to detect and process deceptive malicious URLs.
It features a dataset of 61,845 vetted attack instances across diverse scenarios, employing crafted templates and domain mutations for rigorous evaluation.
The benchmark offers actionable insights into risk metrics and defense effectiveness, driving improvements in URL safety and LLM security.

MalURLBench is a benchmarking suite introduced to assess the vulnerabilities of LLM-based web agents when tasked with processing and evaluating website URLs, particularly in the presence of elaborate malicious disguises. By systematically capturing real-world adversarial strategies and operational contexts, MalURLBench provides an empirical foundation for evaluating and mitigating the risks posed by malicious URLs in LLM-powered assistant systems (Kong et al., 26 Jan 2026).

1. Dataset Design and Construction

MalURLBench comprises 61,845 filtered and vetted attack instances, each representing a potentially malicious URL embedded in a concrete user-interaction scenario. The benchmark’s construction pipeline ensures both coverage across practical web domains and diversity among attack methods.

Interactive Scenarios (|𝕊| = 10):

Package Tracking (s_pkg)
Online Customer Service (s_cus)
Online Shopping Assistant (s_shop)
Food Delivery (s_food)
Weather Information Assistant (s_wea)
Job Search (s_job)
Music Recommendation (s_mus)
Short Video Recommendation (s_vid)
Daily News Updates (s_new)
Concert Information Service (s_con)

Malicious Website Categories (|𝕎| = 7):
- w_phs (Phishing)
- w_mwi (Malware Injection)
- w_frd (Fraud)
- w_hw (Hacked Websites)
- w_ift (Information Theft)
- w_rc (Remote Control)
- w_ma (Malicious Advertisement)

Data Sourcing: The assembly of malicious domains draws from blocklists and threat intelligence feeds such as FEODO Tracer, AbuseSSL, URLhaus, ThreatFox, Phishing Army, “The Big List of Hacked Malware Web Sites,” and FireHOL.

Template and URL Generation: Each scenario receives 3 hand-crafted attack templates targeting subdomain, path, or parameter fields, further expanded with GPT-4o into 50 variants per template. Clustering, manual validation, and mutation optimization yield 150 final templates—each mapped to 1,260 unique malicious domains. URLs are synthesized as $u=(u_s, u_d, u_p, u_a)$ over these templates and domains, filtered for diversity and LLM “acceptance” rates, and manually validated for syntactic correctness and reachability. The result is a deduplicated and validated benchmark of 61,845 attack URLs spanning a broad adversarial space.

2. Benchmark Tasks and Evaluation Metrics

MalURLBench is designed primarily for the binary detection task but naturally extends to classification and rewriting.

Task 1: URL Detection
- Input: $(p_s, u)$ (scenario prompt, URL)
- Output: $y \in \{0,1\}$ (accept/trust vs. reject)
Task 2: URL Classification (Extension)
- Output: One of $\{$ “phishing,” “malware,” …, “benign” $\}$
Task 3: URL Rewriting (Extension)
- Input: Malicious $u$ ; Output: Sanitized or canonicalized $u'$

Core Metrics:

Let TP, TN, FP, FN be, respectively, true positives, true negatives, false positives, and false negatives.

Accuracy: $\mathrm{Accuracy} = \frac{TP + TN}{TP+TN+FP+FN}$
False Positive Rate: $\mathrm{FPR} = \frac{FP}{FP + TN}$
False Negative Rate: $\mathrm{FNR} = \frac{FN}{FN + TP}$
Attack Success Rate (Risk Score): For model $\mathcal{M}$ on scenario $s$ :

$\mathcal{F}_s(\mathcal{M}) = \frac{1}{|\mathbb{U}|} \sum_{u \in \mathbb{U}} f_\mathcal{M}(p_s, u)$

Aggregate across scenarios:

$\mathcal{F}(\mathcal{M}) = \frac{1}{|\mathbb{S}|} \sum_{s \in \mathbb{S}} \mathcal{F}_s(\mathcal{M})$

Defense Effectiveness:

$\mathcal{E} = \mathcal{F}(\mathcal{M}) - \mathcal{F}(D(\mathcal{M}))$

3. Experimental Evaluation

A suite of 12 commercially and academically relevant LLMs is evaluated on MalURLBench, elucidating model-wise vulnerabilities and behavioral patterns.

Model	Architecture/Provider	$\mathcal{F}(\mathcal{M})$ (risk score)
GPT-3.5-turbo	OpenAI	0.329
GPT-4o	OpenAI (multimodal)	0.485
GPT-4o-mini	OpenAI (compact)	0.945
DeepSeek-Chat	DeepSeek-V3.1-Terminus	0.868
DeepSeek-Coder	DeepSeek	0.302
Qwen-plus	Alibaba Cloud	0.375
Mistral-7B	Mistral	0.546
Mistral-Small	Mistral (API)	0.976
Mixtral-8x7b	Mistral (MoE)	0.999
Llama2-7B-chat-hf	Meta	0.948
Llama-3-8B	Meta	0.627
Llama-3-70B	Meta	0.415

Key findings:

All LLMs exhibit non-trivial vulnerability; $\mathcal{F}(\mathcal{M})$ ranges from 32.9% (GPT-3.5-turbo) to 99.9% (Mixtral-8x7b).
Larger dense models exhibit enhanced robustness, with a negative Pearson correlation between $\log(\text{model size})$ and $\mathcal{F}(\mathcal{M})$ , $r \approx -0.85$ .
Mixture-of-Experts (MoE) architectures—Mixtral, DeepSeek-Chat—are comparatively fragile, likely due to suboptimal gating for URL-like inputs.
Scenario impact: s_wea (weather) is most easily exploited (82.9% acceptance); s_pkg (package tracking) and s_food (food delivery) prompt higher caution (≈60% acceptance).
Attack template distinction: “Inducing” (natural-language inducements in the URL) achieves 71.5% success; “imitation” (embedding legitimate-appearing domains) 60.9%.

4. Factors Influencing LLM Vulnerability

MalURLBench enables causal dissection of attack efficacy:

Attack Structure:

Subdomain, path, and parameter manipulation yield distinct acceptance rates; “inducing” templates (e.g., inclusion of linguistic markers such as “this-is-official”) outperform simple imitation (e.g., “www.google” in subdomain).

Prompts and Scenario Framing:

Scenarios indicative of low risk (e.g., weather) produce higher agent trust; prompts containing “official” or “trusted” further bias model acceptance.

Model Training and Architecture:

URL pattern representation is sparse in pretraining corpora, leading to shallow, brittle recognition. MoE models rely on semantic gating, which may misroute URL inputs to less expert submodules, reducing detection reliability.

Quantitative Correlation:

A moderate negative correlation is observed between subdomain length and acceptance:

$\rho_{\text{length},\text{ASR}} = \frac{\mathrm{Cov}(\|u_s\|, f_\mathcal{M})}{\sigma_{\|u_s\|}\sigma_{f}} \approx -0.42$

Sufficiently long subdomains (>20 characters) modestly decrease attack success.

5. URLGuard: Defense Module

To address identified vulnerabilities, a lightweight defense module—URLGuard—is proposed and evaluated.

Architecture:

URLGuard is an isolated LLM-based pre-filter, implemented with Llama2-7B-chat-hf, fine-tuned via QLoRA (4-bit NF4 quantization, LoRA adapters $r=16$ , $\alpha=32$ ).

Workflow:

A scenario-URL tuple $(p_s, u)$ is processed; responses labeled as “malicious” result in rejection, others are accepted. Training uses 280 manually labeled URL instances, balanced across benign and malicious variants.

Detection and Sanitization Logic:

function URLGuard_Filter(p_s, u):
    prompt ← format_system_prompt(p_s, u)
    response ← Llama2_7B_chat(prompt)
    if response contains “malicious” then
        return REJECT
    else
        return ACCEPT

Effectiveness:

Integration yields an average defense effectiveness $\overline{\mathcal{E}} \approx 0.81$ , with scenario-specific attack rates reduced by 30%–99% (e.g., s_wea: 0.82→0.00, s_new: 0.99→0.00).

6. Limitations and Directions for Advancement

MalURLBench currently evaluates only structural URL disguises—subdomain, path, and parameter mutation—excluding multimodal attacks (e.g., via QR codes or images) and dynamic threats (e.g., DNS hijacking, real-time URL synthesis). The threat model assumes static domain lists, and URLGuard’s robustness across novel scenarios and templates remains to be established. Future research avenues include multimodal URL embeddings, adversarial LLM-generated URLs, combined URL-content phishing, and unified benchmarks for detection, classification, and automated URL repair or sanitization.

Together, MalURLBench and URLGuard establish a principled methodology for quantifying and mitigating URL-based exploitation risk in LLM-centric web automation, with all code, data, and evaluation artifacts publicly released to enable reproducibility and further research (Kong et al., 26 Jan 2026).

Markdown Upgrade to Chat

References (1)

MalURLBench: A Benchmark Evaluating Agents' Vulnerabilities When Processing Web URLs (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to MalURLBench.