Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

Published 16 Apr 2026 in cs.CR, cs.AI, cs.CL, and cs.LG | (2604.15022v1)

Abstract: Cost-aware routing dynamically dispatches user queries to models of varying capability to balance performance and inference cost. However, the routing strategy introduces a new security concern that adversaries may manipulate the router to consistently select expensive high-capability models. Existing routing attacks depend on either white-box access or heuristic prompts, rendering them ineffective in real-world black-box scenarios. In this work, we propose R$^2$A, which aims to mislead black-box LLM routers to expensive models via adversarial suffix optimization. Specifically, R$^2$A deploys a hybrid ensemble surrogate router to mimic the black-box router. A suffix optimization algorithm is further adapted for the ensemble-based surrogate. Extensive experiments on multiple open-source and commercial routing systems demonstrate that {R$^2$A} significantly increases the routing rate to expensive models on queries of different distributions. Code and examples: https://github.com/thcxiker/R2A-Attack.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces the Route to Rome Attack, which uses adversarial suffix optimization to misdirect queries from weak to expensive LLM models.
It employs a hybrid ensemble surrogate router and Greedy Coordinate Gradient optimization to achieve attack success rates over 90% with a minimal query budget.
Experiments demonstrate that the attack can escalate inference costs by up to 2.9× and effectively transfer to closed-source commercial routers.

Route to Rome Attack: Adversarial Suffix Optimization for Directing LLM Routers to Expensive Models

Introduction

Cost-aware routing in LLM deployment facilitates a balance between computational cost and task performance by dynamically assigning user queries to models of varying capacity. Such routing systems optimize for efficiency, dispatching routine queries to weak models and reserving expensive models for complex tasks. This paradigm, increasingly prevalent in commercial deployments, introduces a novel and underestimated attack surface: the potential to steer router decisions toward more expensive models, thereby inflating operational cost without a concomitant quality gain. The "Route to Rome Attack" (R $^2$ A) formalizes and operationalizes this vector via adversarial suffix optimization, targeting the LLM routing layer in a realistic black-box setting (2604.15022).

Figure 1: Example of (a) intended cost-aware LLM routing and (b) rerouting attack performed by adversarial suffix optimization.

Problem Formulation

The LLM router $\mathcal{R}$ is defined as a mapping from a query $q$ to a model in the pool $\mathcal{M}$ , optimizing a trade-off between predicted answer quality and inference cost. The adversary’s objective is to induce $\mathcal{R}$ to select from $\mathcal{M}_\text{strong}$ (the expensive model subset) for queries that would be routed to $\mathcal{M}_\text{weak}$ in the clean case. The attack is realized by learning a universal adversarial suffix $s$ which, when appended to the query, maximizes the probability of strong-model routing.

Key constraints are:

Black-box access: Only router decisions are observable, not internals (logits, gradients, parameters).
Strict economic and query-count limitations.
The attack must produce query-independent triggers (universal suffix) with bounded token length, minimizing answer degradation.

Methodology: Hybrid Ensemble Surrogate Router and Suffix Optimization

The R $^2$ A attack proceeds in two major stages:

Hybrid Ensemble Surrogate Router Construction

Given the diversity of router architectures (ranging from embedding-based classifiers to LLM-based routers), surrogate training must accommodate architectural and mechanism heterogeneity without access to router implementation details.

Figure 2: Schematic of the hybrid ensemble surrogate router, showing ensemble of open-source routers and a lightweight trainable router for alignment.

The surrogate employs:

Open-source routers: Multiple released routing modules (distinct encoders, candidate pools, mechanisms) whose outputs are projected onto the target router's model pool using zero-padding and learned linear mapping.
Lightweight trainable router: Encodes queries with a compact transformer (MiniLM-L6-v2, $d=384$ ) and predicts router decisions via low-rank adaptation (LoRA), dramatically reducing sample complexity for training.
Joint weighted ensembling: The ensemble produces routing logits as a convex combination, learning optimal weights under black-box supervision.

Surrogate training is supervised by router decisions from a limited query budget, optimizing ensemble weights and projection matrices via cross-entropy minimization.

Adversarial Suffix Optimization

Adversarial suffix generation is conducted on the surrogate ensemble, which supplies end-to-end gradients for each suffix token. Due to encoder heterogeneity in the ensemble, per-router suffix token gradients are normalized (min-max) before aggregation, preventing dominance by any surrogate. The final gradient for each token is a weighted sum across surrogates.

Optimization proceeds by Greedy Coordinate Gradient (GCG): per iteration, top- $\mathcal{R}$ 0 tokens (by gradient score) are used to perturb suffix positions, generating a batch of candidates. The candidate with maximum attack gain (in expectation over a query batch) is selected, and queries are introduced incrementally as attack success is established.

Experimental Analysis

Attack Success Rate and Cost Escalation

The paper presents substantial improvements over prior attacks—including LifeCycle (both gradient-optimized and heuristic), rerouting attacks, and prompt-engineered triggers (e.g., CoT)—across a suite of routers (open-source and commercial) and six diverse datasets.

R $\mathcal{R}$ 1A increases average attack success rates (ASR) to over 0.90 in multiple settings, with up to 0.77 absolute gain on challenging routers (RouteLLM-BERT, OpenRouter, etc.).
High transferability is shown on out-of-distribution queries and closed-source commercial routers.
Direct measurement of the post-attack cost on OpenRouter shows a 2.7–2.9× increase in token-based billing (Figure 3).
Figure 3: Attack-induced escalation in inference cost when adversarially rerouted to stronger models.

Generalization to Commercial Black-box Routers

Attacks generated on open routers generalize seamlessly to proprietary aggregation routers (GPT-5-Auto and OpenRouter), even when routing decisions are hidden. Effectiveness is confirmed by:

Fingerprinting scores: Attacked outputs closely match those from strong models (e.g., high “Thinking-likeness” in GPT-5), (Figure 4).
Third-party LLM judgments: Consistent quality improvements attributable to stronger backend selection.
Figure 4: Distribution shift in strong-model fingerprint scores under attack, evidencing rerouting to high-capability models on commercial targets.

Sample Efficiency and Query Budget

Systematic analysis of query budget scaling reveals:

Surrogate agreement and attack efficacy (ASR) both saturate with $\mathcal{R}$ 2120 training queries (Figure 5).
The approach is economically viable, costing under $\mathcal{R}$ 3 for full surrogate training and suffixed attack deployment in current API pricing regimes.
Figure 5: Impact of query budget on surrogate training accuracy and ultimate attack success rate.

Robustness and Ablation

Whitespace-based defenses (SmoothLLM) do not fully neutralize the optimized suffix, which maintains substantial efficacy under input perturbation. Ablation studies confirm the necessity of both surrogate router ensembling (especially the trainable, low-rank component) and the custom gradient normalization protocol for transferable and robust attacks.

Practical and Theoretical Implications

Practical Concerns

Cost-aware routing, while improving deployment scalability, exposes unmitigated vectors for inference cost inflation and denial-of-wallet abuse.
Universal triggers, easily disseminated, can be used to intentionally corrupt router cost performance or evade resource-governing mechanisms in pay-per-token settings.
Current transparency practices (model pool and routing decision disclosure) are insufficient to preclude such attacks.

Theoretical Insights and Future Directions

Routing layers introduce a new class of adversarial vulnerability, orthogonal to prompt-based safety/jailbreak attacks, with unique cost-centric threat models.
Surrogate ensembling with lightweight trainable elements is validated as a general strategy for black-box model attack and model extraction scenarios.
Development of robust, attack-aware routers and anomaly detection mechanisms will become a critical area for the next generation of LLM meta-controllers, necessitating, for instance, explicit adversarial training against suffix attacks and fine-grained behavioral monitoring.

Conclusion

The "Route to Rome Attack" advances the study of LLM router security by demonstrating a practical, sample-efficient, and transferable attack against LLM router cost-aware deployments using universal adversarial suffixes. This work both surfaces a critical operational vulnerability and provides a methodology for its robust empirical evaluation, decisively shifting the research focus toward security-hardening the meta-level orchestration in large-model serving stacks.

Markdown Report Issue