Transferability of adversarial batch effects across prompts from the same distribution

Determine whether inputs sampled from the same distribution as a targeted prompt x* share sufficiently similar expert routing assignments in Mixture-of-Experts (MoE) transformer models that an adversarially optimized set of batch inputs, constructed against x*, is more likely to alter their outputs when they are placed in the same batch.

Background

The paper demonstrates a proof-of-concept batch-level attack on Mixture-of-Experts (MoE) transformers that exploits finite per-expert buffer capacities and batch-order-dependent routing. The attacker optimizes a set of adversarial batch inputs to fill the expert buffers preferred by a target input x*, thereby altering its routing and output.
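
The following is a minimal sketch of the mechanism being exploited, assuming top-1 routing with a fixed per-expert capacity and tokens placed in batch order; the helper name `route_batch`, the `capacity` value, and the toy router logits are illustrative assumptions, not the paper's implementation.

```python
# Capacity-limited routing processed in batch order: earlier inputs can fill an
# expert's buffer and displace later ones, which is what the adversarial batch exploits.
import torch

def route_batch(router_logits: torch.Tensor, capacity: int) -> list[int]:
    """Assign each token (in batch order) to its highest-scoring expert that
    still has buffer space; return the chosen expert index per token (-1 = dropped)."""
    num_tokens, num_experts = router_logits.shape
    load = [0] * num_experts          # tokens already placed in each expert buffer
    assignment = []
    for t in range(num_tokens):
        # Experts in this token's order of router preference.
        for e in torch.argsort(router_logits[t], descending=True).tolist():
            if load[e] < capacity:
                load[e] += 1
                assignment.append(e)
                break
        else:
            assignment.append(-1)     # every buffer is full
    return assignment

torch.manual_seed(0)
num_experts, capacity = 4, 2
target_logits = torch.randn(1, num_experts)           # router scores for x*
benign = torch.randn(3, num_experts)                  # benign batch peers
# Adversarial peers crafted to prefer the same expert as x*, filling its buffer first.
adversarial = target_logits.repeat(3, 1) + 0.01 * torch.randn(3, num_experts)

clean = route_batch(torch.cat([benign, target_logits]), capacity)
attacked = route_batch(torch.cat([adversarial, target_logits]), capacity)
print("x* expert (benign batch):     ", clean[-1])
print("x* expert (adversarial batch):", attacked[-1])  # typically displaced to a fallback expert
```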

To assess whether such adversarial inputs generalize beyond the single target, the authors examine transfer to other arithmetic prompts. They explicitly conjecture that prompts drawn from the same distribution as the target may share similar expert routing, and thus be more susceptible to the adversarially optimized batch. This raises a concrete question about the conditions under which adversarial batch effects transfer across inputs due to routing similarity in MoE models.
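
One way this routing-similarity conjecture could be probed is sketched below, under the assumption that per-prompt router logits can be extracted from the model: compare the sets of top-k experts selected by x* and by another prompt, e.g. via Jaccard overlap. The `routing_overlap` helper and the synthetic logits are assumptions for illustration, not the authors' methodology.

```python
# Hedged sketch: quantify routing similarity between x* and other prompts as the
# Jaccard overlap of the expert sets their tokens select.
import torch

def top_k_experts(router_logits: torch.Tensor, k: int) -> set[int]:
    """Union of the top-k expert indices chosen across a prompt's tokens."""
    return set(torch.topk(router_logits, k, dim=-1).indices.flatten().tolist())

def routing_overlap(logits_a: torch.Tensor, logits_b: torch.Tensor, k: int = 2) -> float:
    """Jaccard similarity of the expert sets used by two prompts."""
    a, b = top_k_experts(logits_a, k), top_k_experts(logits_b, k)
    return len(a & b) / len(a | b)

torch.manual_seed(0)
num_experts = 8
# Synthetic stand-ins: same-distribution prompts share a routing "signature".
signature   = torch.randn(1, num_experts)
target      = signature + 0.1 * torch.randn(5, num_experts)   # x*, 5 tokens
same_dist   = signature + 0.1 * torch.randn(5, num_experts)   # e.g. another arithmetic prompt
out_of_dist = torch.randn(5, num_experts)                      # unrelated prompt

print("overlap(x*, same distribution):  ", routing_overlap(target, same_dist))
print("overlap(x*, out of distribution):", routing_overlap(target, out_of_dist))
```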

References

We conjecture that if these other data-points are sampled from the same distribution as $x^*$, they are more likely to have similar expert routing assignments, and so are more likely to be affected by $\tilde{X}_{\mathcal{A}}$, which was optimized specifically for $x^*$.

Buffer Overflow in Mixture of Experts (2402.05526 - Hayes et al., 8 Feb 2024) in Section 4: Anecdotal evidence of transferability to different prompts