Step by Step Loss Goes Very Far: Multi-Step Quantization for Adversarial Text Attacks
Published 10 Feb 2023 in cs.CL (arXiv:2302.05120v1)
Abstract: We propose a novel gradient-based attack against transformer-based LLMs that searches for an adversarial example in a continuous space of token probabilities. Our algorithm mitigates the gap between adversarial loss for continuous and discrete text representations by performing multi-step quantization in a quantization-compensation loop. Experiments show that our method significantly outperforms other approaches on various NLP tasks.
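The quantization-compensation idea from the abstract can be sketched as follows. This is a toy illustration, not the authors' implementation: we relax the token search into a continuous space of per-position token probabilities, optimize by gradient descent against a toy loss, then quantize positions one step at a time, re-optimizing the remaining free positions after each quantization to compensate the continuous-to-discrete loss gap. The embedding table, loss function, and scheduling heuristic here are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, L = 8, 4, 5            # toy vocab size, embedding dim, sequence length
E = rng.normal(size=(V, D))  # toy embedding table (assumption, not the paper's)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy adversarial objective: drive mixture embeddings toward a fixed target.
target = rng.normal(size=(L, D))

def loss(P):
    return float(np.sum((P @ E - target) ** 2))

def grad(P):
    return 2.0 * (P @ E - target) @ E.T

def softmax_backward(P, g):
    # Chain rule through the per-position softmax.
    return P * (g - (g * P).sum(axis=-1, keepdims=True))

# 1) Continuous phase: optimize logits Z over token probabilities.
Z = rng.normal(size=(L, V))
for _ in range(200):
    P = softmax(Z)
    Z -= 0.1 * softmax_backward(P, grad(P))

# 2) Multi-step quantization with compensation: fix one position at a
#    time to its most likely token, then re-optimize the still-free
#    positions so they absorb the loss gap introduced by quantizing.
fixed = {}
for _ in range(L):
    P = softmax(Z)
    free = [i for i in range(L) if i not in fixed]
    i = max(free, key=lambda j: P[j].max())  # quantize most confident position
    fixed[i] = int(P[i].argmax())
    Z[i] = -1e9
    Z[i, fixed[i]] = 1e9                     # pin to a one-hot distribution
    for _ in range(100):                     # compensation steps
        P = softmax(Z)
        gz = softmax_backward(P, grad(P))
        for j in free:
            if j != i:
                Z[j] -= 0.1 * gz[j]

tokens = [fixed[i] for i in range(L)]
final_loss = loss(softmax(Z))
print(tokens, round(final_loss, 3))
```

A one-shot argmax over the continuous solution would quantize all positions at once; the step-by-step loop above instead lets each quantization be partially compensated by the positions that are still continuous, which is the gap-mitigation idea the abstract describes.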