- The paper introduces a method that strategically modifies key words, causing classifier accuracy to drop significantly (e.g., from 74.53% to 32.55% on IMDB reviews).
- The approach draws on candidate pools of synonyms, typos, and genre-specific keywords to preserve syntactic and semantic integrity in the adversarial examples.
- Experimental results on IMDB and Twitter datasets demonstrate the vulnerability of text classifiers and highlight improved resilience when retrained with adversarial samples.
Overview of Adversarial Text Sample Crafting
The paper "Towards Crafting Text Adversarial Samples" by Suranjana Samanta and Sameep Mehta addresses a relatively underexplored yet significant facet of adversarial machine learning: text data. It proposes a methodology for creating adversarial text samples aimed at misleading classifiers, with a particular focus on models used for sentiment analysis and gender detection. The research emphasizes preserving syntactic and semantic integrity through the modifications, so that the adversarial samples remain inconspicuous to humans while successfully confusing machine learning classifiers.
The authors acknowledge that adversarial attacks have predominantly been explored in the domain of image processing, where the continuous nature of image pixel values permits subtle perturbations. In contrast, text data, characterized by its discrete nature, poses unique challenges. Words cannot be modified or synthesized as flexibly as image pixels without jeopardizing comprehensibility or grammatical correctness. Thus, the paper's proposed method focuses on strategic modifications such as the insertion, deletion, or replacement of salient words, ensuring that generated adversarial samples maintain meaning and readability.
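To make the idea of "salient words" concrete, here is a minimal leave-one-out sketch: rank words by how much the predicted class probability drops when each word is deleted, a simple proxy for the contribution measures the paper describes. The classifier and sentiment scores below are invented toys, not the authors' models.

```python
def saliency_ranking(words, class_prob):
    """Sort words by importance: the drop in predicted class
    probability when the word is removed from the text."""
    base = class_prob(words)
    scored = []
    for i, w in enumerate(words):
        reduced = words[:i] + words[i + 1:]
        scored.append((base - class_prob(reduced), w))
    return [w for _, w in sorted(scored, reverse=True)]

# Toy stand-in classifier: positive-class probability grows
# with the sentiment-bearing words present (invented scores).
POSITIVE = {"great": 0.4, "wonderful": 0.3}

def toy_prob(words):
    return min(1.0, 0.1 + sum(POSITIVE.get(w, 0.0) for w in words))

ranked = saliency_ranking("a great and wonderful movie".split(), toy_prob)
print(ranked[:2])  # ['great', 'wonderful'] — sentiment words rank first
```

Words at the top of this ranking are the natural targets for insertion, deletion, or replacement, since changing them moves the classifier's output the most.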
Significantly, the paper leverages the concept of a candidate pool containing synonyms, typos, and genre-specific keywords to introduce modifications that alter the classifier's perceived class of the text. The authors discuss methods for estimating the contribution of individual words to class predictions, using both gradient-based approaches and semantic contribution analyses to prioritize words for modification. Importantly, they highlight that their approach is specifically beneficial for datasets featuring sub-categories within class labels, using genres in movie reviews as a demonstrative example.
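The replacement step can be sketched as a greedy loop over word positions: at each position, try the candidates from the pool and keep whichever swap most lowers the predicted class probability, stopping once the label flips. This is an illustrative simplification, not the authors' exact algorithm; the candidate pool and the toy classifier below are invented for the example.

```python
# Hypothetical candidate pool: synonym and typo stand-ins.
CANDIDATES = {
    "great": ["grand", "gr8"],
    "wonderful": ["wonderfull"],  # deliberate misspelling
}

# Toy stand-in classifier (same invented scores as above):
# positive-class probability grows with sentiment-bearing words.
POSITIVE = {"great": 0.4, "wonderful": 0.3}

def toy_prob(words):
    return min(1.0, 0.1 + sum(POSITIVE.get(w, 0.0) for w in words))

def craft_adversarial(words, class_prob, threshold=0.5):
    """Greedily swap words for pool candidates until the
    predicted class probability crosses the decision threshold."""
    words = list(words)
    for i in range(len(words)):
        if class_prob(words) < threshold:
            break  # predicted label already flipped
        best, best_p = words[i], class_prob(words)
        for cand in CANDIDATES.get(words[i], []):
            trial = words[:i] + [cand] + words[i + 1:]
            if class_prob(trial) < best_p:
                best, best_p = cand, class_prob(trial)
        words[i] = best
    return words

adv = craft_adversarial("a great and wonderful movie".split(), toy_prob)
# Swapping "great" for "grand" alone flips the toy label, so the
# loop stops before touching "wonderful" — minimal modification.
```

Stopping at the first label flip mirrors the paper's goal of keeping changes small enough that the text still reads naturally to a human.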
The evaluation of the proposed adversarial text crafting approach uses two datasets: the IMDB movie reviews dataset for sentiment analysis, and a Twitter dataset for gender classification. The results show a notable reduction in classification accuracy when the original models are tested on adversarial samples: on the IMDB dataset, accuracy fell from 74.53% to 32.55% on adversarial texts crafted with genre-specific keywords. When retrained with adversarial samples, classifiers showed improved resilience, indicating the method's potential to fortify models against such attacks.
The implications of this research are multifaceted. Practically, it emphasizes the vulnerability of text-based classifiers to adversarial attacks, underscoring the need for robust defenses in real-world applications such as sentiment analysis and identity detection in social media. Theoretically, it pushes the boundary of adversarial machine learning in text data, suggesting avenues for more nuanced adversarial attack models that consider linguistic intricacies. Future directions might involve refining the heuristics for word modification and exploring automated techniques to streamline adversarial crafting without extensive manual input.
In summary, this work contributes to the body of knowledge on adversarial machine learning by adapting attack techniques to text data, a domain whose discrete structure poses challenges distinct from those of images. As machine learning models are increasingly deployed in versatile and critical applications, enhancing their robustness against adversarial attacks remains a priority, and the insights from this paper could prove instrumental.