
SSMix: Saliency-Based Span Mixup for Text Classification (2106.08062v1)

Published 15 Jun 2021 in cs.CL, cs.AI, and cs.LG

Abstract: Data augmentation with mixup has been shown to be effective on various computer vision tasks. Despite its great success, there has been a hurdle in applying mixup to NLP tasks, since text consists of discrete tokens with variable length. In this work, we propose SSMix, a novel mixup method where the operation is performed on the input text rather than on hidden vectors, as in previous approaches. SSMix synthesizes a sentence while preserving the locality of the two original texts through span-based mixing, and keeps more of the tokens relevant to the prediction by relying on saliency information. With extensive experiments, we empirically validate that our method outperforms hidden-level mixup methods on a wide range of text classification benchmarks, including textual entailment, sentiment classification, and question-type classification. Our code is available at https://github.com/clovaai/ssmix.

Authors (3)
  1. Soyoung Yoon (8 papers)
  2. Gyuwan Kim (20 papers)
  3. Kyumin Park (7 papers)
Citations (60)

Summary

An Analysis of SSMix: A Saliency-Based Span Mixup Approach for Text Classification

This paper proposes SSMix, a novel data augmentation technique tailored for text classification in NLP. Mixup-based data augmentation has been broadly successful in computer vision, but its direct application to NLP is complicated by the discrete tokens and variable lengths of text sequences. SSMix departs from prior hidden-level approaches by performing the mixup operation directly on the input text rather than on hidden representations.

SSMix leverages saliency information to decide which spans to mix, aiming to preserve the tokens most relevant to the prediction. Concretely, it combines two original texts by replacing a low-saliency span of one with a high-saliency span of the other, where saliency is measured by the gradient magnitude of the classification loss with respect to the input embeddings, and the mixed label weights the two sources by the fraction of replaced tokens. This attention to salient features maintains the locality of the source texts while improving the semantic consistency of the synthesized examples.
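To make the mechanism concrete, the following is a minimal PyTorch-style sketch of saliency-based span mixup. It is not the authors' implementation: it assumes a HuggingFace-style classifier that exposes get_input_embeddings() and accepts inputs_embeds, and the function names, fixed span length, and tensor shapes are illustrative choices.

```python
import torch
import torch.nn.functional as F

def token_saliency(model, input_ids, label):
    """Per-token saliency: L2 norm of the loss gradient w.r.t. input embeddings."""
    embeds = model.get_input_embeddings()(input_ids)   # (1, seq_len, dim)
    embeds.retain_grad()
    logits = model(inputs_embeds=embeds).logits        # HF-style forward (assumption)
    F.cross_entropy(logits, label).backward()
    model.zero_grad()                                  # discard parameter gradients
    return embeds.grad.norm(dim=-1).squeeze(0)         # (seq_len,)

def ssmix_pair(model, ids_a, label_a, ids_b, label_b, span_len):
    """Replace the least salient span of text A with the most salient
    span of text B; labels are mixed by the fraction of replaced tokens."""
    sal_a = token_saliency(model, ids_a, label_a)
    sal_b = token_saliency(model, ids_b, label_b)
    # Score every contiguous span of length span_len by its total saliency.
    win_a = sal_a.unfold(0, span_len, 1).sum(dim=-1)
    win_b = sal_b.unfold(0, span_len, 1).sum(dim=-1)
    i = int(win_a.argmin())                 # least informative span in A
    j = int(win_b.argmax())                 # most informative span in B
    mixed_ids = ids_a.clone()
    mixed_ids[0, i:i + span_len] = ids_b[0, j:j + span_len]
    lam = span_len / ids_a.size(1)          # label weight for the B portion
    return mixed_ids, 1.0 - lam, lam        # weights for label_a, label_b
```

The sketch fixes the span length as a parameter for brevity; in practice it would be derived from a sampled mixup ratio, and the mixed label is the lam-weighted combination of the two source labels.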

The authors evaluate SSMix across a suite of established text classification benchmarks, covering sentiment classification, textual entailment, and question-type classification. The experiments show that SSMix consistently outperforms mixup techniques that operate at the hidden layer level, improving generalization and robustness, with particular strength on datasets with larger label sets and on paired-sentence tasks.

Key empirical findings indicate that SSMix's input-level mixing covers a broader synthetic data space than linear interpolation methods such as EmbedMix and TMix. The results suggest that SSMix is most effective when sufficient data is available and when tasks involve many class labels, since cross-label augmentation then contributes the most diversity. Furthermore, saliency-based span selection lets SSMix retain the text components most relevant to model predictions.
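For contrast, hidden-level methods such as TMix and EmbedMix mix examples by linear interpolation of continuous representations, so every synthetic point lies on the segment between the two originals. A minimal sketch of that idea follows (not the original implementations; the Beta-sampled ratio follows standard mixup practice):

```python
import torch

def hidden_level_mixup(h_a, h_b, y_a, y_b, alpha=0.2):
    """TMix/EmbedMix-style mixing: interpolate hidden states (or embeddings)
    and soft labels with a ratio sampled from a Beta distribution."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    h_mix = lam * h_a + (1.0 - lam) * h_b   # stays on the line between the two points
    y_mix = lam * y_a + (1.0 - lam) * y_b   # labels assumed one-hot / soft vectors
    return h_mix, y_mix
```

Because the interpolation is confined to that line segment, hidden-level mixing explores a narrower region than SSMix's discrete span replacement, which is the contrast the empirical findings above draw.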

SSMix's design is further probed through an ablation study that isolates the contributions of its two core components: saliency-based selection and the span restriction. The findings confirm that each element improves performance on its own and that they are complementary, underscoring the value of using saliency information for span selection and of maintaining span-level integrity in the mixup process.

In summary, SSMix represents a promising advancement in NLP data augmentation, successfully adapting input-level mixup to the discrete and structured nature of text data. By focusing on token saliency, SSMix produces meaningful augmented samples that address the intricacies of text classification. This work invites further exploration of saliency-aware data augmentation across other NLP tasks, including text generation and semi-supervised learning settings. Future research can extend SSMix to a broader array of models and architectures, paving the way for refined regularization techniques in NLP.
