MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification
The paper introduces MixText, a semi-supervised text classification method built around a new data augmentation technique, TMix. TMix interpolates the hidden representations of text samples to create large numbers of augmented training examples, allowing the model to bridge labeled, unlabeled, and augmented data within a single framework. MixText additionally guesses low-entropy labels for unlabeled data and then treats those samples comparably to labeled ones during training. Empirically, MixText outperforms pre-trained and fine-tuned models as well as other state-of-the-art semi-supervised techniques, with the largest gains when supervision is extremely scarce.
Key Contributions
- TMix Data Augmentation: TMix adapts the Mixup technique from computer vision to text. By linearly interpolating the hidden states of two text samples at a chosen encoder layer, TMix creates new training points that act as a regularizer and help prevent overfitting, especially when few labeled examples are available.
- Semi-Supervised Framework: MixText uses TMix to model relationships between labeled and unlabeled data. Each unlabeled sample receives a guessed soft label, computed as a weighted average of the model's predictions on the original sample and its augmentations and then sharpened to lower entropy and raise confidence (a minimal sketch follows this list).
- Performance Evaluation: Experiments show MixText outperforming baselines such as fine-tuned BERT and UDA across multiple datasets, including AG News and IMDB, with the gains most pronounced when only ten labeled examples per class are available.
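The label-guessing step described above can be sketched as follows. This is a minimal illustration, assuming predictions are averaged over an unlabeled sample and its augmentations (back-translations in the paper) and then sharpened with a temperature; the function name `guess_labels` and the parameters `weights` and `T` are illustrative rather than the paper's exact interface.

```python
import torch
import torch.nn.functional as F

def guess_labels(model, x_orig, x_augs, weights, T=0.5):
    """Weighted-average the model's class probabilities over an unlabeled
    sample and its augmentations, then sharpen the result so the guessed
    label has low entropy. `weights` holds one weight per input (original
    first) and should sum to 1."""
    with torch.no_grad():
        inputs = [x_orig] + list(x_augs)
        probs = [F.softmax(model(x), dim=-1) for x in inputs]
        avg = sum(w * p for w, p in zip(weights, probs))
        sharpened = avg ** (1.0 / T)  # T < 1 pushes probability mass toward the top class
        return sharpened / sharpened.sum(dim=-1, keepdim=True)
```

The sharpened distribution then serves as a soft target, so unlabeled examples can enter the TMix interpolation alongside labeled ones.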
Numerical Results
MixText consistently achieves higher test accuracy than the other models. For example, with only 10 labeled examples per class on the AG News dataset, MixText reaches 88.4% accuracy, notably outperforming UDA at 84.4%. Similar trends are observed across the other datasets.
Methodology
The TMix method interpolates hidden representations from a BERT encoder at a layer drawn from a selected set (layers 7, 9, and 12), which prior analyses suggest capture a mix of syntactic and semantic information; the interpolation weight is sampled from a Beta distribution. The interpolated representations, paired with correspondingly mixed labels, serve as new training data that help the model generalize from limited labeled datasets.
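The interpolation itself can be sketched as below. This assumes a HuggingFace-style `BertModel` whose `embeddings` and `encoder.layer` attributes are used directly; the helper name `tmix_forward`, the attribute access pattern, and the default `alpha` are illustrative assumptions rather than the authors' exact implementation (the sketch also omits attention masks for brevity).

```python
import random
import torch

def tmix_forward(bert, ids_a, ids_b, mix_layers=(7, 9, 12), alpha=0.75):
    """Encode two samples separately up to a randomly chosen mixing layer,
    interpolate their hidden states there, then finish the forward pass on
    the mixture. Returns the mixed hidden states and the mixing weight,
    which is also used to mix the two samples' labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    m = random.choice(mix_layers)

    h_a = bert.embeddings(ids_a)
    h_b = bert.embeddings(ids_b)
    for layer in bert.encoder.layer[:m]:   # lower layers: encode each sample on its own
        h_a = layer(h_a)[0]
        h_b = layer(h_b)[0]

    h = lam * h_a + (1 - lam) * h_b        # interpolate hidden states at layer m
    for layer in bert.encoder.layer[m:]:   # upper layers: encode the mixture
        h = layer(h)[0]
    return h, lam                          # mixed label: lam * y_a + (1 - lam) * y_b
```

A classification head on top of the mixed representation is then trained against the correspondingly mixed label, which is what makes TMix act as an interpolation-based regularizer.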
Implications and Future Directions
The successful application of interpolation-based regularizers such as TMix in the text domain opens new avenues for augmentation techniques in NLP. The framework's ability to incorporate unlabeled data effectively suggests potential for broader applications, including sequence labeling tasks.
Future work could explore MixText's adaptability to different NLP tasks and evaluate its impact on other model architectures. Additionally, extending the technique to multilingual contexts could enhance its utility in diverse linguistic settings.
In summary, the paper presents MixText as a robust enhancement in semi-supervised learning for text classification, demonstrating the benefits of hidden state interpolation to leverage both labeled and unlabeled data more effectively.