MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification
The paper introduces MixText, a semi-supervised text classification method built around a new data augmentation technique, TMix. TMix interpolates the hidden representations of text samples to create large numbers of augmented training examples, allowing the model to bridge labeled, unlabeled, and augmented data within a single framework. MixText additionally guesses low-entropy labels for unlabeled data and then treats those samples comparably to labeled ones during training. Empirically, MixText outperforms pre-trained and fine-tuned models as well as other state-of-the-art semi-supervised techniques, with the largest gains when supervision is extremely scarce.
Key Contributions
- TMix Data Augmentation: TMix adapts the Mixup technique from computer vision to text. By linearly interpolating the hidden states of two text samples at a chosen encoder layer, TMix creates new training points that act as a regularizer and help prevent overfitting, especially when few labeled examples are available.
- Semi-Supervised Framework: MixText uses TMix to model relationships between labeled and unlabeled data. Each unlabeled sample receives a guessed soft label, computed as a weighted average of the model's predictions on the original sample and its augmentations and then sharpened to lower entropy and raise confidence (a minimal sketch follows this list).
- Performance Evaluation: Experiments show MixText outperforming baselines such as fine-tuned BERT and UDA across multiple datasets, including AG News and IMDB, with the gains most pronounced when only ten labeled examples per class are available.
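The label-guessing step described above can be sketched as follows. This is a minimal illustration, assuming predictions are averaged over an unlabeled sample and its augmentations (back-translations in the paper) and then sharpened with a temperature; the function name `guess_labels` and the parameters `weights` and `T` are illustrative rather than the paper's exact interface.

```python
import torch
import torch.nn.functional as F

def guess_labels(model, x_orig, x_augs, weights, T=0.5):
    """Weighted-average the model's class probabilities over an unlabeled
    sample and its augmentations, then sharpen the result so the guessed
    label has low entropy. `weights` holds one weight per input (original
    first) and should sum to 1."""
    with torch.no_grad():
        inputs = [x_orig] + list(x_augs)
        probs = [F.softmax(model(x), dim=-1) for x in inputs]
        avg = sum(w * p for w, p in zip(weights, probs))
        sharpened = avg ** (1.0 / T)  # T < 1 pushes probability mass toward the top class
        return sharpened / sharpened.sum(dim=-1, keepdim=True)
```

The sharpened distribution then serves as a soft target, so unlabeled examples can enter the TMix interpolation alongside labeled ones.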
Numerical Results
MixText consistently achieves higher test accuracy than the other models. For example, with only 10 labeled examples per class on the AG News dataset, MixText reaches 88.4% accuracy, notably outperforming UDA at 84.4%. Similar trends are observed across the other datasets.
Methodology
The TMix method interpolates hidden representations from a BERT encoder at a layer drawn from a selected set (layers 7, 9, and 12), which prior analyses suggest capture a mix of syntactic and semantic information; the interpolation weight is sampled from a Beta distribution. The interpolated representations, paired with correspondingly mixed labels, serve as new training data that help the model generalize from limited labeled datasets.
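The interpolation itself can be sketched as below. This assumes a HuggingFace-style `BertModel` whose `embeddings` and `encoder.layer` attributes are used directly; the helper name `tmix_forward`, the attribute access pattern, and the default `alpha` are illustrative assumptions rather than the authors' exact implementation (the sketch also omits attention masks for brevity).

```python
import random
import torch

def tmix_forward(bert, ids_a, ids_b, mix_layers=(7, 9, 12), alpha=0.75):
    """Encode two samples separately up to a randomly chosen mixing layer,
    interpolate their hidden states there, then finish the forward pass on
    the mixture. Returns the mixed hidden states and the mixing weight,
    which is also used to mix the two samples' labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    m = random.choice(mix_layers)

    h_a = bert.embeddings(ids_a)
    h_b = bert.embeddings(ids_b)
    for layer in bert.encoder.layer[:m]:   # lower layers: encode each sample on its own
        h_a = layer(h_a)[0]
        h_b = layer(h_b)[0]

    h = lam * h_a + (1 - lam) * h_b        # interpolate hidden states at layer m
    for layer in bert.encoder.layer[m:]:   # upper layers: encode the mixture
        h = layer(h)[0]
    return h, lam                          # mixed label: lam * y_a + (1 - lam) * y_b
```

A classification head on top of the mixed representation is then trained against the correspondingly mixed label, which is what makes TMix act as an interpolation-based regularizer.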
Implications and Future Directions
The successful application of interpolation-based regularizers such as TMix in the text domain opens new avenues for augmentation techniques in NLP. The framework's ability to incorporate unlabeled data effectively suggests potential for broader applications, including sequence labeling tasks.
Future work could explore MixText's adaptability to different NLP tasks and evaluate its impact on other model architectures. Additionally, extending the technique to multilingual contexts could enhance its utility in diverse linguistic settings.
In summary, the paper presents MixText as a robust enhancement in semi-supervised learning for text classification, demonstrating the benefits of hidden state interpolation to leverage both labeled and unlabeled data more effectively.