An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification (2409.03203v2)
Abstract: Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored; moreover, existing textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with a language model. In the context of SC, strong emotional tokens can be critical to the sentiment of the whole sequence. Therefore, rather than rephrasing less important context, we propose DiffusionCLS, which leverages a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach balances consistency and diversity, avoiding the introduction of noise while augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios, including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework's modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
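To make the core idea concrete, below is a minimal sketch of the pseudo-sample generation scheme the abstract describes: mask the strongest label-related tokens in a sentence, then let a language model reconstruct them. Note the hedges: the paper's DiffusionCLS uses a diffusion LM with iterative denoising and a learned notion of label relevance; here a plain masked LM (`bert-base-uncased`) and a toy sentiment lexicon stand in, so `SENTIMENT_LEXICON`, `generate_pseudo_sample`, and `num_steps` are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch only: a masked LM approximates the diffusion LM's
# reconstruction step, and a toy lexicon approximates "strong label-related
# tokens". All names below are assumptions, not DiffusionCLS's actual API.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

# Toy stand-in for identifying strong label-related (sentiment) tokens.
SENTIMENT_LEXICON = {"great", "terrible", "love", "hate", "awful", "amazing"}

def generate_pseudo_sample(text: str, num_steps: int = 2) -> str:
    enc = tokenizer(text, return_tensors="pt")
    ids = enc["input_ids"].clone()
    tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())

    # Mask tokens judged label-related (here: simple lexicon hits).
    positions = [i for i, t in enumerate(tokens) if t in SENTIMENT_LEXICON]
    for i in positions:
        ids[0, i] = tokenizer.mask_token_id

    # Iteratively reconstruct masked tokens, a crude analogue of diffusion
    # denoising: at each step, commit the most confident masked prediction.
    for _ in range(num_steps * max(len(positions), 1)):
        masked = (ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        with torch.no_grad():
            logits = model(input_ids=ids).logits[0]
        probs = logits[masked].softmax(dim=-1)
        conf, pred = probs.max(dim=-1)
        best = conf.argmax()               # most confident masked position
        ids[0, masked[best]] = pred[best]  # commit its predicted token
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(generate_pseudo_sample("the service was terrible but i love the food"))
```

Because only the label-critical positions are resampled while the surrounding context is kept verbatim, each pseudo sample stays consistent with the source sentence yet varies exactly where the sentiment signal lives, which is the consistency-diversity balance the abstract argues for.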