Generative Deduplication For Social Media Data Selection (2401.05883v3)
Abstract: Social media data exhibits severe redundancy due to its noisy nature, which increases training time and introduces model bias during processing. To address this issue, we propose a novel Generative Deduplication framework for social media data selection that removes semantically duplicate data. Whereas related work performs data selection within task-specific training, our model acts as an efficient pre-processing step that universally enhances social media NLP pipelines. Specifically, we train a generative model via self-supervised learning to predict a keyword that captures the semantics of noisy social media text for deduplication. Meanwhile, time-dimensional Gaussian noise is added to raise the training difficulty and prevent the model from learning trivial features. Extensive experiments suggest that our model reduces training samples more effectively than baselines while improving performance. The results demonstrate our model's potential to broadly advance social media language understanding in both effectiveness and efficiency.
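To make the described pipeline concrete, below is a minimal Python sketch of the idea: a generative model is briefly fine-tuned to predict one keyword per post, and posts whose recurring keyword the model reproduces are treated as semantic duplicates. This is an illustrative sketch only; the T5-small generator, KeyBERT keyword targets, noise scale, training schedule, and duplicate criterion are assumptions standing in for the paper's exact recipe.

```python
# Minimal sketch of generative deduplication (illustrative assumptions:
# T5-small generator, KeyBERT keyword targets, simplified noise injection).
import torch
from keybert import KeyBERT
from transformers import T5ForConditionalGeneration, T5Tokenizer

posts = [
    "Breaking: heavy rain floods downtown streets again",
    "Downtown streets flooded once more after heavy rain",
    "New cafe opening on 5th avenue this weekend",
]

# 1) Self-supervised targets: one keyword per post (assumed choice: KeyBERT).
kw_model = KeyBERT()
keywords = [kw_model.extract_keywords(p, top_n=1)[0][0] for p in posts]

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# 2) Brief training: recurring (duplicate) semantics are memorized faster than
#    one-off content. Gaussian noise on the encoder input embeddings raises the
#    training difficulty so trivial surface features are not enough
#    (a simplified stand-in for the paper's time-dimensional noise).
model.train()
for _ in range(3):  # only a few passes, by design
    for post, kw in zip(posts, keywords):
        enc = tokenizer(post, return_tensors="pt")
        labels = tokenizer(kw, return_tensors="pt").input_ids
        embeds = model.shared(enc.input_ids)
        embeds = embeds + 0.1 * torch.randn_like(embeds)  # noise injection
        loss = model(inputs_embeds=embeds,
                     attention_mask=enc.attention_mask,
                     labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# 3) Deduplication: a post is flagged as a semantic duplicate when the trained
#    model reproduces its keyword and that keyword was already seen; the first
#    occurrence per keyword is kept as the representative.
model.eval()
seen, kept = set(), []
for post, kw in zip(posts, keywords):
    enc = tokenizer(post, return_tensors="pt")
    pred = tokenizer.decode(model.generate(**enc, max_new_tokens=8)[0],
                            skip_special_tokens=True).strip().lower()
    if pred == kw.lower() and kw.lower() in seen:
        continue  # semantically duplicate post, remove it
    seen.add(kw.lower())
    kept.append(post)

print(kept)  # deduplicated subset passed on to downstream training
```

In this toy run the second post is expected to be dropped once the model memorizes the shared keyword, while the unrelated third post survives; in practice the selection quality hinges on the noise level and the very limited number of training passes.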