Generative Deduplication For Social Media Data Selection (2401.05883v3)
Abstract: Social media data exhibits severe redundancy due to its noisy nature, which increases training time and introduces model bias during processing. To address this issue, we propose a novel Generative Deduplication framework for social media data selection that removes semantically duplicate data. Whereas related work performs data selection within task-specific training, our model acts as an efficient pre-processing step that universally enhances social media NLP pipelines. Specifically, we train a generative model via self-supervised learning to predict a keyword that captures the semantics of noisy social media text for deduplication. Meanwhile, time-dimensional Gaussian noise is added to raise the training difficulty and prevent the model from learning trivial features. Extensive experiments suggest that our model reduces training samples more effectively than baselines while improving performance. The results demonstrate our model's potential to broadly advance social media language understanding in both effectiveness and efficiency.
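To make the described pipeline concrete, below is a minimal Python sketch of the idea: a generative model is briefly fine-tuned to predict one keyword per post, and posts whose recurring keyword the model reproduces are treated as semantic duplicates. This is an illustrative sketch only; the T5-small generator, KeyBERT keyword targets, noise scale, training schedule, and duplicate criterion are assumptions standing in for the paper's exact recipe.

```python
# Minimal sketch of generative deduplication (illustrative assumptions:
# T5-small generator, KeyBERT keyword targets, simplified noise injection).
import torch
from keybert import KeyBERT
from transformers import T5ForConditionalGeneration, T5Tokenizer

posts = [
    "Breaking: heavy rain floods downtown streets again",
    "Downtown streets flooded once more after heavy rain",
    "New cafe opening on 5th avenue this weekend",
]

# 1) Self-supervised targets: one keyword per post (assumed choice: KeyBERT).
kw_model = KeyBERT()
keywords = [kw_model.extract_keywords(p, top_n=1)[0][0] for p in posts]

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# 2) Brief training: recurring (duplicate) semantics are memorized faster than
#    one-off content. Gaussian noise on the encoder input embeddings raises the
#    training difficulty so trivial surface features are not enough
#    (a simplified stand-in for the paper's time-dimensional noise).
model.train()
for _ in range(3):  # only a few passes, by design
    for post, kw in zip(posts, keywords):
        enc = tokenizer(post, return_tensors="pt")
        labels = tokenizer(kw, return_tensors="pt").input_ids
        embeds = model.shared(enc.input_ids)
        embeds = embeds + 0.1 * torch.randn_like(embeds)  # noise injection
        loss = model(inputs_embeds=embeds,
                     attention_mask=enc.attention_mask,
                     labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# 3) Deduplication: a post is flagged as a semantic duplicate when the trained
#    model reproduces its keyword and that keyword was already seen; the first
#    occurrence per keyword is kept as the representative.
model.eval()
seen, kept = set(), []
for post, kw in zip(posts, keywords):
    enc = tokenizer(post, return_tensors="pt")
    pred = tokenizer.decode(model.generate(**enc, max_new_tokens=8)[0],
                            skip_special_tokens=True).strip().lower()
    if pred == kw.lower() and kw.lower() in seen:
        continue  # semantically duplicate post, remove it
    seen.add(kw.lower())
    kept.append(post)

print(kept)  # deduplicated subset passed on to downstream training
```

In this toy run the second post is expected to be dropped once the model memorizes the shared keyword, while the unrelated third post survives; in practice the selection quality hinges on the noise level and the very limited number of training passes.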