RETSim: Resilient and Efficient Text Similarity (2311.17264v1)

Published 28 Nov 2023 in cs.CL

Abstract: This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings. RETSim and the W4NT3D benchmark are open-sourced under the MIT License at https://github.com/google/unisim.

Summary

  • The paper introduces RETSim, a deep learning model that enhances near-duplicate text detection by integrating a state-of-the-art vectorizer, multilingual support, and typo-augmented training.
  • It presents W4NT3D, a novel benchmark with 400k text pairs designed to rigorously test retrieval algorithms against adversarial typographical modifications.
  • Empirical results show RETSim’s superior efficiency and robustness compared to traditional n-gram methods, offering practical benefits for content deduplication and spam detection.

Introduction

Reliably detecting and clustering near-duplicate text has long been a pivotal challenge for numerous digital applications. Accurate near-duplicate detection is critical for tasks ranging from plagiarism detection to content deduplication in large text corpora. Traditional approaches combine MinHash with locality-sensitive hashing (LSH), but these techniques are sensitive to parameter settings and vulnerable to subtle variations in text, such as typographical errors.
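
For context, here is a minimal sketch of the MinHash technique the paper compares against (illustrative only; the shingle size, hash function, and signature length are assumptions, not the paper's settings):

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Character n-grams (shingles) of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text: str, num_hashes: int = 128) -> list:
    """For each of num_hashes seeded hash functions, record the minimum
    hash value over the text's shingle set."""
    grams = shingles(text)
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates the Jaccard
    similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jmups over the lazy dog")  # one transposition
print(estimated_jaccard(a, b))
```

Note how a single transposed character perturbs several overlapping shingles at once, which is precisely why shingle-based methods degrade under typographical noise.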

In contrast, vector-based semantic text retrieval powered by deep learning models has gained prominence due to its improved semantic understanding. Yet these models remain susceptible to adversarial text manipulations and carry a computational cost high enough to preclude their use in many large-scale applications.

RETSim: A Lightweight Solution

To address the limitations of existing methods, the paper introduces RETSim (Resilient and Efficient Text Similarity), a deep learning model engineered to be both robust and computationally efficient, producing neural embeddings specialized for detecting near-duplicate text. RETSim combines several components: the state-of-the-art RETVec text vectorizer, multilingual training, a typo-augmented training corpus, and refined metric learning techniques. This integration allows RETSim to achieve superior performance on near-duplicate text retrieval, clustering, and dataset deduplication tasks.
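
The released unisim package wraps RETSim, but rather than assume its exact API, the sketch below shows the general embedding-based retrieval pattern the paper relies on; the `embed` function is a hypothetical stand-in for the RETSim model:

```python
import numpy as np

def embed(texts: list) -> np.ndarray:
    """Hypothetical stand-in for a RETSim-style model: maps each text
    to an L2-normalized embedding (random here, so purely illustrative)."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 256)).astype(np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def near_duplicates(queries: list, corpus: list, threshold: float = 0.9) -> list:
    """Return, per query, the corpus entries whose cosine similarity
    exceeds the threshold. With normalized embeddings, cosine similarity
    reduces to a dot product."""
    q, c = embed(queries), embed(corpus)
    sims = q @ c.T  # (num_queries, num_corpus) similarity matrix
    return [
        [(corpus[j], float(sims[i, j]))
         for j in np.argsort(-sims[i]) if sims[i, j] >= threshold]
        for i in range(len(queries))
    ]

# With the dummy embeddings above nothing clears the threshold; a real
# model would map "hello worldd" and "hello world" close together.
print(near_duplicates(["hello worldd"], ["hello world", "goodbye"]))
```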

Novel Benchmark: W4NT3D

Alongside RETSim, the authors present a new benchmark named W4NT3D (Wiki-40B 4dversarial Near-T3xt Dataset). W4NT3D is designed to assess the effectiveness of text retrieval algorithms against adversarial manipulations in a multilingual setting. The benchmark comprises approximately 400k text pairs and measures the retrieval of syntactically similar near-duplicate texts under typographical modifications and other text distortions.
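
The exact perturbation mix in W4NT3D is defined by the benchmark itself; the sketch below merely illustrates the kind of character-level edits that adversarial near-duplicate pairs are built from (the edit set and rate are illustrative assumptions):

```python
import random

def perturb(text: str, rate: float = 0.05, seed: int = 42) -> str:
    """Apply random character-level edits (delete, swap, substitute,
    insert) at roughly `rate` of positions, yielding a near-duplicate."""
    rng = random.Random(seed)
    chars, out, i = list(text), [], 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(["delete", "swap", "substitute", "insert"])
            if op == "delete":
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])
                i += 2
                continue
            if op == "substitute":
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
                i += 1
                continue
            if op == "insert":
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
                # fall through so the original character is also emitted
        out.append(chars[i])
        i += 1
    return "".join(out)

print(perturb("Resilient and efficient text similarity"))
```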

Analysis and Applications

In empirical evaluations, RETSim outperforms traditional n-gram-based algorithms and maintains its robustness in multilingual settings, responding effectively to a wide variety of text modifications. It performs competitively on desktop GPUs, with further optimization for high-end GPUs anticipated, making it a promising approach for both text retrieval tasks and spam detection applications.
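
As a concrete application pattern, dataset deduplication can be framed as clustering items whose embedding similarity exceeds a threshold. Here is a minimal union-find sketch under that assumption (the threshold and the brute-force O(n^2) comparison are illustrative; a production pipeline would use approximate nearest-neighbor search):

```python
import numpy as np

def dedup_clusters(embeddings: np.ndarray, threshold: float = 0.9) -> list:
    """Group L2-normalized embeddings whose pairwise cosine similarity
    exceeds the threshold; returns one cluster id per item."""
    n = len(embeddings)
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    sims = embeddings @ embeddings.T  # cosine similarities (normalized inputs)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                parent[find(i)] = find(j)  # merge near-duplicates

    return [find(i) for i in range(n)]

# Keeping one representative per cluster id deduplicates the dataset.
```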

The creation of this model represents a significant step forward in the field of text similarity detection, providing a tool that strikes a balance between resilience to text modifications and operational efficiency. The authors' decision to open-source RETSim and the W4NT3D benchmark contributes to the accessibility of these advancements for both academic research and practical implementations.
