RETSim: Resilient and Efficient Text Similarity (2311.17264v1)
Abstract: This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings. RETSim and the W4NT3D benchmark are open-sourced under the MIT License at https://github.com/google/unisim.
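The core workflow the abstract describes is: embed each text into a metric space, then treat any pair whose embeddings are within a similarity threshold as near-duplicates. Below is a minimal, self-contained sketch of that workflow. Note the assumptions: `toy_embed` is an illustrative character-trigram hashing embedding, not the RETSim model (in practice you would swap in RETSim embeddings from the open-source unisim package linked above), and the similarity thresholds are example values, not numbers from the paper.

```python
# Sketch of embedding-based near-duplicate retrieval: embed texts into a
# metric space, then flag pairs above a cosine-similarity threshold.
import numpy as np

def toy_embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in embedding: hashed bag of character trigrams, L2-normalized.
    A real system would replace this with a trained model such as RETSim."""
    v = np.zeros(dim)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        v[hash(padded[i:i + 3]) % dim] += 1.0  # hash trigram into a bucket
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def near_duplicates(corpus: list[str], query: str, threshold: float = 0.9):
    """Return (index, score) for corpus entries whose embedding lies within
    `threshold` cosine similarity of the query's embedding."""
    vecs = np.stack([toy_embed(t) for t in corpus])  # (n, d), unit-norm rows
    q = toy_embed(query)                             # (d,)
    scores = vecs @ q                                # dot product = cosine sim
    return [(i, float(s)) for i, s in enumerate(scores) if s >= threshold]

corpus = [
    "near-duplicate text retrieval",
    "near-dupl1cate text retrieva1",   # adversarially perturbed copy
    "an unrelated sentence about spam",
]
print(near_duplicates(corpus, "near-duplicate text retrieval", threshold=0.8))
```

Even this toy embedding tolerates small character-level edits better than exact matching, which is the property RETSim's trained embeddings are meant to provide robustly; at scale, the linear scan over `vecs` would be replaced by an approximate nearest-neighbor index.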