RETSim: Resilient and Efficient Text Similarity (2311.17264v1)

Published 28 Nov 2023 in cs.CL

Abstract: This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings. RETSim and the W4NT3D benchmark are open-sourced under the MIT License at https://github.com/google/unisim.

Summary

  • The paper introduces RETSim, a deep learning model that enhances near-duplicate text detection by integrating a state-of-the-art vectorizer, multilingual support, and typo-augmented training.
  • It presents W4NT3D, a novel benchmark with 400k text pairs designed to rigorously test retrieval algorithms against adversarial typographical modifications.
  • Empirical results show RETSim’s superior efficiency and robustness compared to traditional n-gram methods, offering practical benefits for content deduplication and spam detection.

Introduction

Reliably detecting and clustering near-duplicate text has long been a pivotal challenge for numerous digital applications. Accurate near-duplicate detection is critical for tasks ranging from plagiarism detection to content deduplication in large text corpora. Traditional approaches combine MinHash with locality-sensitive hashing (LSH), but these techniques are sensitive to parameter settings and vulnerable to subtle variations in text, such as typographical errors.
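
For context, here is a minimal sketch of the MinHash technique the paper compares against (illustrative only; the shingle size, hash function, and signature length are assumptions, not the paper's settings):

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Character n-grams (shingles) of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text: str, num_hashes: int = 128) -> list:
    """For each of num_hashes seeded hash functions, record the minimum
    hash value over the text's shingle set."""
    grams = shingles(text)
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates the Jaccard
    similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jmups over the lazy dog")  # one transposition
print(estimated_jaccard(a, b))
```

Note how a single transposed character perturbs several overlapping shingles at once, which is precisely why shingle-based methods degrade under typographical noise.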

In contrast, vector-based semantic text retrieval powered by deep learning models has gained prominence due to its improved semantic understanding. Yet these models remain susceptible to adversarial text manipulations and carry a computational cost high enough to preclude their use in many large-scale applications.

RETSim: A Lightweight Solution

To address the limitations of existing methods, the paper introduces RETSim (Resilient and Efficient Text Similarity), a deep learning model engineered to be both robust and computationally efficient, producing neural embeddings specialized for detecting near-duplicate text. RETSim combines several components: the state-of-the-art RETVec text vectorizer, multilingual training, a typo-augmented training corpus, and refined metric learning techniques. This integration allows RETSim to achieve superior performance on near-duplicate text retrieval, clustering, and dataset deduplication tasks.
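
The released unisim package wraps RETSim, but rather than assume its exact API, the sketch below shows the general embedding-based retrieval pattern the paper relies on; the `embed` function is a hypothetical stand-in for the RETSim model:

```python
import numpy as np

def embed(texts: list) -> np.ndarray:
    """Hypothetical stand-in for a RETSim-style model: maps each text
    to an L2-normalized embedding (random here, so purely illustrative)."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 256)).astype(np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def near_duplicates(queries: list, corpus: list, threshold: float = 0.9) -> list:
    """Return, per query, the corpus entries whose cosine similarity
    exceeds the threshold. With normalized embeddings, cosine similarity
    reduces to a dot product."""
    q, c = embed(queries), embed(corpus)
    sims = q @ c.T  # (num_queries, num_corpus) similarity matrix
    return [
        [(corpus[j], float(sims[i, j]))
         for j in np.argsort(-sims[i]) if sims[i, j] >= threshold]
        for i in range(len(queries))
    ]

# With the dummy embeddings above nothing clears the threshold; a real
# model would map "hello worldd" and "hello world" close together.
print(near_duplicates(["hello worldd"], ["hello world", "goodbye"]))
```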

Novel Benchmark: W4NT3D

Alongside RETSim, the authors present a new benchmark named W4NT3D (Wiki-40B 4dversarial Near-T3xt Dataset). W4NT3D is designed to assess the effectiveness of text retrieval algorithms against adversarial manipulations in a multilingual setting. The benchmark comprises approximately 400k text pairs and measures the retrieval of syntactically similar near-duplicate texts under typographical modifications and other text distortions.
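
The exact perturbation mix in W4NT3D is defined by the benchmark itself; the sketch below merely illustrates the kind of character-level edits that adversarial near-duplicate pairs are built from (the edit set and rate are illustrative assumptions):

```python
import random

def perturb(text: str, rate: float = 0.05, seed: int = 42) -> str:
    """Apply random character-level edits (delete, swap, substitute,
    insert) at roughly `rate` of positions, yielding a near-duplicate."""
    rng = random.Random(seed)
    chars, out, i = list(text), [], 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(["delete", "swap", "substitute", "insert"])
            if op == "delete":
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])
                i += 2
                continue
            if op == "substitute":
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
                i += 1
                continue
            if op == "insert":
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
                # fall through so the original character is also emitted
        out.append(chars[i])
        i += 1
    return "".join(out)

print(perturb("Resilient and efficient text similarity"))
```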

Analysis and Applications

In empirical evaluations, RETSim outperforms traditional n-gram-based algorithms and maintains its robustness in multilingual settings, responding effectively to a wide variety of text modifications. It performs competitively on desktop GPUs, with further optimization for high-end GPUs anticipated, making it a promising approach for both text retrieval tasks and spam detection applications.
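
As a concrete application pattern, dataset deduplication can be framed as clustering items whose embedding similarity exceeds a threshold. Here is a minimal union-find sketch under that assumption (the threshold and the brute-force O(n^2) comparison are illustrative; a production pipeline would use approximate nearest-neighbor search):

```python
import numpy as np

def dedup_clusters(embeddings: np.ndarray, threshold: float = 0.9) -> list:
    """Group L2-normalized embeddings whose pairwise cosine similarity
    exceeds the threshold; returns one cluster id per item."""
    n = len(embeddings)
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    sims = embeddings @ embeddings.T  # cosine similarities (normalized inputs)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                parent[find(i)] = find(j)  # merge near-duplicates

    return [find(i) for i in range(n)]

# Keeping one representative per cluster id deduplicates the dataset.
```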

The creation of this model represents a significant step forward in the field of text similarity detection, providing a tool that strikes a balance between resilience to text modifications and operational efficiency. The authors' decision to open-source RETSim and the W4NT3D benchmark contributes to the accessibility of these advancements for both academic research and practical implementations.
