Semantic Deduplication (SemDedup)

Updated 13 August 2025
  • Semantic Deduplication is the process of identifying records that, while not identical, convey similar content using lexical, structural, and social features.
  • It employs methods like socio-semantic and monogram fingerprinting to construct keys that preserve word order and metadata for efficient duplicate detection.
  • The approach enhances heterogeneous data integration, as shown by its high precision and recall in building reliable corpora such as the TOR molecule dataset.

Semantic deduplication (SemDedup) is the process of detecting and managing records, texts, or documents that are not syntactically identical but convey the same or highly similar information content. In contrast to traditional deduplication, which relies on exact string or field-level matches, semantic deduplication leverages lexical, structural, and social information, often using language modeling, n-gram techniques, and optimized fingerprinting workflows. This approach is especially critical for integrating heterogeneous bibliographic datasets, where trivial unique keys are often absent due to differences in indexing, formatting, or social metadata.

1. Methodologies for Semantic Deduplication

The primary methodologies introduced focus on fingerprinting approaches that encode both lexical and social (“socio-semantic”) information into compact keys suitable for efficient duplicate detection. Two key algorithms are emphasized:

  • Socio-Semantic Fingerprinting (SSF):
    • Constructs a key by concatenating bigrams extracted from the title and the first author’s name.
    • For a document, the title is converted to lower case and parsed into words. For the first $N = 8$ words, the first two characters (the bigram) of each word are retained, preserving original word order. The first bigram of the author’s name is appended, yielding a compact key.
  • Monogram Fingerprinting (MGF):
    • Operates exclusively at the character level on the lowercase title.
    • The algorithm retains the first occurrence (in reading order) of every letter in the title, generating a sequence that ignores subsequent repetitions and non-alphabetic characters.

Both methods are designed for high efficiency, with key extraction linear in the number of documents and fast candidate matching via hash table lookups. A variant, Sorted Monogram Fingerprinting (SMGF), which ignores character order, is shown to perform poorly, underscoring the necessity of sequence preservation in semantic deduplication.
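For concreteness, a minimal Python sketch of the two fingerprinting schemes described above, under the stated assumptions (the first N = 8 title words each contribute a two-character prefix, and the first author’s bigram is appended); function names such as `ssf_key` and `mgf_key` are illustrative, not taken from the original paper:

```python
import re

N_WORDS = 8  # number of leading title words used by SSF, per the description above


def ssf_key(title: str, first_author: str) -> str:
    """Socio-Semantic Fingerprint: bigrams of the first N title words plus the author bigram."""
    words = re.findall(r"[a-z]+", title.lower())
    bigrams = [w[:2] for w in words[:N_WORDS]]   # keep original word order
    bigrams.append(first_author.lower()[:2])     # social cue: first author's bigram
    return "".join(bigrams)


def mgf_key(title: str) -> str:
    """Monogram Fingerprint: first occurrence of each letter, in reading order."""
    seen, out = set(), []
    for ch in title.lower():
        if ch.isalpha() and ch not in seen:      # skip repeats and non-alphabetic characters
            seen.add(ch)
            out.append(ch)
    return "".join(out)


def smgf_key(title: str) -> str:
    """Sorted variant (SMGF): discards character order -- reported to perform poorly."""
    return "".join(sorted(mgf_key(title)))


# Two differently formatted records of the same (hypothetical) paper share a key.
a = ssf_key("The target of rapamycin pathway in yeast", "Smith")
b = ssf_key("The Target of Rapamycin pathway in yeast.", "Smith J")
assert a == b
```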

2. Feature Extraction: N-grams and Key Construction

Language models grounded in n-gram theory form the basis of key extraction:

  • An n-gram is formally defined as a sequence of $n$ contiguous characters from a given alphabet $A$: $\{C_1, \dots, C_n\},\ C_i \in A$. These provide a robust lexical signature, largely invariant to minor rephrasings or formatting differences.
  • In SSF, the key for a document is formed as $K_{\text{SSF}} = (\text{bigram}_1, \dots, \text{bigram}_N, \text{bigram}_{\text{author}})$, capturing both word order and authorial context.
  • In MGF, the process parses the title character by character, recording only the first instance of each alphabet character. The SMGF variant alphabetically sorts the resulting character set, but this significantly degrades discriminative performance, highlighting that semantic similarity in real data retains strong ties to sequence order.

This use of n-grams (and optional collocations, or n-word chunks) bridges the gap between naive string matching and robust semantic equivalence.
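As a concrete illustration of the definition above, a character n-gram extractor can be sketched as follows (the function name is an assumption for illustration):

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """All contiguous character n-grams {C_1, ..., C_n} over the lowercase alphabet."""
    chars = [c for c in text.lower() if c.isalpha()]  # restrict to the alphabet A
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


# Formatting noise does not change the lexical signature.
assert char_ngrams("Kinase") == char_ngrams("kinase,")
```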

3. Evaluation Protocols and Empirical Results

Evaluation leverages established information retrieval metrics:

  • Recall: Proportion of true duplicate pairs correctly identified.
  • Precision: Proportion of identified duplicate pairs that are truly duplicates.

Experimental validation was conducted against a gold standard based on PubMed and Web of Science records, specifically for a manually curated TOR molecule corpus.
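Against such a gold standard, both metrics reduce to simple set operations over duplicate pairs; the pair representation below (frozensets of record identifiers) is an assumption made for illustration:

```python
def precision_recall(predicted: set, gold: set) -> tuple:
    """Precision and recall over unordered duplicate pairs (frozensets of record IDs)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# One missed pair (false negative), no false positives: precision 1.0, recall 0.5.
gold = {frozenset({"pmid:1", "wos:9"}), frozenset({"pmid:2", "wos:7"})}
pred = {frozenset({"pmid:1", "wos:9"})}
print(precision_recall(pred, gold))  # (1.0, 0.5)
```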

| Method    | Recall (%)  | Precision (%) | False Negatives | False Positives |
|-----------|-------------|---------------|-----------------|-----------------|
| SSF       | 95          | 100           | 10              | 0               |
| MGF       | >95         | 100           | 8               | 0               |
| TF / MTF  | 64.9–74.8   | 100           | >50             | 0               |
| SVS / CSB | up to 99.5  | <100          | <3              | Several         |

Monte Carlo (random) baselines gave around 46% for both recall and precision, demonstrating the challenge of semantic deduplication in the absence of robust features.

Additionally, similarity-based approaches such as Salton Vector Space (using cosine similarity) and Jaccard coefficient for collocation-based matching were considered; while these can yield very high recall, they typically admit more false positives than the key-based SSF and MGF methods.
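A hedged sketch of the two similarity measures mentioned: cosine similarity over term-frequency vectors (Salton vector space) and the Jaccard coefficient over collocation sets. Tokenization and any decision threshold are assumptions not specified here:

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Salton vector-space similarity over simple term-frequency vectors."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(x * x for x in va.values())) * math.sqrt(sum(x * x for x in vb.values()))
    return dot / norm if norm else 0.0


def jaccard(collocations_a: set, collocations_b: set) -> float:
    """Jaccard coefficient over collocation (n-word chunk) sets."""
    union = collocations_a | collocations_b
    return len(collocations_a & collocations_b) / len(union) if union else 0.0
```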

4. Comparative Analysis with Alternative Approaches

The paper benchmarks ten deduplication strategies ranging from simple keys to sophisticated similarity calculations. Key conclusions are:

  • Author-only keys (AF): Perfect recall but low precision (~64%), as author names alone are not distinctive.
  • Title keys (TF) and Modified Title Keys (MTF): Show high precision but reduced recall due to title variations.
  • N-gram-based fingerprints (SSF, MGF): Achieve the best balance, with very high recall and perfect (100%) precision, outperforming both naïve and highly fuzzy techniques.
  • Order preservation: Methods ignoring character/word order (e.g., SMGF) perform very poorly, underscoring the necessity of preserving the semantic sequence.

The superiority of SSF and MGF is attributed to their simultaneous encoding of lexical, sequential, and social features in compact representations, allowing for highly efficient and accurate hash-based candidate generation.
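The hash-based candidate generation referred to here can be sketched as grouping records by fingerprint key; `group_by_key` and the record fields in the usage comment are illustrative names, not from the source:

```python
from collections import defaultdict


def group_by_key(records: list, key_fn) -> list:
    """Bucket records by fingerprint; any bucket with more than one record is a candidate duplicate set."""
    buckets = defaultdict(list)
    for rec in records:                       # single linear pass over the corpus
        buckets[key_fn(rec)].append(rec)
    return [group for group in buckets.values() if len(group) > 1]


# Usage with the SSF key sketched earlier (field names are assumptions):
# candidates = group_by_key(records, lambda r: ssf_key(r["title"], r["first_author"]))
```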

5. Application: Data Integration for the TOR Molecule Corpus

The deduplication workflow was deployed to build an integrated corpus spanning PubMed (about 7,700 filtered documents) and Web of Science (about 16,200 documents) for research on the TOR molecule. This approach enables extraction of gene interaction networks and regulatory relationships by consolidating non-trivially overlapping corpora. The deduplication step is essential; heterogeneity in gene nomenclature, title formatting, and metadata across sources means trivial key matches are insufficient.

Through efficient deduplication, data quality and reliability in downstream biological network mining are significantly enhanced, supporting advanced research on TOR and associated gene functions.

6. Implications and Generalization to Semantic Deduplication

The findings demonstrate several critical principles for semantic deduplication beyond the specific domain:

  • Feature Selection: Carefully designed n-gram and monogram keys, especially those capturing word order and social context, are robust to source variability and resistant to false matches.
  • Integration of Social Cues: Including metadata such as author names increases discriminatory power, especially when title or content alone is insufficiently distinctive.
  • Beyond String Matching: Semantic deduplication outperforms basic string or token matching by capturing document structure, sequence, and context—a necessity for integrating large, heterogeneous repositories.
  • Scalability: The linear-time key extraction and hash-based lookup methods ensure scalability to very large datasets, with complexity $O(N + M)$ for key construction and $O(N \log M)$ for matching, where $N$ and $M$ are the sizes of the corpora (see the sketch following this list).
  • Broader Applicability: While demonstrated in bibliographic data integration, these techniques provide a template for deduplication in emails, document archives, blogs, and any domain characterized by structural heterogeneity and absent unique identifiers.
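To mirror the stated matching complexity, one corpus’s keys can be sorted once and each test key located by binary search, giving $O(\log M)$ per test document and $O(N \log M)$ overall; a plain hash table would give expected constant-time lookups instead. A minimal sketch with illustrative names:

```python
from bisect import bisect_left


def match_corpora(keys_m: list, keys_n: list) -> list:
    """Return indices of test keys (corpus of size N) whose fingerprint also occurs in the reference corpus (size M)."""
    sorted_m = sorted(keys_m)                 # one-off preparation of the reference corpus
    matches = []
    for i, key in enumerate(keys_n):          # O(log M) binary search per test key
        pos = bisect_left(sorted_m, key)
        if pos < len(sorted_m) and sorted_m[pos] == key:
            matches.append(i)
    return matches
```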

7. Conclusion

Semantic deduplication, as developed in this line of research, integrates efficient n-gram–based and monogram–based fingerprinting approaches with selective inclusion of social metadata to robustly identify duplicates even in highly heterogeneous bibliographic repositories. Quantitative evaluations confirm that such methods attain near-optimal recall (>95%) and perfect precision, yielding high-integrity corpora for downstream analytics. These findings reinforce the imperative of semantic, order-preserving, and socially informed feature design in record linkage and deduplication for modern scientific data integration scenarios (Turenne, 2015).
