
DarkDiff: Explainable web page similarity of TOR onion sites (2308.12134v1)

Published 23 Aug 2023 in cs.CR

Abstract: In large-scale data analysis, near-duplicates are often a problem. For example, with two near-duplicate phishing emails, a difference in the salutation (Mr versus Ms) is not essential, but whether it is bank A or B is important. The state-of-the-art in near-duplicate detection is a black box approach (MinHash), so one only knows that emails are near-duplicates, but not why. We present DarkDiff, which can efficiently detect near-duplicates while providing the reason why there is a near-duplicate. We have developed DarkDiff to detect near-duplicates of homepages on the Darkweb. DarkDiff works well on those pages because they resemble the clear web of the past.
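The contrast the abstract draws between a black-box similarity score and an explained difference can be illustrated with a minimal sketch. This is not DarkDiff's algorithm, which the abstract does not describe. It pairs a simple MinHash estimate with a word-level diff from Python's standard `difflib`. The function names, the shingle size `k=3`, the signature length of 64, and the sample emails are all assumptions made for illustration.

```python
import difflib
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumed choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """Build a MinHash signature: the minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_similarity(text_a, text_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots.

    This is the black-box part: the result is a single number with no reason attached.
    """
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def explain_difference(text_a, text_b):
    """List the word-level edits that separate two texts (the explanation part)."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b)
    return [
        (tag, words_a[i1:i2], words_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical phishing emails modeled on the abstract's example.
email_a = "Dear Mr Smith your account at Bank A has been locked"
email_b = "Dear Ms Smith your account at Bank B has been locked"

print(estimated_similarity(email_a, email_b))  # a score only
print(explain_difference(email_a, email_b))    # which words differ
```

The score alone cannot say whether the two emails differ in the salutation or in the bank name. The diff surfaces both differences, leaving an analyst (or a rule) to decide which one matters.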
