Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
Gemini 2.5 Flash
Gemini 2.5 Flash 77 tok/s
Gemini 2.5 Pro 51 tok/s Pro
GPT-5 Medium 33 tok/s Pro
GPT-5 High 37 tok/s Pro
GPT-4o 95 tok/s Pro
Kimi K2 189 tok/s Pro
GPT OSS 120B 431 tok/s Pro
Claude Sonnet 4.5 35 tok/s Pro
2000 character limit reached

FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration (2501.01046v3)

Published 2 Jan 2025 in cs.CL

Abstract: Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of LLMs. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework, FED, that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions. FED significantly outperforms the CPU-based deduplication tool in SlimPajama (using 64 logical CPU cores) by up to 107.2 times and the GPU-based tool in NVIDIA NeMo Curator by up to 6.3 times when processing 30 million documents on a node with four GPUs. Notably, our method dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speed-ups of up to 260 compared to the CPU baseline. Despite these gains in efficiency, FED maintains high deduplication quality, with the duplicate document sets reaching a Jaccard similarity of over 0.96 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 6 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (\href{https://github.com/mcrl/FED}{https://github.com/mcrl/FED}).

Summary

We haven't generated a summary for this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Github Logo Streamline Icon: https://streamlinehq.com

Don't miss out on important new AI/ML research

See which papers are being discussed right now on X, Reddit, and more:

“Emergent Mind helps me see which AI papers have caught fire online.”

Philip

Philip

Creator, AI Explained on YouTube