Emma
Summary:
Researchers have developed a method called SemDeDup that uses pre-trained models to identify and remove semantic duplicates in large datasets, which can speed up machine learning training time. The researchers found that SemDeDup was able to remove 50% of the data in a subset of the LAION dataset with minimal performance loss, and also improved performance out of
distribution.