Document Counting in Practice (1409.6780v2)
Published 24 Sep 2014 in cs.DS
Abstract: We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are still at a theoretical stage. We implement these solutions and develop some new variants, comparing them experimentally on various datasets. Our results not only show which options are best in each situation and help discard practically unappealing solutions, but also uncover some unexpected compressibility properties of the best data structures. By taking advantage of these properties, we can reduce the size of the structures by a factor of 5 to 400, depending on the dataset.
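To make the problem statement concrete, here is a minimal Python sketch of a naive baseline that scans every string in the collection. It is not one of the data structures studied in the paper; the function names `count_documents` and `count_occurrences` and the sample collection are assumptions made for illustration only. The sketch shows how document counting (how many strings contain the pattern) differs from occurrence counting (how many times the pattern appears in total).

```python
def count_documents(collection, pattern):
    """Naive document counting: number of strings that contain the pattern.

    Hypothetical baseline for illustration; it scans the whole collection
    on every query instead of using a precomputed index.
    """
    return sum(1 for doc in collection if pattern in doc)


def count_occurrences(collection, pattern):
    """Total number of (possibly overlapping) occurrences of the pattern."""
    total = 0
    for doc in collection:
        start = doc.find(pattern)
        while start != -1:
            total += 1
            # Advance by one position so overlapping matches are counted.
            start = doc.find(pattern, start + 1)
    return total


# Illustrative data, not taken from the paper's datasets.
docs = ["banana", "bandana", "cabana", "apple"]
print(count_documents(docs, "ana"))    # strings containing "ana"
print(count_occurrences(docs, "ana"))  # total occurrences of "ana"
```

On this sample, "banana" contains the pattern twice but contributes only once to the document count, which is why the two quantities differ. The scan takes time proportional to the total length of the collection per query, which is the cost that indexed solutions aim to avoid.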