Optimal Data Selection: An Online Distributed View (2201.10547v3)
Abstract: The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. We seek to solve this problem at its core, collecting only valuable data and throwing out the rest via submodular maximization. Specifically, we develop algorithms for the online and distributed version of the problem, where data selection occurs in an uncoordinated fashion across multiple data streams. We design a general and flexible core selection routine for our algorithms which, given any stream of data, any assessment of its value, and any formulation of its selection cost, extracts the most valuable subset of the stream up to a constant factor while using minimal memory. Notably, our methods have the same theoretical guarantees as their offline counterparts, and, as far as we know, provide the first guarantees for online distributed submodular optimization in the literature. Finally, in learning tasks on ImageNet and MNIST, we show that our selection methods outperform random selection by $5$–$20\%$.
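To make the idea of selecting from a stream by value and cost concrete, here is a minimal sketch of a single-threshold streaming rule: an item is kept only if its marginal gain, minus its cost, clears a fixed threshold. This is not the paper's selection routine. The coverage-style value function, the constant cost, the threshold, the budget, and all function names are assumptions chosen purely for illustration.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, List, Set

Item = FrozenSet[Hashable]


def coverage_value(selected: List[Item]) -> float:
    """Toy submodular value (assumed): number of distinct features covered."""
    covered: Set[Hashable] = set()
    for item in selected:
        covered |= item
    return float(len(covered))


def threshold_select(
    stream: Iterable[Item],
    value: Callable[[List[Item]], float],
    cost: Callable[[Item], float],
    threshold: float,
    budget: int,
) -> List[Item]:
    """One pass over the stream, keeping at most `budget` items in memory.

    An item is accepted when (marginal gain - cost) >= threshold.
    This is a hypothetical rule for illustration only.
    """
    selected: List[Item] = []
    current = value(selected)
    for item in stream:
        if len(selected) >= budget:
            break  # memory budget exhausted; stop selecting
        gain = value(selected + [item]) - current
        if gain - cost(item) >= threshold:
            selected.append(item)
            current += gain
    return selected


# Each item is a set of "features"; redundant items add little new coverage.
stream = [
    frozenset({1, 2}),
    frozenset({2}),
    frozenset({3, 4, 5}),
    frozenset({1, 5}),
    frozenset({6}),
]
picked = threshold_select(
    stream, coverage_value, cost=lambda item: 0.5, threshold=1.0, budget=3
)
print(picked, coverage_value(picked))
```

In this toy run the redundant items are rejected because they add no new coverage, and the last item is rejected because its gain of 1 minus the cost of 0.5 falls below the threshold. A fixed threshold is the simplest possible choice; how to set or adapt it, and what guarantees follow, is exactly what the paper addresses.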