Optimal Data Selection: An Online Distributed View (2201.10547v3)

Published 25 Jan 2022 in cs.LG, cs.AI, and cs.MA

Abstract: The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. We seek to solve this problem at its core, collecting only valuable data and throwing out the rest via submodular maximization. Specifically, we develop algorithms for the online and distributed version of the problem, where data selection occurs in an uncoordinated fashion across multiple data streams. We design a general and flexible core selection routine for our algorithms which, given any stream of data, any assessment of its value, and any formulation of its selection cost, extracts the most valuable subset of the stream up to a constant factor while using minimal memory. Notably, our methods have the same theoretical guarantees as their offline counterparts, and, as far as we know, provide the first guarantees for online distributed submodular optimization in the literature. Finally, in learning tasks on ImageNet and MNIST, we show that our selection methods outperform random selection by 5 to 20%.
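
To make the setting in the abstract concrete, here is a minimal Python sketch of a generic single-pass, threshold-based selection rule for one stream. It is not the paper's routine and carries none of its guarantees. The names `coverage_value` and `threshold_stream_select`, the toy coverage objective, the fixed `threshold`, and the `budget` parameter are all assumptions made for illustration.

```python
from typing import Callable, Hashable, Iterable, List, Set

Item = Set[Hashable]


def coverage_value(selected: List[Item]) -> int:
    """Toy submodular value (assumed): number of distinct elements covered."""
    covered: Set[Hashable] = set()
    for s in selected:
        covered |= s
    return len(covered)


def threshold_stream_select(
    stream: Iterable[Item],
    value: Callable[[List[Item]], float],
    cost: Callable[[Item], float],
    threshold: float,
    budget: float,
) -> List[Item]:
    """Keep an item when its marginal gain per unit cost meets the threshold.

    Hypothetical illustration only: one pass over the stream, memory limited
    to the selected items, and a threshold fixed in advance.
    """
    selected: List[Item] = []
    spent = 0.0
    current = value(selected)
    for item in stream:
        c = cost(item)
        # Skip items with non-positive cost or that would exceed the budget.
        if c <= 0 or spent + c > budget:
            continue
        gain = value(selected + [item]) - current
        if gain / c >= threshold:
            selected.append(item)
            spent += c
            current += gain
    return selected


if __name__ == "__main__":
    stream = [{1, 2, 3}, {2, 3}, {4}, {1, 2, 3, 4, 5, 6}, {7, 8}]
    picked = threshold_stream_select(
        stream, coverage_value, cost=lambda s: 1.0, threshold=2.0, budget=3.0
    )
    print(picked, coverage_value(picked))
```

In this toy run the redundant item `{2, 3}` and the low-gain item `{4}` are discarded as they arrive, which is the behavior the abstract describes as collecting only valuable data. A distributed variant would run such a rule independently on each stream; how the paper coordinates or merges the per-stream results is not stated in the abstract.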
