Theoretical Analysis of Submodular Information Measures for Targeted Data Subset Selection (2402.13454v2)
Abstract: With the increasing volume of data used across machine learning tasks, the ability to target specific subsets of data becomes more important. To aid this capability, the recently proposed Submodular Mutual Information (SMI) has been applied effectively across numerous tasks in the literature to perform targeted subset selection with the aid of an exemplar query set. However, none of these works provides theoretical guarantees for SMI in terms of its sensitivity to a subset's relevance and coverage of the targeted data. For the first time, we provide such guarantees by deriving similarity-based bounds on quantities related to relevance and coverage of the targeted data. With these bounds, we show that the SMI functions, which have empirically shown success in multiple applications, are theoretically sound in achieving good query relevance and query coverage.
- Submodular combinatorial information measures with applications in machine learning. In Algorithmic Learning Theory, pages 722–754. PMLR, 2021.
- Generalized submodular information measures: Theoretical properties, examples, optimization algorithms, and applications. IEEE Transactions on Information Theory, 68(2):752–781, 2021.
- An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14:265–294, 1978.
- Similar: Submodular information measures based active learning in realistic scenarios. Advances in Neural Information Processing Systems, 34:18685–18697, 2021.
- Active data discovery: Mining unknown data using submodular information measures. arXiv preprint arXiv:2206.08566, 2022.
- Talisman: Targeted active learning for object detection with rare classes and slices using submodular mutual information. In European Conference on Computer Vision, pages 1–16. Springer, 2022.
- Clinical: Targeted active learning for imbalanced medical image classification. In Workshop on Medical Image Learning with Limited and Noisy Data, pages 119–129. Springer, 2022.
- Diagnose: Avoiding out-of-distribution data using submodular information measures. In Workshop on Medical Image Learning with Limited and Noisy Data, pages 141–150. Springer, 2022.
- Orient: Submodular mutual information measures for data subset selection under distribution shift. Advances in Neural Information Processing Systems, 35:31796–31808, 2022.
- Beyond active learning: Leveraging the full potential of human interaction via auto-labeling, human correction, and human verification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2881–2889, 2024.
- Platinum: Semi-supervised model agnostic meta-learning using submodular mutual information. In International Conference on Machine Learning, pages 12826–12842. PMLR, 2022.
- DITTO: Data-efficient and fair targeted subset selection for ASR accent adaptation. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5810–5822, Toronto, Canada, July 2023. Association for Computational Linguistics.
- Streamline: Streaming active learning for realistic multi-distributional settings. arXiv preprint arXiv:2305.10643, 2023.
- Prism: A rich class of parameterized submodular information measures for guided data subset selection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10238–10246, 2022.
- Satoru Fujishige. Submodular functions and optimization. Elsevier, 2005.
- Jeff Bilmes. Submodularity in machine learning and artificial intelligence. arXiv preprint arXiv:2202.00132, 2022.
- A memoization framework for scaling submodular optimization to large scale problems. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2340–2349. PMLR, 2019.
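The targeted selection procedure the abstract describes can be illustrated with one common SMI instantiation, the facility-location query mutual information (FLQMI), maximized greedily. This is a minimal sketch, not the paper's code: the toy similarity values and the `flqmi`/`greedy_select` names are hypothetical, and the two sums correspond to the query-coverage and query-relevance quantities the paper bounds.

```python
# Sketch of SMI-based targeted subset selection (illustrative, not the
# authors' implementation). FLQMI is one SMI function from this line of
# work: FLQMI(A; Q) = sum_{q in Q} max_{a in A} s(q, a)
#                   + sum_{a in A} max_{q in Q} s(q, a).

def flqmi(subset, query, sim):
    """FLQMI value of `subset` w.r.t. the query set, given a
    similarity table sim[q][a] between query and ground-set items."""
    if not subset:
        return 0.0
    # Coverage term: every query point should be near some selected item.
    coverage = sum(max(sim[q][a] for a in subset) for q in query)
    # Relevance term: every selected item should be near some query point.
    relevance = sum(max(sim[q][a] for q in query) for a in subset)
    return coverage + relevance

def greedy_select(ground, query, sim, budget):
    """Standard greedy maximization; since FLQMI is monotone submodular
    in A, greedy enjoys the classic (1 - 1/e) approximation guarantee."""
    selected, remaining = [], list(ground)
    for _ in range(budget):
        best = max(remaining, key=lambda a: flqmi(selected + [a], query, sim))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: items 0 and 1 each closely match one query point,
# item 3 is redundant with 0, and item 2 matches nothing.
sim = {"q0": {0: 0.9, 1: 0.1, 2: 0.2, 3: 0.8},
       "q1": {0: 0.15, 1: 0.9, 2: 0.2, 3: 0.1}}
picked = greedy_select([0, 1, 2, 3], ["q0", "q1"], sim, budget=2)
```

Here greedy picks the complementary pair `[0, 1]` rather than the redundant pair `[0, 3]`, illustrating how the SMI objective trades off query relevance against query coverage.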