Submodular Mutual Information (SMI)
- Submodular Mutual Information (SMI) is defined as I_f(A;Q) = f(A) + f(Q) - f(A∪Q) for a submodular set function f, generalizing Shannon's mutual information, which it recovers when f is an entropy function.
- Mathematical properties such as non-negativity, symmetry, and monotone submodularity in one argument enable near-optimal greedy maximization for data selection and optimization tasks.
- SMI is widely applied in targeted data subset selection, active learning, summarization, sensor placement, and multivariate information theory to balance relevance and diversity.
Submodular Mutual Information (SMI) is a combinatorial generalization of Shannon mutual information that emerges from the theory of submodular set functions, providing a mathematically rigorous and algorithmically tractable framework for quantifying the shared information content between sets of objects. SMI inherits the diminishing-returns property of submodular functions and is widely used to optimize data subset selection, active learning, summarization, and related tasks across machine learning and information theory. The formal structure and performance guarantees of SMI enable robust selection strategies balancing diversity, coverage, and query relevance.
1. Formal Definition and Mathematical Structure
Let $f: 2^V \to \mathbb{R}_{\ge 0}$ be a normalized, monotone, submodular set function over a ground set $V$, i.e., $f(\emptyset) = 0$, $f(A) \le f(B)$ for $A \subseteq B$, and for any $A \subseteq B \subseteq V$ and $v \notin B$:
$$f(A \cup \{v\}) - f(A) \;\ge\; f(B \cup \{v\}) - f(B).$$
This is the diminishing-returns property. Given two subsets $A, Q \subseteq V$, the Submodular Mutual Information is
$$I_f(A; Q) \;=\; f(A) + f(Q) - f(A \cup Q).$$
SMI quantifies the overlap in "information" between $A$ and $Q$ in the sense defined by $f$, generalizing Shannon mutual information when $f$ is an entropy function. For conditional SMI, given an additional "private" (conditioning) set $P$, the conditional form is
$$I_f(A; Q \mid P) \;=\; f(A \cup P) + f(Q \cup P) - f(A \cup Q \cup P) - f(P).$$
(Iyer et al., 2020, Kaushal et al., 2020, Beck et al., 2024, Kothawade et al., 2022)
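To make the definition concrete, the following minimal Python sketch evaluates SMI and conditional SMI for an arbitrary set function; a toy coverage function stands in for $f$ (the helper names `coverage_f`, `smi`, and `conditional_smi` are illustrative, not taken from the cited papers):

```python
# Minimal sketch: SMI from any normalized monotone submodular f,
# here a toy coverage function f(A) = |union of the concepts covered by A|.
from itertools import chain

def coverage_f(universe_sets, A):
    """f(A) = size of the union of the concepts covered by items in A."""
    return len(set(chain.from_iterable(universe_sets[i] for i in A)))

def smi(f, A, Q):
    """I_f(A; Q) = f(A) + f(Q) - f(A u Q)."""
    A, Q = set(A), set(Q)
    return f(A) + f(Q) - f(A | Q)

def conditional_smi(f, A, Q, P):
    """I_f(A; Q | P) = f(A u P) + f(Q u P) - f(A u Q u P) - f(P)."""
    A, Q, P = set(A), set(Q), set(P)
    return f(A | P) + f(Q | P) - f(A | Q | P) - f(P)

# Toy ground set: each item covers a few "concepts".
items = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"d"}, 3: {"a", "d"}}
f = lambda S: coverage_f(items, S)
print(smi(f, {0, 1}, {3}))                 # shared coverage between the two sets
print(conditional_smi(f, {0, 1}, {3}, {2}))
```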
2. Theoretical Properties
SMI retains several crucial properties under mild conditions:
- Non-negativity: $I_f(A; Q) \ge 0$ whenever $f$ is submodular. Equality holds if and only if $A$ and $Q$ are "independent" with respect to $f$, e.g., for the entropy function, when the corresponding random variables are mutually independent.
- Symmetry: $I_f(A; Q) = I_f(Q; A)$.
- Monotonicity: For fixed $Q$, $I_f(A; Q)$ is monotone non-decreasing in $A$.
- Submodularity in One Argument: For a wide class of base functions $f$ (including facility location, set cover, and graph cut), $I_f(A; Q)$ is also submodular in $A$ for fixed $Q$. Sufficient conditions involve non-negativity of certain higher-order discrete derivatives of $f$ (Iyer et al., 2020).
- Bounds: $0 \le I_f(A; Q) \le \min\big(f(A), f(Q)\big)$.
- Approximation Guarantee: Maximizing a monotone submodular SMI under a cardinality constraint with the greedy algorithm yields a $1 - 1/e$ approximation factor (Nemhauser et al., 1978).
(Iyer et al., 2020, Kaushal et al., 2020, Beck et al., 2024, Beck et al., 2024, Li et al., 2022, Kothawade et al., 2021)
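These properties can be spot-checked numerically. The sketch below exhaustively verifies non-negativity, symmetry, monotonicity in $A$, and the upper bound on a small, made-up coverage instance (purely illustrative):

```python
# Numerically checking the properties above for a toy coverage function.
from itertools import chain, combinations

items = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"d"}, 3: {"a", "d"}}
f = lambda S: len(set(chain.from_iterable(items[i] for i in S)))  # coverage f
smi = lambda A, Q: f(A) + f(Q) - f(A | Q)

ground = set(items)
subsets = [set(c) for r in range(len(ground) + 1) for c in combinations(ground, r)]
for A in subsets:
    for Q in subsets:
        v = smi(A, Q)
        assert v >= -1e-9                          # non-negativity
        assert abs(v - smi(Q, A)) < 1e-9           # symmetry
        assert v <= min(f(A), f(Q)) + 1e-9         # upper bound
        for x in ground - A:                       # monotone non-decreasing in A
            assert smi(A | {x}, Q) >= v - 1e-9
print("all properties hold on the toy instance")
```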
3. Instantiations and Closed-Form Variants
SMI admits concrete, efficient forms for many popular submodular functions:
| SMI Variant | Base Function $f$ / Formula | SMI Expression $I_f(A;Q)$ (Simplified) |
|---|---|---|
| Facility-Location (FLVMI) | $f(A) = \sum_{i \in V} \max_{j \in A} s_{ij}$ | $\sum_{i \in V} \min\big(\max_{j \in A} s_{ij},\, \max_{j \in Q} s_{ij}\big)$ |
| Graph-Cut (GCMI) | $f(A) = \sum_{i \in V}\sum_{j \in A} s_{ij} - \lambda \sum_{i, j \in A} s_{ij}$ | $2\lambda \sum_{i \in A}\sum_{j \in Q} s_{ij}$ |
| Log-Determinant (LogDetMI) | $f(A) = \log\det(S_A)$ (for a PSD kernel $S$, with $S_A$ the principal submatrix indexed by $A$) | $\log\det(S_A) - \log\det\big(S_A - S_{A,Q}\, S_Q^{-1}\, S_{A,Q}^{\top}\big)$ |
| Prob. Set Cover (PSCMI) | $f(A) = \sum_{i} w_i \big(1 - \prod_{j \in A}(1 - p_{ij})\big)$ | $\sum_{i} w_i \big(1 - \prod_{j \in A}(1 - p_{ij})\big)\big(1 - \prod_{j \in Q}(1 - p_{ij})\big)$ |
These variants enable direct encoding of coverage, relevance, and diversity in various application domains. For multivariate generalizations, SMI further encompasses:
- Total Correlation: $\mathcal{C}_f(A_1, \dots, A_k) = \sum_{i=1}^{k} f(A_i) - f\big(\bigcup_{i=1}^{k} A_i\big)$
- Dual Total Correlation: $\mathcal{D}_f(A_1, \dots, A_k) = \sum_{i=1}^{k} f\big(\bigcup_{j \ne i} A_j\big) - (k-1)\, f\big(\bigcup_{i=1}^{k} A_i\big)$
- Fractional SMI: $I_f^{\alpha}(A_1; \dots; A_k) = \sum_{S} \alpha_S\, f\big(\bigcup_{i \in S} A_i\big) - f\big(\bigcup_{i=1}^{k} A_i\big)$ for a fractional partition $\alpha$ (see Section 7)

(Jakhar et al., 21 Jan 2025, Iyer et al., 2020, Kaushal et al., 2020, Beck et al., 2024, Kothawade et al., 2021, Li et al., 2022, Beck et al., 2024)
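The closed forms in the table above translate directly into code. Below is an illustrative NumPy sketch of the facility-location, graph-cut, and log-determinant variants; the kernel construction and set choices are assumptions made for the example:

```python
# Sketch of three closed-form SMI variants, given a pairwise similarity matrix S.
import numpy as np

def fl_vmi(S, A, Q):
    """Facility-location SMI: sum_i min(max_{j in A} S[i,j], max_{j in Q} S[i,j])."""
    A, Q = list(A), list(Q)
    return np.minimum(S[:, A].max(axis=1), S[:, Q].max(axis=1)).sum()

def gc_mi(S, A, Q, lam=1.0):
    """Graph-cut SMI: 2 * lambda * sum_{i in A, j in Q} S[i,j]."""
    return 2.0 * lam * S[np.ix_(list(A), list(Q))].sum()

def logdet_mi(S, A, Q):
    """Log-determinant SMI: log det S_A - log det(S_A - S_{A,Q} S_Q^{-1} S_{Q,A})."""
    A, Q = list(A), list(Q)
    S_A, S_Q = S[np.ix_(A, A)], S[np.ix_(Q, Q)]
    S_AQ = S[np.ix_(A, Q)]
    schur = S_A - S_AQ @ np.linalg.solve(S_Q, S_AQ.T)
    return np.linalg.slogdet(S_A)[1] - np.linalg.slogdet(schur)[1]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
S = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1))  # PSD Laplacian-type kernel
A, Q = {0, 1, 2}, {3, 4}
print(fl_vmi(S, A, Q), gc_mi(S, A, Q), logdet_mi(S, A, Q))
```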
4. Algorithmic Optimization
Greedy maximization is provably near-optimal for monotone submodular SMI variants. At each step, the element yielding the highest marginal gain is added:
$$A \;\leftarrow\; A \cup \Big\{\arg\max_{v \in V \setminus A} \big[I_f(A \cup \{v\}; Q) - I_f(A; Q)\big]\Big\}.$$
A lazy-greedy sketch is given after this list. Techniques improving efficiency include:
- Lazy Evaluation: Maintaining a priority queue of marginal gains for fast selection (Beck et al., 2024).
- Memoization: Caching intermediate quantities substantially reduces the per-step cost of marginal-gain evaluations for selected forms (Beck et al., 2024, Kothawade et al., 2021, Kothawade et al., 2021).
- Partitioning: For very large pools, pool partitioning enables distributed greedy selection (Kothawade et al., 2021).
- Theoretical guarantees: For classes of SMI with monotone submodularity, the greedy solution is within a factor of $1 - 1/e$ of the optimum (Iyer et al., 2020, Beck et al., 2024).
(Beck et al., 2024, Kothawade et al., 2021, Kothawade et al., 2021, Iyer et al., 2020)
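As a concrete illustration of the greedy step with lazy evaluation, here is a sketch of lazy-greedy maximization of the facility-location SMI under a cardinality budget (function names and the kernel are assumptions, not taken from the cited implementations):

```python
# Lazy-greedy maximization of facility-location SMI with query set Q.
import heapq
import numpy as np

def flvmi_gain(S, q_max, cur_max, v):
    """Marginal SMI gain of adding item v to the current set A."""
    new_max = np.maximum(cur_max, S[:, v])
    return (np.minimum(new_max, q_max) - np.minimum(cur_max, q_max)).sum()

def lazy_greedy_flvmi(S, Q, budget):
    Q = set(Q)
    q_max = S[:, list(Q)].max(axis=1)        # best similarity of each point to Q
    cur_max = np.zeros(S.shape[0])           # best similarity of each point to A (A empty)
    candidates = [v for v in range(S.shape[0]) if v not in Q]
    # Max-heap of (negated) possibly-stale gains; only the heap top is re-evaluated.
    heap = [(-flvmi_gain(S, q_max, cur_max, v), v) for v in candidates]
    heapq.heapify(heap)
    A = []
    while heap and len(A) < budget:
        _, v = heapq.heappop(heap)
        fresh = flvmi_gain(S, q_max, cur_max, v)
        if heap and fresh < -heap[0][0] - 1e-12:      # stale: re-insert with fresh gain
            heapq.heappush(heap, (-fresh, v))
            continue
        A.append(v)
        cur_max = np.maximum(cur_max, S[:, v])
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
S = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1))
print(lazy_greedy_flvmi(S, Q={0, 1}, budget=5))
```

Because marginal gains can only shrink as the selected set grows (submodularity), the stale value at the heap top is an upper bound, so accepting an element whose fresh gain still dominates the next stale entry reproduces the exact greedy choice with far fewer gain evaluations.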
5. Applications in Machine Learning and Information Theory
SMI is foundational in multiple domains:
- Targeted Data Subset Selection: Selecting unlabeled samples maximally "mutually informative" with a query/exemplar set boosts rare-class and overall performance in both vision and language (Kothawade et al., 2021, Beck et al., 2024).
- Active Learning: SMI-guided acquisition functions outperform uncertainty/diversity heuristics, especially under class imbalance, rare slices, or OOD data (Kothawade et al., 2021, Kothawade et al., 2021, Kothawade et al., 2022).
- Summarization: Query-focused, privacy-preserving, and update data summarization are unified under SMI with explicit, interpretable objectives (Kaushal et al., 2020, Iyer et al., 2020).
- Meta-Learning and Semi-Supervision: In episodic meta-learning, per-class SMI acquisition promotes balanced pseudo-labeling, resilience to OOD, and robust adaptation (Li et al., 2022).
- In-context Retrieval and Ranking: Jointly maximizing query relevance and exemplar diversity via SMI yields state-of-the-art in-context retrieval across question-answering and NLU tasks (Nanda et al., 28 Aug 2025).
- Sensor Placement: For Gaussian sources with additive noise, classical mutual information is submodular, rendering greedy sensor selection near-optimal (Crowley et al., 2024).
- Multivariate Information Theory: SMI with fractional partitions unifies total correlation, dual total correlation, and shared information, linking combinatorial inequalities, entropic inequalities, and matrix analytic results (Jakhar et al., 21 Jan 2025).
(Kothawade et al., 2021, Beck et al., 2024, Kaushal et al., 2020, Kothawade et al., 2021, Kothawade et al., 2021, Li et al., 2022, Kothawade et al., 2022, Beck et al., 2024, Jakhar et al., 21 Jan 2025, Nanda et al., 28 Aug 2025, Crowley et al., 2024)
6. Theoretical Guarantees: Sensitivity to Coverage and Relevance
Explicit similarity-based performance bounds tie SMI scores to actionable metrics:
- Query Relevance: For several variants (e.g., FLVMI, GCMI), explicit lower and upper bounds on the number of true targets selected hold under mild assumptions on the similarity distributions.
- Query Coverage: Similar bounds hold for the average coverage of the remaining targeted points or queries, showing either guaranteed query coverage, high relevance, or an explicit trade-off governed by the SMI parameters.
- Sensitivity Trade-off: Facility-Location SMI is highly sensitive to coverage but less to relevance; Graph-Cut SMI is the converse. Facility-Location Query Mutual Information (FLQMI) and Concave-Over-Modular interpolate between these extremes, and adjusting an interpolation parameter trades off between the two (Beck et al., 2024).
- Tightness: As the separation between targeted and untargeted similarity increases, bounds on relevance and coverage become tight, explaining SMI's empirical performance.
7. Multivariate and Fractional SMI: Generalizations and Deep Connections
Fractional SMI, defined for any fractional partition $\alpha = \{\alpha_S\}$ with $\alpha_S \ge 0$ and $\sum_{S \ni i} \alpha_S = 1$ for every $i$, is given by
$$I_f^{\alpha}(A_1; \dots; A_k) \;=\; \sum_{S} \alpha_S\, f\Big(\bigcup_{i \in S} A_i\Big) - f\Big(\bigcup_{i=1}^{k} A_i\Big).$$
This framework unifies total correlation, dual total correlation, and shared information. Key properties include:
- Non-negativity: Vanishes if and only if the ground variables are independent.
- Maximum is Total Correlation: Among all fractional partitions, the singleton (total correlation) is maximal.
- Data Processing and Chain Rule: Satisfies strong multivariate data processing and recursion relations.
- Determinantal Inequalities: Fractional SMI recovers matrix inequalities (e.g., Hadamard–Fischer–Szász) for positive definite kernels when applied to log-determinant functions (Jakhar et al., 21 Jan 2025).
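Assuming the fractional-SMI form stated above, the following sketch instantiates it with a log-determinant base function and checks the resulting Hadamard/Szász-type inequalities on a random positive-definite kernel (the function names and the kernel are illustrative assumptions):

```python
# Fractional SMI with a log-determinant base function recovers determinantal inequalities.
import numpy as np

def logdet_f(K, A):
    A = sorted(A)
    return np.linalg.slogdet(K[np.ix_(A, A)])[1] if A else 0.0

def fractional_smi(K, blocks, weights):
    """I_f^alpha = sum_S alpha_S f(A_S) - f(union of all blocks)."""
    union = sorted(set().union(*blocks))
    return sum(w * logdet_f(K, b) for b, w in zip(blocks, weights)) - logdet_f(K, union)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 5))
K = X @ X.T + 5 * np.eye(5)                       # positive-definite kernel

singletons = [{i} for i in range(5)]
tc = fractional_smi(K, singletons, [1.0] * 5)     # total correlation
print(tc >= 0)                                    # Hadamard: prod_i K_ii >= det(K)

leave_one_out = [set(range(5)) - {i} for i in range(5)]
dtc_like = fractional_smi(K, leave_one_out, [1 / 4] * 5)   # dual-total-correlation-like
print(tc >= dtc_like >= 0)   # singleton partition is maximal among fractional partitions
```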
In summary, Submodular Mutual Information forms a rigorous backbone for a diverse array of data selection tasks, interpolation between coverage and relevance objectives, and generalization of classical information measures. Its theoretical underpinnings and practical instantiations yield computationally efficient, provably good selection mechanisms with broad application in modern data-centric machine learning pipelines.