Submodular Information Measures (SIM)
- Submodular Information Measures (SIM) are a combinatorial generalization of classical information metrics that replace Shannon entropy with a monotone submodular set function.
- SIMs offer strong theoretical guarantees with efficient greedy algorithms that achieve near-optimal (1-1/e) approximations for tasks like summarization and active learning.
- Applications of SIMs span active learning, privacy filtering, causal inference, and representation learning, often yielding significant performance gains in empirical studies.
A submodular information measure (SIM) is a combinatorial generalization of classical information-theoretic measures, such as entropy, mutual information, and conditional mutual information, in which the foundational role of Shannon entropy is replaced with a general monotone submodular set function. SIMs provide an abstract algebraic framework for modeling information, relevance, coverage, independence, and diversity on arbitrary ground sets, extending beyond random variables to structured data, feature sets, and combinatorial objects. SIMs have deep implications across data subset selection, active learning, summarization, privacy, representation learning, causal inference, and extremal combinatorics.
1. Formal Definitions and Mathematical Structure
Let $V$ be a finite ground set and $f: 2^V \to \mathbb{R}$ a normalized, monotone, submodular set function: $f(\emptyset) = 0$; $f(A) \le f(B)$ whenever $A \subseteq B$; and $f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B)$ for all $A \subseteq B \subseteq V$ and $v \in V \setminus B$.
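These defining properties can be checked exhaustively on a small instance. A minimal sketch with an assumed weighted-coverage function (the cover sets below are illustrative, not from the text):

```python
from itertools import combinations

# Toy coverage function on ground set V = {0, 1, 2}: each element "covers"
# a subset of a universe, and f(A) = |union of covered sets|.
COVERS = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"d"}}
V = set(COVERS)

def f(A):
    return len(set().union(*(COVERS[v] for v in A))) if A else 0

def powerset(s):
    s = list(s)
    return [set(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def is_normalized_monotone_submodular(f, V):
    if f(set()) != 0:                           # normalized: f(empty) = 0
        return False
    for A in powerset(V):
        for B in powerset(V):
            if A <= B:
                if f(A) > f(B):                 # monotone: A ⊆ B ⇒ f(A) ≤ f(B)
                    return False
                for v in V - B:                 # diminishing returns
                    if f(A | {v}) - f(A) < f(B | {v}) - f(B):
                        return False
    return True

print(is_normalized_monotone_submodular(f, V))  # coverage satisfies all three
```

Brute-force verification is exponential in $|V|$, so it is only a didactic check; in practice one relies on the known submodularity of the function class.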
The key submodular information measures are:
- Submodular Conditional Gain (SCG): $f(A \mid B) = f(A \cup B) - f(B)$.
Intuition: the incremental "utility" provided by $A$ beyond $B$.
- Submodular Mutual Information (SMI): $I_f(A; B) = f(A) + f(B) - f(A \cup B)$.
Intuition: the amount of "shared information" or representativeness of $A$ with respect to $B$.
- Submodular Conditional Mutual Information (SCMI): $I_f(A; B \mid C) = f(A \cup C) + f(B \cup C) - f(A \cup B \cup C) - f(C)$.
Equivalently, $I_f(A; B \mid C) = f(A \mid C) - f(A \mid B \cup C)$. Intuition: relevance of $A$ to $B$ penalized by overlap with $C$.
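The three measures are one-liners once $f$ is fixed. A minimal sketch instantiating them with an assumed toy coverage function:

```python
# SCG, SMI, and SCMI instantiated with a toy coverage function
# f(A) = |union of covered elements|. The cover sets are illustrative.
COVERS = {"x": {1, 2}, "y": {2, 3}, "z": {4}}

def f(A):
    return len(set().union(*(COVERS[v] for v in A))) if A else 0

def scg(A, B):                 # f(A | B) = f(A ∪ B) - f(B)
    return f(A | B) - f(B)

def smi(A, B):                 # I_f(A; B) = f(A) + f(B) - f(A ∪ B)
    return f(A) + f(B) - f(A | B)

def scmi(A, B, C):             # I_f(A; B | C) = f(A∪C) + f(B∪C) - f(A∪B∪C) - f(C)
    return f(A | C) + f(B | C) - f(A | B | C) - f(C)

A, B = {"x"}, {"y"}
print(scg(A, B))              # x adds element 1 beyond {2, 3} -> 1
print(smi(A, B))              # x and y share element 2        -> 1
print(scmi(A, B, {"z"}))      # z is disjoint, overlap remains -> 1
```

Note that `scmi(A, B, C)` equals `scg(A, C) - scg(A, B | C)`, matching the equivalent form above.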
These extend immediately to multi-set analogues, total correlation, and composite objectives. For example, the total correlation of $k$ disjoint sets $A_1, \dots, A_k$ is $C_f(A_1, \dots, A_k) = \sum_{i=1}^{k} f(A_i) - f\big(\bigcup_{i=1}^{k} A_i\big)$ (Majee et al., 2023).
When $f$ is the entropy of a collection of random variables, these recover the classical Shannon information measures. For canonical submodular functions such as coverage, facility-location, concave-over-modular, and certain graph-cut-type objectives, SIMs coincide exactly with entropic mutual information under explicit constructions (Iyer, 19 Jan 2026).
2. Theoretical Properties: Axioms and Independence
SIMs inherit critical properties from submodularity (Asnani et al., 2021, Iyer et al., 2020):
- Nonnegativity: $f(A \mid B) \ge 0$ and $I_f(A; B) \ge 0$ for normalized, monotone $f$.
- Symmetry: $I_f(A; B) = I_f(B; A)$.
- Monotonicity: $I_f(A; B)$ is non-decreasing in $A$ for fixed $B$; likewise $f(A \mid B)$ is monotone in $A$.
- Submodularity in One Argument: $I_f(A; B)$ is submodular in $A$ when $f$'s third-order discrete derivatives are non-negative; this holds for facility-location, set cover, concave-over-modular, and some graph-cut functions (Iyer et al., 2020, Kothawade et al., 2021).
- Chain Rule: $f(A \cup B) = f(A) + f(B \mid A)$.
Independence concepts are generalized:
- Joint Independence: $A$ and $B$ are jointly independent if $I_f(A; B) = 0$.
- Pairwise Independence: $A$ and $B$ are pairwise independent if $I_f(\{a\}; \{b\}) = 0$ for all $a \in A$, $b \in B$.
- Multi-set Independence: $A_1, \dots, A_k$ are independent if $C_f(A_1, \dots, A_k) = 0$ (Asnani et al., 2021).
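The axioms and independence notions above can be verified numerically on a small instance. A sketch using an assumed coverage function, where the third item covers a universe disjoint from the other two:

```python
from itertools import combinations

# Exhaustive numeric check of the SIM axioms on a toy coverage function.
COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {9}}   # illustrative cover sets
V = set(COVERS)

def f(A):
    return len(set().union(*(COVERS[v] for v in A))) if A else 0

def scg(A, B):
    return f(A | B) - f(B)

def smi(A, B):
    return f(A) + f(B) - f(A | B)

def subsets(s):
    s = list(s)
    return [set(c) for r in range(len(s) + 1) for c in combinations(s, r)]

for A in subsets(V):
    for B in subsets(V):
        assert smi(A, B) == smi(B, A)           # symmetry
        assert smi(A, B) >= 0                   # nonnegativity of SMI
        assert scg(A, B) >= 0                   # nonnegativity of SCG
        assert f(A | B) == f(A) + scg(B, A)     # chain rule

# pairwise independence: {a} and {c} cover disjoint universes
assert smi({"a"}, {"c"}) == 0
print("all axioms hold on this instance")
```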
These fundamental axioms enable the use of SIMs in combinatorial optimization, privacy, summarization, and learning tasks that require formal guarantees.
3. Canonical Submodular Function Classes and Entropic Correspondence
The most widely used SIMs are grounded in the following classes (Iyer, 19 Jan 2026, Iyer et al., 2020):
| Function family | Definition $f(A)$ | Typical use cases |
|---|---|---|
| Coverage/set-cover | $f(A) = w\big(\bigcup_{a \in A} U_a\big)$, cover sets $U_a$, weights $w$ | Diversity, coverage |
| Facility-location | $f(A) = \sum_{i \in V} \max_{a \in A} s_{ia}$ | Representation, information overlap |
| Graph-cut-type | $f(A) = \sum_{i \in V} \sum_{j \in A} s_{ij} - \lambda \sum_{i, j \in A} s_{ij}$ | Redundancy, separation, clustering |
| Concave-over-mod | $f(A) = \psi(m(A))$, $m$ modular, $\psi$ concave nondecreasing | Robustness, budgeted diversity |
| Log-determinant | $f(A) = \log \det(S_A)$ ($S$ a similarity kernel) | Volume, diversity, uncertainty |
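Two of these families can be sketched in a few lines over an assumed toy similarity matrix `S` (the values are illustrative; items 0 and 1 are built as near-duplicates):

```python
import numpy as np

# Facility-location and log-determinant base functions on a toy kernel S.
S = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.1, 0.0],
              [0.1, 0.1, 1.0, 0.2],
              [0.0, 0.0, 0.2, 1.0]])

def facility_location(A):
    """f(A) = sum_i max_{a in A} S[i, a]: how well A 'represents' every item."""
    if not A:
        return 0.0
    return float(S[:, sorted(A)].max(axis=1).sum())

def log_det(A, eps=1e-6):
    """f(A) = log det(S_A + eps*I): the 'volume' spanned by A (diversity)."""
    if not A:
        return 0.0
    idx = sorted(A)
    sub = S[np.ix_(idx, idx)] + eps * np.eye(len(idx))
    return float(np.linalg.slogdet(sub)[1])

# Items 0 and 1 are near-duplicates: adding 1 after 0 gains little under either f.
print(facility_location({0, 1}) - facility_location({0}))  # small marginal gain
print(facility_location({0, 2}) - facility_location({0}))  # larger gain
```

Both exhibit the diminishing-returns behavior the table associates with representation and diversity: redundant items contribute little marginal value.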
Recent work demonstrates exact entropic constructions: given any such $f$, there exists a random vector $X_V = (X_v)_{v \in V}$ such that $f(A) = H(X_A)$ for all $A \subseteq V$, and all submodular information measures reduce to their classical Shannon counterparts (Iyer, 19 Jan 2026).
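For coverage, one such construction is easy to verify directly: assign each universe element an independent fair coin and let $X_a$ reveal the coins of the elements $a$ covers; then $H(X_A)$ equals the coverage count in bits. A sketch with assumed cover sets, computing the entropy by exact enumeration:

```python
from itertools import product
from math import log2

# Entropic correspondence for coverage: f(A) = |union of covered elements|
# equals the Shannon entropy of X_A, where each universe element carries an
# independent fair coin. Cover sets are illustrative.
COVERS = {"x": {0, 1}, "y": {1, 2}}
UNIVERSE = sorted(set().union(*COVERS.values()))

def f(A):
    return len(set().union(*(COVERS[a] for a in A))) if A else 0

def entropy_bits(A):
    """Exact entropy of X_A = (coins of elements covered by each a in A)."""
    counts = {}
    outcomes = list(product([0, 1], repeat=len(UNIVERSE)))
    for bits in outcomes:                      # enumerate all coin flips
        coin = dict(zip(UNIVERSE, bits))
        view = tuple(tuple(coin[u] for u in sorted(COVERS[a])) for a in sorted(A))
        counts[view] = counts.get(view, 0) + 1
    n = len(outcomes)
    return -sum(c / n * log2(c / n) for c in counts.values())

for A in [{"x"}, {"y"}, {"x", "y"}]:
    print(A, f(A), round(entropy_bits(A), 6))  # entropy matches coverage
```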
4. Optimization Algorithms and Greedy Guarantees
Maximization of any nonnegative, monotone SIM (e.g., $I_f(A; Q)$, $f(A \mid P)$, or $I_f(A; Q \mid P)$ as a function of $A$) under a cardinality constraint admits a $(1 - 1/e)$-approximation via the greedy algorithm, and a $1/2$-approximation under a matroid constraint (Kothawade et al., 2022, Kothawade et al., 2021, Kothawade et al., 2021):
- Initialize $A_0 = \emptyset$.
- For $i = 1$ to $k$:
- For each $v \notin A_{i-1}$, compute the marginal gain, e.g., $I_f(A_{i-1} \cup \{v\}; Q) - I_f(A_{i-1}; Q)$.
- Add the $v$ with maximal marginal gain, setting $A_i = A_{i-1} \cup \{v\}$.
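The greedy loop above can be sketched for SMI with a facility-location base function over an assumed toy similarity matrix (items 0/1 and 3/4 form two similarity clusters):

```python
import numpy as np

# Naive greedy maximization of I_f(A; Q) = f(A) + f(Q) - f(A ∪ Q) with a
# facility-location f. Q is a fixed query set; k items are picked from the rest.
S = np.array([[1.0, 0.9, 0.1, 0.1, 0.0],
              [0.9, 1.0, 0.1, 0.1, 0.0],
              [0.1, 0.1, 1.0, 0.2, 0.1],
              [0.1, 0.1, 0.2, 1.0, 0.7],
              [0.0, 0.0, 0.1, 0.7, 1.0]])

def f(A):
    return float(S[:, sorted(A)].max(axis=1).sum()) if A else 0.0

def smi(A, Q):
    return f(A) + f(Q) - f(A | Q)

def greedy_smi(Q, k):
    A = set()
    pool = [u for u in range(len(S)) if u not in Q]
    for _ in range(k):
        # pick the candidate with the largest marginal SMI gain
        v = max(pool, key=lambda u: smi(A | {u}, Q) - smi(A, Q))
        A.add(v)
        pool.remove(v)
    return A

print(greedy_smi(Q={0, 4}, k=2))  # one representative per query "mode"
```

With a two-mode query set $\{0, 4\}$, greedy picks the nearest neighbor of each mode, illustrating how SMI rewards query coverage rather than raw similarity to a single point.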
Lazy-greedy evaluation and partitioning reduce computational cost, particularly for SMI based on facility-location, whose marginal gains reduce to cached similarity evaluations (Kothawade et al., 2022, Kothawade et al., 2021).
Curvature bounds further tighten the approximation ratio: for total curvature $\kappa_f$, greedy achieves $\frac{1}{\kappa_f}\big(1 - e^{-\kappa_f}\big)$. In practice, facility-location, graph-cut, and log-determinant functions exhibit low curvature, making greedy nearly optimal (Kothawade et al., 2022).
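A toy computation of the total curvature $\kappa_f = 1 - \min_v \frac{f(V) - f(V \setminus \{v\})}{f(\{v\})}$ and the induced greedy ratio, using an assumed facility-location instance:

```python
import numpy as np

# Total curvature of a toy facility-location function: kappa = 0 is modular,
# kappa = 1 is fully curved. The similarity matrix S is illustrative.
S = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.1],
              [0.2, 0.1, 1.0]])

def f(A):
    return float(S[:, sorted(A)].max(axis=1).sum()) if A else 0.0

V = set(range(len(S)))
kappa = 1 - min((f(V) - f(V - {v})) / f({v}) for v in V)
ratio = (1 - np.exp(-kappa)) / kappa   # curvature-dependent greedy guarantee
print(round(kappa, 4), round(ratio, 4))
```

Here the guarantee exceeds the generic $1 - 1/e \approx 0.632$, illustrating why low-curvature functions make greedy nearly optimal.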
5. Applications: Data Selection, Summarization, and Learning
SIMs constitute core objectives in broad machine learning settings:
- Active Learning and Data Discovery: SCG and SMI are used to mine rare or unknown classes by rewarding dissimilarity from labeled sets (SCG) and then intensifying discovery by targeting known hits (SMI/SCMI). Empirically, these approaches dominate baselines on rare-class and OOD selection in image classification and object detection, with 10–15% absolute gains in accuracy for unknowns (Kothawade et al., 2022, Kothawade et al., 2021, Kothawade et al., 2022).
- Targeted Subset Selection: SMI and variants (facility-location, log-det, graph-cut, COM) select samples that optimally trade off query relevance and target coverage. Theoretical bounds guarantee that maximizing SMI under realistic similarity-separation assumptions ensures high query relevance and coverage (Beck et al., 2024, Kothawade et al., 2021).
- Privacy and Fairness: SCMI and its constraints operationalize privacy by enforcing independence from a sensitive set under a user-defined threshold (Asnani et al., 2021, Kaushal et al., 2020). Privacy filters and marginal-independence filters compose efficiently with submodular maximization objectives.
- Summarization and Representation Learning: SIMs unify generic, query-focused, privacy- and update-aware summarization as direct maximizations of SMI, SCG, or CSMI, generalizing models such as ROUGE, DPPs, and graph-cut methods (Kaushal et al., 2020, Kothawade et al., 2021). In representation learning, submodular total correlation losses (e.g., SCoRe framework) simultaneously minimize intra-class variance and inter-class bias, outperforming standard contrastive methods for imbalanced data (Majee et al., 2023).
A selection of empirical results:
| Application | SIM Instantiation | Typical Gain over Baselines | Reference |
|---|---|---|---|
| Active Data Discovery | Fl_cg+mi, Logdet_cg+mi | 10–15% higher accuracy on unknowns | (Kothawade et al., 2022) |
| OOD Avoidance | Fl-CMI, LogDet-CMI | 4–7% accuracy lift | (Kothawade et al., 2022) |
| Targeted TSS | LogdetMI, FL2MI | ~20–30% absolute improvement on rare classes | (Kothawade et al., 2021) |
| Summarization | FL-SMI, GraphCut-SMI, LogDet-SMI | Near human-level V-ROUGE | (Kaushal et al., 2020) |
| Representation Learning | FL-/GC-based $C_f$ | 1–9% boost in class-imbalanced recognition | (Majee et al., 2023) |
6. Extensions: Causal Inference, Information Inequalities, and Advanced Properties
SIMs extend classical independence, conditional independence, and causal Markov properties to non-entropic settings, unifying information-theoretic and combinatorial perspectives. The generalized causal Markov condition for SIMs matches the standard DAG-based independence structure, independent of the choice of submodular $f$ (Steudel et al., 2010).
Unified derivations of information inequalities (Han’s, Shearer’s, monotonicity sequences, total correlation bounds) follow broadly from submodularity. These yield refined combinatorial bounds, e.g., on projection sizes, Boolean influences, and extremal graph properties (Sason, 2022).
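A concrete instance of such an inequality is the Han-type bound $(n-1)\, f(V) \le \sum_{i} f(V \setminus \{i\})$, which follows from submodularity alone. A numeric sketch on an assumed coverage function:

```python
# Han-type inequality (n-1) f(V) <= sum_i f(V \ {i}) for a toy coverage
# function; the cover sets are illustrative.
COVERS = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"e"}}
V = set(COVERS)

def f(A):
    return len(set().union(*(COVERS[v] for v in A))) if A else 0

n = len(V)
lhs = (n - 1) * f(V)                 # (n-1) * total coverage
rhs = sum(f(V - {i}) for i in V)     # sum of leave-one-out coverages
print(lhs, rhs, lhs <= rhs)          # 15 17 True
```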
Recent developments include the study of SIMs for weak submodularity in quadratic estimation and optimal experimental design (alphabetic optimality criteria), where closed-form utility functions (log-det, trace, min-eigenvalue) are directly submodular or enjoy quantifiable approximation via greedy (Hashemi et al., 2019).
7. Modeling Flexibility, Parameterizations, and Practical Considerations
Modern extensions, such as PRISM (Kothawade et al., 2021), introduce multi-parameterized SIMs to interpolate between relevance, diversity, privacy, and coverage. Typical parameters:
- $\lambda$ (graph-cut): relevance vs. diversity
- $\eta$ (facility-location, COM): similarity-to-query trade-off
- $\nu$ (conditional gain): strength of avoidance/penalty toward a private set
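The relevance-vs-diversity effect of the graph-cut parameter can be seen directly by running greedy on $f_\lambda(A) = \sum_{i \in V, j \in A} s_{ij} - \lambda \sum_{i, j \in A} s_{ij}$ for an assumed kernel in which items 0 and 1 are near-duplicates:

```python
import numpy as np

# Graph-cut trade-off: small lambda rewards globally similar ("relevant")
# items; larger lambda penalizes redundancy within the selected set A.
# S is an assumed toy kernel; items 0 and 1 are near-duplicates.
S = np.array([[1.0, 0.95, 0.3],
              [0.95, 1.0, 0.3],
              [0.3, 0.3, 1.0]])

def f(A, lam):
    idx = sorted(A)
    return float(S[:, idx].sum() - lam * S[np.ix_(idx, idx)].sum()) if A else 0.0

def greedy(lam, k=2):
    A, pool = set(), list(range(len(S)))
    for _ in range(k):
        v = max(pool, key=lambda u: f(A | {u}, lam) - f(A, lam))
        A.add(v)
        pool.remove(v)
    return A

print(greedy(lam=0.0))  # pure relevance: picks the near-duplicates 0 and 1
print(greedy(lam=0.6))  # diversity kicks in: picks 0 and the dissimilar item 2
```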
By tuning these, SIMs adapt to a wide regime of problems: rare-class mining, guided summarization, OOD filtering, distributed and scalable optimization.
Several concrete choices are supported with efficient greedy algorithms (Kothawade et al., 2021, Kaushal et al., 2020, Beck et al., 2024), and the entire framework is modality-agnostic—applicable to images, video, text, sensor sets, and gradient embeddings.
Summary Table of Core SIM Formulae
| Name | Formula | Typical Use |
|---|---|---|
| SCG | $f(A \mid P) = f(A \cup P) - f(P)$ | Dissimilarity, novelty |
| SMI | $I_f(A; Q) = f(A) + f(Q) - f(A \cup Q)$ | Relevance, coverage, overlap |
| SCMI | $I_f(A; Q \mid P) = f(A \cup P) + f(Q \cup P) - f(A \cup Q \cup P) - f(P)$ | Targeting under exclusion |
Through their algebraic generality and foundational approximation guarantees, submodular information measures constitute a principled, tractable, and highly expressive toolkit for information-centric decision-making in structured data systems (Kothawade et al., 2022, Asnani et al., 2021, Iyer et al., 2020, Beck et al., 2024, Kothawade et al., 2021).