Submodular Conditional Mutual Information
- SCMI is a generalization of classical conditional mutual information, replacing Shannon entropy with a submodular set function to quantify dependencies across various objects.
- It unifies entropic, algorithmic, and combinatorial notions by leveraging properties like non-negativity, symmetry, and the chain rule, making it suitable for tasks such as causal inference and summarization.
- SCMI supports scalable optimization through greedy algorithms and practical instantiations like facility-location and set cover, enhancing machine learning applications in active learning and clustering.
Submodular Conditional Mutual Information (SCMI) is a generalization of classical conditional mutual information in which a normalized, monotone, submodular set function takes the role of Shannon entropy. SCMI enables the quantification of dependence and information flow among arbitrary collections of objects—such as random variables, strings, sets, or more abstract combinatorial entities—by leveraging the structure and properties of submodularity. It unifies entropic, algorithmic, and combinatorial notions of information, and underpins modern algorithmic frameworks for causal inference, summarization, and query-driven subset selection in both classical and non-i.i.d. settings (Steudel et al., 2010, Kothawade et al., 2021, Iyer et al., 2020).
1. Formal Definition
Given a finite ground set $V$ and a normalized ($f(\varnothing) = 0$), monotone, submodular set function $f: 2^V \to \mathbb{R}_{\ge 0}$, the submodular conditional mutual information of subsets $A, B, C \subseteq V$ is defined as
$$I_f(A; B \mid C) = f(A \cup C) + f(B \cup C) - f(A \cup B \cup C) - f(C).$$
This measures the incremental value of jointly considering $A$ and $B$ given $C$, over and above their individual contributions conditioned on $C$ (Steudel et al., 2010, Kothawade et al., 2021, Iyer et al., 2020).
If $f$ is the Shannon entropy of the underlying random variables, this recovers the standard conditional mutual information: $I_f(A; B \mid C) = H(X_A, X_C) + H(X_B, X_C) - H(X_A, X_B, X_C) - H(X_C) = I(X_A; X_B \mid X_C)$. If $f$ is the Kolmogorov complexity (up to additive logarithmic terms), this recovers algorithmic conditional mutual information for finite strings (Steudel et al., 2010).
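As a minimal illustration, the definition can be evaluated mechanically for any normalized set function. The sketch below (Python) uses a toy weighted-coverage function as $f$; the function names and data are illustrative, not drawn from the cited papers.

```python
def scmi(f, A, B, C):
    """I_f(A; B | C) = f(A ∪ C) + f(B ∪ C) - f(A ∪ B ∪ C) - f(C)."""
    A, B, C = set(A), set(B), set(C)
    return f(A | C) + f(B | C) - f(A | B | C) - f(C)

# Toy submodular function: weighted coverage of "concepts".
concepts = {1: {"a"}, 2: {"a", "b"}, 3: {"b", "c"}, 4: {"c", "d"}}
weights = {"a": 1.0, "b": 2.0, "c": 1.5, "d": 0.5}

def coverage(S):
    covered = set().union(*(concepts[x] for x in S)) if S else set()
    return sum(weights[u] for u in covered)

# A and B both cover concept "b", which C does not cover, so I_f = 2.0.
print(scmi(coverage, {1, 2}, {3}, {4}))
```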
2. Theoretical Properties
The axiomatic submodularity, monotonicity, and normalization of $f$ endow SCMI with several key properties (Steudel et al., 2010, Iyer et al., 2020, Kothawade et al., 2021):
- Non-negativity: $I_f(A; B \mid C) \ge 0$ for all $A, B, C \subseteq V$.
- Symmetry: $I_f(A; B \mid C) = I_f(B; A \mid C)$.
- Chain Rule: For any $A, B, C, D \subseteq V$:
  $$I_f(A; B \cup D \mid C) = I_f(A; B \mid C) + I_f(A; D \mid B \cup C).$$
- Monotonicity/Submodularity: For any fixed $B, C \subseteq V$, the map $A \mapsto I_f(A; B \mid C)$ is monotone and, for functions $f$ satisfying a second-order supermodularity condition, also submodular (Iyer et al., 2020).
- Data-Processing Inequality: If $B' \subseteq B$, then $I_f(A; B' \mid C) \le I_f(A; B \mid C)$ for all $A, C \subseteq V$; discarding elements of one argument cannot increase the shared information.
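These identities can be sanity-checked numerically for a concrete monotone submodular function. A small sketch (Python) using a facility-location function over random toy similarities; the instance is illustrative only.

```python
import random

random.seed(0)
V = range(6)
sim = {(i, j): random.random() for i in V for j in V}  # toy similarity scores

def f(S):
    """Facility-location function: sum_i max_{j in S} sim[i, j]."""
    return sum(max(sim[i, j] for j in S) for i in V) if S else 0.0

def scmi(A, B, C):
    return f(A | C) + f(B | C) - f(A | B | C) - f(C)

A, B, C, D = {0, 1}, {2}, {3}, {4, 5}
assert scmi(A, B, C) >= -1e-9                          # non-negativity
assert abs(scmi(A, B, C) - scmi(B, A, C)) < 1e-9       # symmetry
# chain rule: I(A; B ∪ D | C) = I(A; B | C) + I(A; D | B ∪ C)
assert abs(scmi(A, B | D, C) - (scmi(A, B, C) + scmi(A, D, B | C))) < 1e-9
# monotonicity in the first argument: enlarging A cannot lose information
assert scmi(A | {5}, B, C) + 1e-9 >= scmi(A, B, C)
print("all checks passed on this instance")
```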
SCMI thus extends the operational and combinatorial toolkit available for classical mutual information, encompassing settings defined by combinatorial optimization, algorithmic information theory, and non-stochastic data (Steudel et al., 2010, Iyer et al., 2020).
3. Representative Instantiations
Many widely used submodular functions yield concrete, tractable SCMI expressions. Key examples include (Kothawade et al., 2021, Iyer et al., 2020):
| Model | Submodular Function $f(A)$ | SCMI $I_f(A; B \mid C)$ |
|---|---|---|
| Facility-Location | $\sum_{i \in V} \max_{j \in A} s_{ij}$ | $\sum_{i \in V} \max\big(\min(\max_{j \in A} s_{ij}, \max_{j \in B} s_{ij}) - \max_{j \in C} s_{ij},\ 0\big)$ |
| Set Cover | $w\big(\bigcup_{a \in A} U_a\big)$ | $w\big((\mathcal{U}(A) \cap \mathcal{U}(B)) \setminus \mathcal{U}(C)\big)$, with $\mathcal{U}(X) = \bigcup_{a \in X} U_a$ |
| Graph Cut | $\sum_{i \in V} \sum_{j \in A} s_{ij} - \lambda \sum_{i, j \in A} s_{ij}$ | See (Iyer et al., 2020) for the explicit formula |

Here $s_{ij}$ is a pairwise similarity score, $U_a$ is the set of concepts covered by element $a$, $w$ is a non-negative concept weight, and $\lambda$ is a trade-off parameter.
These models permit the application of SCMI to information summarization, feature selection, and batch active learning, among other tasks (Iyer et al., 2020, Kothawade et al., 2021).
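As a concrete example, the facility-location row of the table can be evaluated either from the general definition or from its closed form. The sketch below (Python, with toy random similarities; names are illustrative) verifies that the two agree.

```python
import random

random.seed(1)
n = 8
s = [[random.random() for _ in range(n)] for _ in range(n)]  # toy similarity matrix

def best(i, S):
    """max_{j in S} s[i][j], taking the empty maximum as 0."""
    return max((s[i][j] for j in S), default=0.0)

def fl(S):
    """Facility-location function f(S) = sum_i max_{j in S} s[i][j]."""
    return sum(best(i, S) for i in range(n))

def fl_scmi_closed(A, B, C):
    """Closed-form facility-location SCMI from the table above."""
    return sum(max(min(best(i, A), best(i, B)) - best(i, C), 0.0) for i in range(n))

A, B, C = {0, 1}, {2, 3}, {4}
via_definition = fl(A | C) + fl(B | C) - fl(A | B | C) - fl(C)
assert abs(fl_scmi_closed(A, B, C) - via_definition) < 1e-9
print(fl_scmi_closed(A, B, C))
```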
4. Computational Estimation and Algorithms
In practice, SCMI is often estimated using surrogate functions tailored to the application domain or empirical data constraints. For algorithmic information, computable proxies based on data compression schemes, such as Lempel–Ziv complexity or grammar-based representations, serve as practical alternatives to uncomputable Kolmogorov complexity (Steudel et al., 2010). For combinatorial models, facility-location, set cover, and graph-cut instantiations are directly evaluable from the underlying similarity or coverage structures (Kothawade et al., 2021, Iyer et al., 2020).
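As a deliberately crude illustration of the compression-based surrogate idea (not the specific estimator of Steudel et al., 2010), zlib-compressed lengths can stand in for Kolmogorov complexity, with concatenation playing the role of the set union:

```python
import zlib

def clen(s: bytes) -> int:
    """Compressed length in bytes; a rough stand-in for Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

def algo_scmi(a: bytes, b: bytes, c: bytes) -> int:
    """Compression proxy for I(A; B | C): C(ac) + C(bc) - C(abc) - C(c)."""
    return clen(a + c) + clen(b + c) - clen(a + b + c) - clen(c)

x = b"the quick brown fox jumps over the lazy dog " * 20
y = b"the quick brown fox " * 20   # shares structure with x
z = b"0123456789 " * 20            # unrelated side information

# The first value is typically much larger than the second; note that such
# proxies are only approximately non-negative.
print(algo_scmi(x, y, z), algo_scmi(x, z, y))
```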
Optimization tasks involving SCMI, such as maximizing $I_f(A; B \mid C)$ over $A$ under cardinality or knapsack constraints, leverage the monotone submodular structure to enable greedy, lazy-greedy, or stochastic-greedy algorithms with provable $(1 - 1/e)$-approximation guarantees when the underlying function satisfies the appropriate second-order supermodularity conditions (Kothawade et al., 2021, Iyer et al., 2020).
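A minimal plain-greedy sketch for maximizing SCMI under a cardinality budget, instantiated here with the facility-location SCMI; the variable names, random similarities, and the sets Q (query) and P (conditioning) are illustrative, and practical systems would prefer lazy or stochastic variants.

```python
import random

random.seed(2)
n = 30
s = [[random.random() for _ in range(n)] for _ in range(n)]   # toy similarities
Q, P = {0, 1, 2}, {3, 4}                                      # query / conditioning sets
ground = [v for v in range(n) if v not in Q | P]

def best(i, S):
    return max((s[i][j] for j in S), default=0.0)

def fl_scmi(A, B, C):
    """Facility-location SCMI I_f(A; B | C) in closed form."""
    return sum(max(min(best(i, A), best(i, B)) - best(i, C), 0.0) for i in range(n))

def greedy(budget):
    """Plain greedy: repeatedly add the element with the largest marginal SCMI gain."""
    A = set()
    for _ in range(budget):
        current = fl_scmi(A, Q, P)
        v_star, gain = max(((v, fl_scmi(A | {v}, Q, P) - current)
                            for v in ground if v not in A), key=lambda t: t[1])
        if gain <= 0:
            break
        A.add(v_star)
    return A

print(greedy(budget=5))
```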
5. Connections to Classical and Algorithmic Information Theory
SCMI generalizes both Shannon-theoretic and algorithmic notions of conditional mutual information (Steudel et al., 2010). When $f$ is Shannon entropy, SCMI reduces to ordinary conditional mutual information for random variables. When $f$ is Kolmogorov complexity, SCMI yields algorithmic CMI for deterministic objects, modulo additive constants. Other choices of $f$ model gain, redundancy, and coverage over sets, generalizing information-theoretic concepts to arbitrary combinatorial domains (Iyer et al., 2020).
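The Shannon reduction can be verified on a small empirical example: take $f$ to be the joint entropy of a toy discrete distribution and compare the set-function expression with the textbook conditional mutual information. The samples below are made up for illustration.

```python
import math
from collections import Counter

# Toy joint samples of three binary variables (indices 0=X, 1=Y, 2=Z).
samples = [(0, 0, 0), (0, 0, 0), (0, 1, 1), (1, 1, 1),
           (1, 0, 1), (1, 1, 0), (0, 1, 1), (1, 1, 1)]
n = len(samples)

def H(indices):
    """Empirical joint entropy (bits) of the variables at the given indices."""
    if not indices:
        return 0.0
    counts = Counter(tuple(row[i] for i in sorted(indices)) for row in samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def scmi(A, B, C):
    return H(A | C) + H(B | C) - H(A | B | C) - H(C)

def cmi_direct(X=0, Y=1, Z=2):
    """Textbook I(X; Y | Z) computed from the empirical joint distribution."""
    pxyz, pz = Counter(samples), Counter(r[Z] for r in samples)
    pxz = Counter((r[X], r[Z]) for r in samples)
    pyz = Counter((r[Y], r[Z]) for r in samples)
    return sum((c / n) * math.log2(c * pz[z] / (pxz[x, z] * pyz[y, z]))
               for (x, y, z), c in pxyz.items())

assert abs(scmi({0}, {1}, {2}) - cmi_direct()) < 1e-9
print(scmi({0}, {1}, {2}))
```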
SCMI thereby provides a unified abstraction encompassing stochastic, deterministic, and combinatorially structured data, admitting direct analogues to chain rules, conditional independence, and data-processing inequalities (Steudel et al., 2010, Iyer et al., 2020).
6. Applications in Machine Learning and Causal Inference
SCMI underpins a diverse array of optimization-based machine learning tasks (Iyer et al., 2020, Kothawade et al., 2021):
- Causal Discovery: SCMI-based conditional independence tests can be directly integrated into constraint-based algorithms such as PC or FCI, applicable to arbitrary objects, including strings and time-series, using compression-based estimators (Steudel et al., 2010).
- Active Learning: SIMILAR, an SCMI-based batch active learning framework, targets rare classes, avoids redundancy, and filters out out-of-distribution data by selecting points maximizing SCMI with respect to task-specific label and exclusion sets. The underlying optimization leverages the submodularity and monotonicity of SCMI, ensuring efficient and scalable acquisition (Kothawade et al., 2021).
- Data Summarization and Diversification: Maximizing SCMI captures relevance to query sets while enforcing diversity with respect to auxiliary sets; privacy-preserving summarization is formulated by subtracting SCMI terms corresponding to sensitive sets (Iyer et al., 2020).
- Partitioning and Clustering: Multi-way generalizations of SCMI (total correlation, k-way mutual information) support clustering, robust partitioning, and submodular welfare problems by quantifying dependence among groupings of elements (Iyer et al., 2020).
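As a small illustration of the multi-way case in the last item, the k-way submodular total correlation $\sum_i f(A_i) - f(\bigcup_i A_i)$ scores how much information a partition's blocks share; the coverage function and partitions below are toy data, and lower scores indicate blocks that share less.

```python
# Toy coverage function over "concepts" (illustrative data only).
concepts = {1: {"a"}, 2: {"a"}, 3: {"b"}, 4: {"b"}, 5: {"c"}, 6: {"c"}}

def coverage(S):
    return len(set().union(*(concepts[x] for x in S))) if S else 0

def total_correlation(parts):
    """k-way submodular total correlation: sum_i f(A_i) - f(union_i A_i)."""
    return sum(coverage(p) for p in parts) - coverage(set().union(*parts))

good = [{1, 2}, {3, 4}, {5, 6}]   # blocks do not share concepts
bad = [{1, 3}, {2, 5}, {4, 6}]    # shared concepts are split across blocks
print(total_correlation(good), total_correlation(bad))   # 0 vs. 3
```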
7. Illustrative Examples and Empirical Surrogates
Examples showcase the breadth of SCMI applicability:
- For period-length data, taking $f(A)$ to be the logarithm of the least common multiple of the period lengths in $A$, the SCMI expression simplifies to differences of logarithmic GCDs, exposing new and shared periodicities directly in the combinatorial domain (Steudel et al., 2010); see the sketch after this list.
- For sequences of texts (e.g., translations), Lempel–Ziv-based SCMI proxies empirically verify conditional independence chains, facilitating recovery of causal order via constraint-based algorithms (Steudel et al., 2010).
- In active learning datasets, facility-location and set-cover SCMI instantiations yield closed forms for batch selection that jointly balance query relevance, diversity, and redundancy avoidance (Kothawade et al., 2021).
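A minimal sketch of the period-length example, assuming (consistent with the logarithmic-GCD reading above) that $f(A)$ is the log of the least common multiple of the period lengths; the specific periods are illustrative:

```python
import math
from functools import reduce

periods = {1: 6, 2: 10, 3: 4}   # toy period lengths

def f(S):
    """f(A) = log lcm of the period lengths in A (0 on the empty set)."""
    return math.log(reduce(math.lcm, (periods[x] for x in S), 1))

def scmi(A, B, C):
    return f(A | C) + f(B | C) - f(A | B | C) - f(C)

# Periods 6 and 10 share the factor 2, so I(1; 2) = log 2 ≈ 0.693.
print(scmi({1}, {2}, set()))
# Conditioning on period 4 already accounts for that factor, so the CMI drops to 0.
print(scmi({1}, {2}, {3}))
```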
The practical significance of SCMI derives from both its theoretical guarantees—reflecting foundational information-theoretic properties—and its flexibility, enabling scalable, adaptive algorithms across diverse, structurally complex data regimes (Steudel et al., 2010, Kothawade et al., 2021, Iyer et al., 2020).