
Contrastive Log-Ratio Upper Bound (CLUB)

Updated 2 March 2026
  • CLUB upper-bounds mutual information by contrasting conditional log-likelihoods of matched and mismatched data pairs; the bound coincides with the true MI exactly when the variables are independent.
  • It employs a variational approximation (vCLUB) that replaces the unknown conditional p(x|y) with a learned neural network model, admitting closed-form computation in exponential-family cases.
  • The framework improves scalability via negative sampling and demonstrates strong performance in synthetic experiments, representation learning, and domain adaptation.

The Contrastive Log-ratio Upper Bound (CLUB) is a framework for estimating and minimizing mutual information (MI) in high-dimensional scenarios where only samples from the relevant distributions, rather than their explicit forms, are available. CLUB constructs a tight upper bound on MI that admits a variational approximation and enables stable, scalable MI minimization, a task for which lower-bound estimators are inapplicable. The CLUB methodology addresses the estimation bias, computational scalability, and numerical instability that afflict earlier MI upper-bound strategies, thereby facilitating its use in representation learning, domain adaptation, and information bottleneck contexts (Cheng et al., 2020).

1. Formal Definition of CLUB

Let $(X, Y)$ be random variables with joint distribution $p(x, y)$. Mutual information is defined as

$$I(X; Y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)p(y)}\right] = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x \mid y)}{p(x)}\right].$$

The CLUB estimator presumes access to the conditional density $p(x \mid y)$ and defines

$$I_{\mathrm{CLUB}}(X; Y) = \mathbb{E}_{p(x, y)}\left[\log p(x \mid y)\right] - \mathbb{E}_{p(x)p(y)}\left[\log p(x \mid y)\right].$$

Given $N$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^N$, the empirical estimate is

$$\hat I_{\mathrm{CLUB}} = \frac{1}{N}\sum_{i=1}^N \log p(x_i \mid y_i) - \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N \log p(x_j \mid y_i) = \frac{1}{N^2}\sum_{i,j}\left[\log p(x_i \mid y_i) - \log p(x_j \mid y_i)\right].$$

This estimator relies on the explicit evaluation or modeling of $p(x \mid y)$, and operates by contrasting log-likelihoods across matched and mismatched pairs.
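
As a concrete illustration, suppose $p(x \mid y)$ is known exactly. The sketch below uses a toy linear-Gaussian model $x = y + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$, so that $p(x \mid y) = \mathcal{N}(x;\, y,\, 1)$; the model and all names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
y = rng.normal(size=N)
x = y + rng.normal(size=N)               # toy model: p(x | y) = N(x; y, 1)

def log_p_x_given_y(xv, yv):
    # log density of N(xv; yv, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (xv - yv) ** 2

# pairwise matrix with entry (i, j) = log p(x_j | y_i)
ll = log_p_x_given_y(x[None, :], y[:, None])
positive = np.mean(np.diag(ll))          # matched pairs (x_i, y_i)
negative = np.mean(ll)                   # all pairs, approximating p(x)p(y)
I_club = positive - negative

true_mi = 0.5 * np.log(2)                # analytic MI for this model
```

For this model the estimator evaluates to approximately $1$ nat in expectation, above the true MI $\tfrac{1}{2}\log 2 \approx 0.35$: the bound is valid, and the gap reflects the dependence between $X$ and $Y$.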

2. Upper Bound Derivation and Tightness

Define the gap $\Delta = I_{\mathrm{CLUB}}(X; Y) - I(X; Y)$. The derivation proceeds as follows:
$$\Delta = \mathbb{E}_{p(x,y)}[\log p(x)] - \mathbb{E}_{p(x)p(y)}[\log p(x \mid y)] = \mathbb{E}_{p(x)}\left[\log p(x) - \mathbb{E}_{p(y)}[\log p(x \mid y)]\right].$$
By concavity of $\log$ and Jensen's inequality, $\log p(x) = \log \mathbb{E}_{p(y)}[p(x \mid y)] \geq \mathbb{E}_{p(y)}[\log p(x \mid y)]$, guaranteeing that $\Delta \geq 0$. Thus,

$$I(X; Y) \leq I_{\mathrm{CLUB}}(X; Y),$$

with equality if and only if $p(x \mid y)$ does not depend on $y$, i.e., $X \perp Y$. The bound is tight for independent variables and grows looser otherwise; the magnitude of the gap reflects the deviation from independence.
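
The equality condition can be checked numerically in a correlated-Gaussian family, where $p(x \mid y) = \mathcal{N}(x;\, \rho y,\, 1 - \rho^2)$ and the true MI is $-\tfrac{1}{2}\log(1 - \rho^2)$. A minimal sketch, with an illustrative setup not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000

def club_gaussian(rho):
    """CLUB with the exact conditional p(x|y) = N(x; rho*y, 1 - rho^2)."""
    y = rng.normal(size=N)
    x = rho * y + np.sqrt(1 - rho ** 2) * rng.normal(size=N)
    # entry (i, j) = log p(x_j | y_i), up to constants that cancel in the difference
    ll = -0.5 * (x[None, :] - rho * y[:, None]) ** 2 / (1 - rho ** 2)
    return np.mean(np.diag(ll)) - np.mean(ll)

# estimate and true MI for increasing dependence
results = {rho: (club_gaussian(rho), -0.5 * np.log(1 - rho ** 2))
           for rho in (0.0, 0.5, 0.9)}
```

At $\rho = 0$ the log-likelihood matrix is constant across rows, so the estimate vanishes identically; as $\rho$ grows, the estimate stays above the true MI with an increasing gap, matching the analysis above.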

3. Variational Approximation: vCLUB

When $p(x \mid y)$ is unknown or intractable, CLUB is made practical via a variational approximation. A conditional model $q_\theta(x \mid y)$ (typically parameterized by a neural network) is introduced:
$$I_{\mathrm{vCLUB}}(X; Y) = \mathbb{E}_{p(x, y)}[\log q_\theta(x \mid y)] - \mathbb{E}_{p(x)p(y)}[\log q_\theta(x \mid y)],$$
with the sample estimator

$$\hat I_{\mathrm{vCLUB}} = \frac{1}{N^2}\sum_{i,j}\left[\log q_\theta(x_i \mid y_i) - \log q_\theta(x_j \mid y_i)\right].$$

For parametric exponential-family forms, such as Gaussians with mean $\mu_\theta(y)$ and diagonal covariance, this estimator admits closed-form computation based on Mahalanobis distances.
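
For a diagonal-Gaussian $q_\theta$, the normalizing constant of $q_\theta(\cdot \mid y_i)$ is shared by the positive and negative terms indexed by $i$ and cancels, leaving only squared Mahalanobis distances. A minimal sketch of this closed form (function name and array layout are assumptions):

```python
import numpy as np

def vclub_gaussian(x, mu, logvar):
    """vCLUB estimate for q(x|y) = N(x; mu(y), diag(exp(logvar))).

    x:      (N, d) array of samples x_i
    mu:     (N, d) predicted means mu_theta(y_i)
    logvar: (N, d) predicted log-variances for y_i
    """
    var = np.exp(logvar)
    # positive term: matched pairs (x_i, y_i)
    pos = -0.5 * np.mean(np.sum((x - mu) ** 2 / var, axis=1))
    # negative term: all pairs (x_j, y_i); entry (i, j, :) holds x_j - mu(y_i)
    diff = x[None, :, :] - mu[:, None, :]
    neg = -0.5 * np.mean(np.sum(diff ** 2 / var[:, None, :], axis=2))
    return pos - neg

# usage on a toy problem where the optimal q has mu(y) = y and unit variance
rng = np.random.default_rng(1)
y = rng.normal(size=(1000, 1))
x = y + rng.normal(size=(1000, 1))
est = vclub_gaussian(x, mu=y, logvar=np.zeros_like(y))
```

Predicting a log-variance rather than a variance is a common design choice here: it keeps the covariance positive without constrained optimization.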

4. Theoretical Properties and Bias Analysis

The theoretical guarantees of CLUB are as follows:

  • Exactness: $I(X; Y) \leq I_{\mathrm{CLUB}}(X; Y)$, with equality if and only if $X \perp Y$.
  • vCLUB as Upper Bound: Let $q_\theta(x, y) = p(x)\,q_\theta(x \mid y)$. If $\mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y)) \leq \mathrm{KL}(p(x)p(y)\,\|\,q_\theta(x, y))$, then $I(X; Y) \leq I_{\mathrm{vCLUB}}(X; Y)$.
  • Approximation Error: If, in addition, $\mathrm{KL}(p(x \mid y)\,\|\,q_\theta(x \mid y)) \leq \epsilon$ and $\mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y)) > \mathrm{KL}(p(x)p(y)\,\|\,q_\theta(x, y))$, then $|I(X; Y) - I_{\mathrm{vCLUB}}(X; Y)| < \epsilon$.

The practical implication is that accurate variational modeling of $p(x \mid y)$ ensures not only the validity but also the tightness of the vCLUB upper bound relative to the true mutual information.

5. Scalable MI Minimization Training: Negative Sampling Scheme

The original estimator is quadratic in $N$. CLUB–S and vCLUB–S accelerate computation via negative sampling, lowering the complexity to $O(N)$. Training involves two main phases:

  1. Conditional Model Update: Fit $q_\theta(x \mid y)$ to maximize the conditional log-likelihood over data batches.
  2. MI Minimization via Negative Sampling: For each positive pair $(x_i, y_i)$, one negative sample $x_{k_i}$ is drawn, constructing $U_i = \log q_\theta(x_i \mid y_i) - \log q_\theta(x_{k_i} \mid y_i)$. Then,

$$\hat I = \frac{1}{N}\sum_{i=1}^N U_i$$

is minimized with respect to the generative model parameters $\sigma$.

Initialize model p_σ(x, y) and variational approximation q_θ(x|y).
repeat
  Sample a batch {(x_i, y_i)}_{i=1}^N from p_σ.
  // (1) Update q_θ by maximizing the conditional log-likelihood
  L(θ) ← (1/N) ∑_{i=1}^N log q_θ(x_i|y_i).
  θ ← θ + η ∇_θ L(θ).
  // (2) Compute the one-negative vCLUB estimate
  for i = 1, …, N do
    draw k_i ~ Uniform({1, …, N});
    U_i ← log q_θ(x_i|y_i) − log q_θ(x_{k_i}|y_i).
  end for
  Ĩ ← (1/N) ∑_{i=1}^N U_i.
  // (3) Update σ by minimizing Ĩ
  σ ← σ − η ∇_σ Ĩ   (backprop through samples via reparameterization).
until convergence
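
The loop above can be sketched in runnable form for a linear-Gaussian special case, with $q_\theta(x \mid y) = \mathcal{N}(x;\, \theta y,\, 1)$ standing in for the neural network and a fixed dataset standing in for samples from $p_\sigma$ (both simplifications are illustrative; step (3) is omitted because it requires backpropagating through the sampler):

```python
import numpy as np

rng = np.random.default_rng(0)

# fixed toy data standing in for samples from p_sigma: x = 2y + noise
N = 4000
y = rng.normal(size=N)
x = 2.0 * y + rng.normal(size=N)

# (1) fit q_theta(x|y) = N(x; theta*y, 1) by gradient ascent on the
# conditional log-likelihood; d/dtheta [-(x - theta*y)^2 / 2] = (x - theta*y)*y
theta, eta = 0.0, 0.05
for _ in range(200):
    theta += eta * np.mean((x - theta * y) * y)

# (2) one-negative-sample vCLUB estimate
def log_q(xv, yv):
    return -0.5 * (xv - theta * yv) ** 2   # additive constants cancel

k = rng.integers(0, N, size=N)             # one uniform negative index per pair
I_hat = np.mean(log_q(x, y) - log_q(x[k], y))
```

Here $\theta$ converges near the least-squares solution $\theta \approx 2$, and $\hat I$ upper-bounds the true MI of this dataset, $\tfrac{1}{2}\log 5 \approx 0.80$.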

This sampled variant is unbiased and substantially more scalable. The negative sampling strategy improves both statistical and computational characteristics of the bound.
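
The agreement between the sampled and full estimators can be verified directly: on toy Gaussian data with a known conditional (an illustrative setup, not from the paper), the one-negative-sample estimate matches the full $O(N^2)$ estimate up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
y = rng.normal(size=N)
x = y + rng.normal(size=N)            # toy conditional: p(x|y) = N(x; y, 1)

def log_q(xv, yv):
    return -0.5 * (xv - yv) ** 2      # log-density up to a constant that cancels

# full O(N^2) estimate over all (i, j) pairs
ll = log_q(x[None, :], y[:, None])
full = np.mean(np.diag(ll)) - np.mean(ll)

# sampled O(N) estimate: one uniformly drawn negative per positive pair
k = rng.integers(0, N, size=N)
sampled = np.mean(log_q(x, y) - log_q(x[k], y))
```

The sampled variant trades a modest increase in variance for linear cost, which is what makes it practical inside a training loop.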

6. Empirical Evaluation and Practical Performance

CLUB and its variational extensions have undergone validation on both synthetic and real-world benchmarks:

  • Synthetic Estimation: On Gaussian and cubic-transformed Gaussian data ($d = 20$, true MI $\in \{2, 4, 6, 8, 10\}$), CLUB attains the lowest bias and minimum squared error compared to competing lower- and upper-bound MI estimators. CLUB–S incurs slightly higher variance but maintains unbiasedness.
  • Information Bottleneck (MNIST, latent dimension 256): CLUB and vCLUB achieve lower test classification error (approaching $1.06\%$) than DVB (VUB), MINE, NWJ, InfoNCE, and Leave-One-Out estimators. Negative sampling further enhances generalization.
  • Unsupervised Domain Adaptation: In MNIST $\rightarrow$ MNIST-M and USPS $\rightarrow$ MNIST settings, within a disentangled-representation objective minimizing $I(z_c; z_d)$, CLUB–S provides the highest or near-highest target-domain accuracy ($94.6\%$ to $98.9\%$), surpassing lower-bound and prior upper-bound estimators that suffer from numerical instability.

The aggregate findings demonstrate that CLUB delivers a tight, stable, and computationally efficient upper bound for MI estimation and minimization in high-dimensional deep learning tasks (Cheng et al., 2020).

References

Cheng, P., Hao, W., Dai, S., Liu, J., Gan, Z., and Carin, L. (2020). CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information. In Proceedings of the 37th International Conference on Machine Learning (ICML).
