
Contrastive Log-Ratio Upper Bound (CLUB)

Updated 2 March 2026
  • CLUB upper-bounds mutual information by contrasting conditional log-likelihoods of matched and mismatched data pairs; the bound coincides with the true MI exactly when the variables are independent.
  • It employs a variational approximation (vCLUB) that replaces the unknown conditional p(x|y) with a learned neural network model, admitting closed-form computation in exponential-family cases.
  • The framework improves scalability via negative sampling and demonstrates strong performance in synthetic experiments, representation learning, and domain adaptation.

The Contrastive Log-ratio Upper Bound (CLUB) is a framework for estimating and minimizing mutual information (MI) in high-dimensional scenarios where only samples from the relevant distributions, rather than their explicit forms, are available. CLUB constructs a tight upper bound on MI that admits a variational approximation and enables stable, scalable MI minimization, a task for which lower-bound estimators are inapplicable. The CLUB methodology addresses the estimation bias, computational scalability, and numerical instability that afflict earlier MI upper-bound strategies, thereby facilitating its use in representation learning, domain adaptation, and information bottleneck contexts (Cheng et al., 2020).

1. Formal Definition of CLUB

Let $(X, Y)$ be random variables with joint distribution $p(x, y)$. Mutual information is defined as

$$I(X; Y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)p(y)}\right] = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x \mid y)}{p(x)}\right].$$

The CLUB estimator presumes access to the conditional density $p(x \mid y)$ and defines

$$I_{\mathrm{CLUB}}(X; Y) = \mathbb{E}_{p(x, y)}\left[\log p(x \mid y)\right] - \mathbb{E}_{p(x)p(y)}\left[\log p(x \mid y)\right].$$

Given $N$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^N$, the empirical estimate is

$$\hat I_{\mathrm{CLUB}} = \frac{1}{N}\sum_{i=1}^N \log p(x_i \mid y_i) - \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N \log p(x_j \mid y_i) = \frac{1}{N^2}\sum_{i,j}\left[\log p(x_i \mid y_i) - \log p(x_j \mid y_i)\right].$$

This estimator relies on the explicit evaluation or modeling of $p(x \mid y)$, and operates by contrasting log-likelihoods across matched and mismatched pairs.
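
As a concrete illustration, suppose $p(x \mid y)$ is known exactly. The sketch below uses a toy linear-Gaussian model $x = y + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$, so that $p(x \mid y) = \mathcal{N}(x;\, y,\, 1)$; the model and all names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
y = rng.normal(size=N)
x = y + rng.normal(size=N)               # toy model: p(x | y) = N(x; y, 1)

def log_p_x_given_y(xv, yv):
    # log density of N(xv; yv, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (xv - yv) ** 2

# pairwise matrix with entry (i, j) = log p(x_j | y_i)
ll = log_p_x_given_y(x[None, :], y[:, None])
positive = np.mean(np.diag(ll))          # matched pairs (x_i, y_i)
negative = np.mean(ll)                   # all pairs, approximating p(x)p(y)
I_club = positive - negative

true_mi = 0.5 * np.log(2)                # analytic MI for this model
```

For this model the estimator evaluates to approximately $1$ nat in expectation, above the true MI $\tfrac{1}{2}\log 2 \approx 0.35$: the bound is valid, and the gap reflects the dependence between $X$ and $Y$.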

2. Upper Bound Derivation and Tightness

Define the gap $\Delta = I_{\mathrm{CLUB}}(X; Y) - I(X; Y)$. The derivation proceeds as follows:
$$\Delta = \mathbb{E}_{p(x,y)}[\log p(x)] - \mathbb{E}_{p(x)p(y)}[\log p(x \mid y)] = \mathbb{E}_{p(x)}\left[\log p(x) - \mathbb{E}_{p(y)}[\log p(x \mid y)]\right].$$
By concavity of $\log$ and Jensen's inequality, $\log p(x) = \log \mathbb{E}_{p(y)}[p(x \mid y)] \geq \mathbb{E}_{p(y)}[\log p(x \mid y)]$, guaranteeing that $\Delta \geq 0$. Thus,

$$I(X; Y) \leq I_{\mathrm{CLUB}}(X; Y),$$

with equality if and only if $p(x \mid y)$ does not depend on $y$, i.e., $X \perp Y$. The bound is tight for independent variables and grows looser otherwise; the magnitude of the gap reflects the deviation from independence.
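
The equality condition can be checked numerically in a correlated-Gaussian family, where $p(x \mid y) = \mathcal{N}(x;\, \rho y,\, 1 - \rho^2)$ and the true MI is $-\tfrac{1}{2}\log(1 - \rho^2)$. A minimal sketch, with an illustrative setup not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000

def club_gaussian(rho):
    """CLUB with the exact conditional p(x|y) = N(x; rho*y, 1 - rho^2)."""
    y = rng.normal(size=N)
    x = rho * y + np.sqrt(1 - rho ** 2) * rng.normal(size=N)
    # entry (i, j) = log p(x_j | y_i), up to constants that cancel in the difference
    ll = -0.5 * (x[None, :] - rho * y[:, None]) ** 2 / (1 - rho ** 2)
    return np.mean(np.diag(ll)) - np.mean(ll)

# estimate and true MI for increasing dependence
results = {rho: (club_gaussian(rho), -0.5 * np.log(1 - rho ** 2))
           for rho in (0.0, 0.5, 0.9)}
```

At $\rho = 0$ the log-likelihood matrix is constant across rows, so the estimate vanishes identically; as $\rho$ grows, the estimate stays above the true MI with an increasing gap, matching the analysis above.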

3. Variational Approximation: vCLUB

When $p(x \mid y)$ is unknown or intractable, CLUB is made practical via a variational approximation. A conditional model $q_\theta(x \mid y)$ (typically parameterized by a neural network) is introduced:
$$I_{\mathrm{vCLUB}}(X; Y) = \mathbb{E}_{p(x, y)}[\log q_\theta(x \mid y)] - \mathbb{E}_{p(x)p(y)}[\log q_\theta(x \mid y)],$$
with the sample estimator

$$\hat I_{\mathrm{vCLUB}} = \frac{1}{N^2}\sum_{i,j}\left[\log q_\theta(x_i \mid y_i) - \log q_\theta(x_j \mid y_i)\right].$$

For parametric exponential-family forms, such as Gaussians with mean $\mu_\theta(y)$ and diagonal covariance, this estimator admits closed-form computation based on Mahalanobis distances.
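
For a diagonal-Gaussian $q_\theta$, the normalizing constant of $q_\theta(\cdot \mid y_i)$ is shared by the positive and negative terms indexed by $i$ and cancels, leaving only squared Mahalanobis distances. A minimal sketch of this closed form (function name and array layout are assumptions):

```python
import numpy as np

def vclub_gaussian(x, mu, logvar):
    """vCLUB estimate for q(x|y) = N(x; mu(y), diag(exp(logvar))).

    x:      (N, d) array of samples x_i
    mu:     (N, d) predicted means mu_theta(y_i)
    logvar: (N, d) predicted log-variances for y_i
    """
    var = np.exp(logvar)
    # positive term: matched pairs (x_i, y_i)
    pos = -0.5 * np.mean(np.sum((x - mu) ** 2 / var, axis=1))
    # negative term: all pairs (x_j, y_i); entry (i, j, :) holds x_j - mu(y_i)
    diff = x[None, :, :] - mu[:, None, :]
    neg = -0.5 * np.mean(np.sum(diff ** 2 / var[:, None, :], axis=2))
    return pos - neg

# usage on a toy problem where the optimal q has mu(y) = y and unit variance
rng = np.random.default_rng(1)
y = rng.normal(size=(1000, 1))
x = y + rng.normal(size=(1000, 1))
est = vclub_gaussian(x, mu=y, logvar=np.zeros_like(y))
```

Predicting a log-variance rather than a variance is a common design choice here: it keeps the covariance positive without constrained optimization.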

4. Theoretical Properties and Bias Analysis

The theoretical guarantees of CLUB are as follows:

  • Exactness: $I(X; Y) \leq I_{\mathrm{CLUB}}(X; Y)$, with equality if and only if $X \perp Y$.
  • vCLUB as Upper Bound: Let $q_\theta(x, y) = p(x)\,q_\theta(x \mid y)$. If $\mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y)) \leq \mathrm{KL}(p(x)p(y)\,\|\,q_\theta(x, y))$, then $I(X; Y) \leq I_{\mathrm{vCLUB}}(X; Y)$.
  • Approximation Error: If, in addition, $\mathrm{KL}(p(x \mid y)\,\|\,q_\theta(x \mid y)) \leq \epsilon$ and $\mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y)) > \mathrm{KL}(p(x)p(y)\,\|\,q_\theta(x, y))$, then $|I(X; Y) - I_{\mathrm{vCLUB}}(X; Y)| < \epsilon$.

The practical implication is that accurate variational modeling of $p(x \mid y)$ ensures not only the validity but also the tightness of the vCLUB upper bound relative to the true mutual information.

5. Scalable MI Minimization Training: Negative Sampling Scheme

The original estimator is quadratic in $N$. CLUB–S and vCLUB–S accelerate computation via negative sampling, lowering the complexity to $O(N)$. Training involves two main phases:

  1. Conditional Model Update: Fit $q_\theta(x \mid y)$ to maximize the conditional log-likelihood over data batches.
  2. MI Minimization via Negative Sampling: For each positive pair $(x_i, y_i)$, one negative sample $x_{k_i}$ is drawn, constructing $U_i = \log q_\theta(x_i \mid y_i) - \log q_\theta(x_{k_i} \mid y_i)$. Then,

$$\hat I = \frac{1}{N}\sum_{i=1}^N U_i$$

is minimized with respect to the generative model parameters $\sigma$.

Initialize model p_σ(x, y) and variational approximation q_θ(x|y).
repeat
  Sample a batch {(x_i, y_i)}_{i=1}^N from p_σ.
  // (1) Update q_θ by maximizing the conditional log-likelihood
  L(θ) ← (1/N) ∑_{i=1}^N log q_θ(x_i|y_i).
  θ ← θ + η ∇_θ L(θ).
  // (2) Compute the one-negative vCLUB estimate
  for i = 1, …, N do
    draw k_i ~ Uniform({1, …, N});
    U_i ← log q_θ(x_i|y_i) − log q_θ(x_{k_i}|y_i).
  end for
  Ĩ ← (1/N) ∑_{i=1}^N U_i.
  // (3) Update σ by minimizing Ĩ
  σ ← σ − η ∇_σ Ĩ   (backprop through samples via reparameterization).
until convergence
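
The loop above can be sketched in runnable form for a linear-Gaussian special case, with $q_\theta(x \mid y) = \mathcal{N}(x;\, \theta y,\, 1)$ standing in for the neural network and a fixed dataset standing in for samples from $p_\sigma$ (both simplifications are illustrative; step (3) is omitted because it requires backpropagating through the sampler):

```python
import numpy as np

rng = np.random.default_rng(0)

# fixed toy data standing in for samples from p_sigma: x = 2y + noise
N = 4000
y = rng.normal(size=N)
x = 2.0 * y + rng.normal(size=N)

# (1) fit q_theta(x|y) = N(x; theta*y, 1) by gradient ascent on the
# conditional log-likelihood; d/dtheta [-(x - theta*y)^2 / 2] = (x - theta*y)*y
theta, eta = 0.0, 0.05
for _ in range(200):
    theta += eta * np.mean((x - theta * y) * y)

# (2) one-negative-sample vCLUB estimate
def log_q(xv, yv):
    return -0.5 * (xv - theta * yv) ** 2   # additive constants cancel

k = rng.integers(0, N, size=N)             # one uniform negative index per pair
I_hat = np.mean(log_q(x, y) - log_q(x[k], y))
```

Here $\theta$ converges near the least-squares solution $\theta \approx 2$, and $\hat I$ upper-bounds the true MI of this dataset, $\tfrac{1}{2}\log 5 \approx 0.80$.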

This sampled variant is unbiased and substantially more scalable. The negative sampling strategy improves both statistical and computational characteristics of the bound.
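
The agreement between the sampled and full estimators can be verified directly: on toy Gaussian data with a known conditional (an illustrative setup, not from the paper), the one-negative-sample estimate matches the full $O(N^2)$ estimate up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
y = rng.normal(size=N)
x = y + rng.normal(size=N)            # toy conditional: p(x|y) = N(x; y, 1)

def log_q(xv, yv):
    return -0.5 * (xv - yv) ** 2      # log-density up to a constant that cancels

# full O(N^2) estimate over all (i, j) pairs
ll = log_q(x[None, :], y[:, None])
full = np.mean(np.diag(ll)) - np.mean(ll)

# sampled O(N) estimate: one uniformly drawn negative per positive pair
k = rng.integers(0, N, size=N)
sampled = np.mean(log_q(x, y) - log_q(x[k], y))
```

The sampled variant trades a modest increase in variance for linear cost, which is what makes it practical inside a training loop.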

6. Empirical Evaluation and Practical Performance

CLUB and its variational extensions have undergone validation on both synthetic and real-world benchmarks:

  • Synthetic Estimation: On Gaussian and cubic-transformed Gaussian data ($d = 20$, true MI $\in \{2, 4, 6, 8, 10\}$), CLUB attains the lowest bias and minimum squared error compared to competing lower- and upper-bound MI estimators. CLUB–S incurs slightly higher variance but maintains unbiasedness.
  • Information Bottleneck (MNIST, latent dimension 256): CLUB and vCLUB achieve lower test classification error (approaching $1.06\%$) than DVB (VUB), MINE, NWJ, InfoNCE, and Leave-One-Out estimators. Negative sampling further enhances generalization.
  • Unsupervised Domain Adaptation: In MNIST $\rightarrow$ MNIST-M and USPS $\rightarrow$ MNIST settings, within a disentangled-representation objective minimizing $I(z_c; z_d)$, CLUB–S provides the highest or near-highest target-domain accuracy ($94.6\%$ to $98.9\%$), surpassing lower-bound and prior upper-bound estimators that suffer from numerical instability.

The aggregate findings demonstrate that CLUB delivers a tight, stable, and computationally efficient upper bound for MI estimation and minimization in high-dimensional deep learning tasks (Cheng et al., 2020).

References

Cheng, P., Hao, W., Dai, S., Liu, J., Gan, Z., and Carin, L. (2020). CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information. In Proceedings of the 37th International Conference on Machine Learning (ICML).
