Generalization Guarantees for Multi-View Representation Learning and Application to Regularization via Gaussian Product Mixture Prior
(2504.18455v1)
Published 25 Apr 2025 in stat.ML, cs.IT, cs.LG, and math.IT
Abstract: We study the problem of distributed multi-view representation learning. In this problem, each of $K$ agents observes a distinct, possibly statistically correlated, view and independently extracts from it a suitable representation, such that a decoder that receives all $K$ representations correctly estimates the hidden label. In the absence of any explicit coordination between the agents, a central question is: what should each agent extract from its view that is necessary and sufficient for correct estimation at the decoder? In this paper, we investigate this question from a generalization error perspective. First, we establish several generalization bounds in terms of the relative entropy between the distribution of the representations extracted from the training and "test" datasets and a data-dependent symmetric prior, i.e., the Minimum Description Length (MDL) of the latent variables for all views and both the training and test datasets. Then, we use the obtained bounds to devise a regularizer and investigate in depth the question of selecting a suitable prior. In particular, we show, and illustrate through experiments, that data-dependent Gaussian mixture priors with judiciously chosen weights lead to good performance. For single-view settings (i.e., $K=1$), our experimental results outperform the existing prior art, namely the Variational Information Bottleneck (VIB) and Category-Dependent VIB (CDVIB) approaches. Interestingly, we show that a weighted attention mechanism emerges naturally in this setting. Finally, for the multi-view setting, we show that choosing the joint prior as a Gaussian product mixture induces a Gaussian mixture marginal prior for each view and implicitly encourages the agents to extract and output redundant features, a finding which is somewhat counter-intuitive.
Summary
The paper derives MDL-based generalization bounds for distributed multi-view learning, showing that redundancy in latent representations can improve generalization.
It introduces MDL-inspired regularizers using Gaussian mixture priors, including a joint regularizer that efficiently aggregates distributed view statistics.
Experimental validation on datasets like CIFAR and USPS demonstrates that the proposed methods outperform standard VIB approaches in test accuracy.
This paper investigates the problem of distributed multi-view representation learning from a generalization error perspective. In this setup, $K$ agents (encoders) each observe one of the distinct views $X_1, \ldots, X_K$ of potentially correlated data. Each agent independently computes a representation, yielding $U_1, \ldots, U_K$. These representations are then sent to a central decoder which predicts a label $Y$. A key challenge is that the encoders cannot explicitly coordinate, yet their combined representations must be sufficient for accurate prediction while ensuring good generalization to unseen data.
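As a concrete, minimal illustration of this pipeline, the setup could be wired up as follows. This is only a sketch: the architecture, dimensions, and class names are assumptions for illustration, not the paper's actual CNN/ResNet encoders.

```python
# Minimal sketch of the distributed multi-view pipeline: K independent encoders map their
# views to stochastic representations U_k, which a central decoder combines to predict Y.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class MultiViewModel(nn.Module):
    def __init__(self, view_dim=128, latent_dim=16, num_classes=10, num_views=2):
        super().__init__()
        # One independent encoder per view; each outputs a mean and a log-variance.
        self.encoders = nn.ModuleList(
            [nn.Linear(view_dim, 2 * latent_dim) for _ in range(num_views)])
        self.decoder = nn.Linear(num_views * latent_dim, num_classes)

    def forward(self, views):                       # views: list of K tensors (B, view_dim)
        latents = []
        for enc, x in zip(self.encoders, views):
            mu, log_var = enc(x).chunk(2, dim=-1)
            u = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterized sample U_k
            latents.append(u)
        return self.decoder(torch.cat(latents, dim=-1))            # label prediction
```

The per-view reparameterized sampling is what makes the latents $U_k$ stochastic; the MDL-based regularizers discussed below act on the distribution of these latents.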
The paper makes several contributions:
Generalization Bounds based on Minimum Description Length (MDL):
It establishes generalization error bounds for the multi-view setting. Unlike prior work often based on the Information Bottleneck (IB) principle (which relates generalization to mutual information, a link questioned by recent research), these bounds depend on the Minimum Description Length (MDL) of the latent representations $U_1, \ldots, U_K$ relative to a symmetric prior distribution $Q$.
The first bound (Theorem 1, adapted from prior work) shows that the generalization error scales roughly as $\sqrt{\mathrm{MDL}(Q)/n}$, where $n$ is the dataset size.
A tighter bound (Theorem 4) is derived that decays faster, approximately as $\mathrm{MDL}(Q)/n$ in the realizable case. This bound is stated in terms of the Jensen-Shannon divergence $h_D$ between the training and test errors; the two rates are contrasted schematically below.
A third bound (Theorem 5) explicitly separates the contributions of the marginal MDL (for each view) and the joint MDL (across views). It shows that the joint MDL term enters negatively, implying that statistical correlation (redundancy) among the latent variables $U_k$ can improve generalization, contrasting with naive approaches that might favor complementary features.
Tail bounds (Theorem 6) and lossy versions of the bounds (Section 3.3) are also discussed; the lossy versions remain applicable even for deterministic encoders.
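The contrast between the two rates can be sketched as follows. This is a schematic rendering only, writing $\operatorname{gen}$ for the generalization error and omitting constants, confidence terms, and the exact divergence-based form; Theorems 1 and 4 of the paper give the precise statements.

```latex
% Schematic comparison of the two rates (constants, confidence terms, and the exact
% divergence-based form omitted; see Theorems 1 and 4 for the precise statements).
\[
\underbrace{\operatorname{gen}(W) \;\lesssim\; \sqrt{\frac{\operatorname{MDL}(Q)}{n}}}_{\text{Theorem 1 (square-root rate)}}
\qquad \text{vs.} \qquad
\underbrace{\operatorname{gen}(W) \;\lesssim\; \frac{\operatorname{MDL}(Q)}{n}}_{\text{Theorem 4, realizable case (fast rate)}}
\]
```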
MDL-Inspired Regularizers using Gaussian Mixture Priors:
Motivated by the bounds, the paper proposes using MDL(Q) as a regularizer during training to improve generalization.
To overcome challenges like needing a data-dependent prior Q and handling high-dimensional/distributed data, it proposes modeling the prior Q using Gaussian mixtures.
Single-View Case (K=1):
A method is developed to learn a category-dependent Gaussian mixture prior $Q_c = \sum_m \alpha_{c,m} \mathcal{N}(\mu_{c,m}, \operatorname{diag}(\sigma_{c,m}^2))$ alongside the main training objective.
It presents "lossless" and "lossy" approaches. The parameters $(\alpha, \mu, \sigma)$ of the mixture components are updated iteratively based on the current latent variables produced by the encoder for a mini-batch, resembling the steps of an EM algorithm but adapted for regularization (Eqs. 8-10).
The resulting regularizer approximates $D_{\mathrm{KL}}(P_{U|X,W_e} \,\|\, Q_Y)$ (Eqs. 11, 12).
Interestingly, the update mechanism in the lossy version (Eq. 13) resembles a weighted self-attention mechanism in which mixture components "attend" to latent variables; a minimal sketch of this responsibility-weighted update is given below.
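The following is a minimal, hypothetical sketch of the single-view GM-MDL idea: the prior is a category-dependent Gaussian mixture whose parameters are refreshed from the mini-batch latents via responsibility (softmax) weights, so that components "attend" to latent variables. Function and variable names, and the exact update rule, are illustrative assumptions rather than the authors' reference implementation.

```python
# Sketch of a GM-MDL-style regularizer with an EM-like, attention-weighted prior refresh.
import math
import torch
import torch.nn.functional as F

def log_gauss(u, mu, log_var):
    """log N(u; mu, diag(exp(log_var))). Broadcasts u (B, 1, d) against mu/log_var (M, d)."""
    return -0.5 * (((u - mu) ** 2) / log_var.exp()
                   + log_var + math.log(2.0 * math.pi)).sum(-1)          # -> (B, M)

def gm_mdl_regularizer(u, log_alpha, mu, log_var):
    """u: (B, d) latents of one category; mixture prior: log_alpha (M,), mu/log_var (M, d)."""
    log_joint = log_alpha + log_gauss(u.unsqueeze(1), mu, log_var)       # (B, M)
    # Description-length surrogate: negative average log-likelihood under the mixture prior.
    reg = -torch.logsumexp(log_joint, dim=1).mean()

    with torch.no_grad():  # EM-like, responsibility-weighted refresh of the prior parameters
        resp = F.softmax(log_joint, dim=1)                               # (B, M) responsibilities
        nk = resp.sum(0) + 1e-8
        new_log_alpha = torch.log(nk / u.shape[0])
        new_mu = (resp.t() @ u) / nk.unsqueeze(1)                        # (M, d)
        diff = u.unsqueeze(1) - new_mu                                   # (B, M, d)
        new_log_var = torch.log((resp.unsqueeze(-1) * diff ** 2).sum(0)
                                / nk.unsqueeze(1) + 1e-8)
    return reg, (new_log_alpha, new_mu, new_log_var)
```

In training, the returned regularizer would typically be added to the task loss with a small weight, and the refreshed parameters would overwrite the prior between mini-batches.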
Multi-View Case (K>1):
A naive "marginals-only" regularizer, $\sum_k \mathrm{Regularizer}(Q_k)$, is discussed but criticized for penalizing redundancy.
The core proposal is a joint regularizer based on a Gaussians Product Mixture Prior:
$Q_c = \sum_{m^K \in [M]^K} \alpha_{c,m^K} \prod_{k=1}^{K} Q_{c,k,m_k}$, where $Q_{c,k,m_k} = \mathcal{N}(\mu_{c,k,m_k}, \operatorname{diag}(\sigma_{c,k,m_k}^2))$.
This structure ensures that the marginal prior for each view is still a Gaussian mixture: marginalizing out the other views yields $Q_{c,k} = \sum_{m_k \in [M]} \beta_{c,k,m_k}\, Q_{c,k,m_k}$ with weights $\beta_{c,k,m_k} = \sum_{m^K : (m^K)_k = m_k} \alpha_{c,m^K}$.
Crucially, it captures joint statistics and penalizes redundancy less severely than the marginals-only approach (shown theoretically in Appendix B.2).
An efficient, distributed update mechanism is provided (Appendix C): the clients compute marginal statistics/KL terms, the server aggregates these to update the joint coefficients $\alpha_{c,m^K}$ and compute the regularizer value (Eq. 18), and the clients update their local prior components $(\mu_{c,k,m}, \sigma_{c,k,m})$ and encoders.
The update involves a form of distributed weighted attention across views and mixture components; a sketch of this client/server aggregation is given below.
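The following is a hypothetical sketch of the client/server flow, in the spirit of Appendix C and Eq. (18), under the simplifying assumptions that each client transmits a $(B, M)$ matrix of per-component Gaussian log-likelihoods for its view and that the server enumerates the $M^K$ joint indices directly (feasible for small $M$ and $K$). All names and the exact aggregation rule are illustrative assumptions, not the authors' code.

```python
# Sketch of distributed GPM-MDL aggregation: clients send per-component log-likelihoods,
# the server combines them across views and joint mixture indices m^K.
import itertools
import math
import torch
import torch.nn.functional as F

def client_log_likelihoods(u_k, mu_k, log_var_k):
    """u_k: (B, d) latents of view k; mu_k, log_var_k: (M, d) local prior components."""
    diff = u_k.unsqueeze(1) - mu_k                                       # (B, M, d)
    return -0.5 * (diff ** 2 / log_var_k.exp() + log_var_k
                   + math.log(2.0 * math.pi)).sum(-1)                    # -> (B, M)

def server_aggregate(view_lls, log_alpha_joint):
    """view_lls: list of K tensors (B, M); log_alpha_joint: (M**K,) log-weights over m^K."""
    K, (B, M) = len(view_lls), view_lls[0].shape
    joint = torch.empty(B, M ** K)
    for j, mK in enumerate(itertools.product(range(M), repeat=K)):
        # Product of per-view Gaussian components = sum of per-view log-likelihoods.
        joint[:, j] = log_alpha_joint[j] + sum(view_lls[k][:, mK[k]] for k in range(K))
    # Joint description-length surrogate (the regularizer value, reported to the clients).
    regularizer = -torch.logsumexp(joint, dim=1).mean()

    # "Distributed attention": responsibilities over joint indices, marginalized per view,
    # are returned so each client can refresh its local (mu, sigma) components.
    resp = F.softmax(joint, dim=1).view(B, *([M] * K))                   # (B, M, ..., M)
    per_view_resp = []
    for k in range(K):
        dims = [d + 1 for d in range(K) if d != k]
        per_view_resp.append(resp.sum(dim=dims) if dims else resp)       # (B, M)
    return regularizer, per_view_resp
```

The key point the sketch tries to convey is that, thanks to the product structure of the prior, the server never needs the raw latents: per-view log-likelihood matrices suffice to evaluate the joint regularizer and to route attention-like weights back to each client.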
Experimental Validation:
Experiments were conducted on CIFAR10, CIFAR100, USPS, and INTEL datasets using CNN and ResNet18 encoders. Multi-view data was simulated by applying various distortions (noise, transformations, occlusion) independently to copies of single images.
Single-view: The proposed Gaussian Mixture MDL (GM-MDL) regularizer outperformed VIB and Category-Dependent VIB (CDVIB) in terms of test accuracy (Table 2).
Multi-view: The proposed Gaussians Product Mixture MDL (GPM-MDL) regularizer outperformed both no regularization and applying per-view VIB regularization across various distortion scenarios, numbers of views, and datasets (Table 4), demonstrating its practical effectiveness.
In summary, the paper provides theoretical justification (MDL-based generalization bounds) for encouraging redundancy in distributed multi-view learning and introduces a practical regularization method (GPM-MDL) based on learning data-dependent Gaussian product mixture priors, which demonstrably improves performance in experiments. The method uses efficient, distributed updates involving attention-like mechanisms.