
Bilateral Distribution Compression

Updated 23 September 2025
  • Bilateral Distribution Compression is a paradigm that simultaneously compresses the sample and feature axes to maintain distributional fidelity.
  • It uses a two-stage process: training a distribution-sensitive autoencoder to learn a latent representation and then compressing the latent space via EMMD minimization.
  • A theoretical bound guarantees that controlling RMMD and EMMD keeps the decoded distribution error (DMMD) low, enabling scalable, accurate data coresets.

Bilateral Distribution Compression (BDC) is a data compression paradigm in which both sample size and ambient feature dimensionality are reduced, subject to rigorous distributional fidelity constraints. Unlike classical approaches that act solely on the sample axis (subsampling, coreset selection) or the feature axis (dimensionality reduction), BDC seeks to maintain the original data distribution through a two-stage optimization: first, it learns a low-dimensional latent embedding sensitive to distributional structure; second, it constructs a compressed set in latent space whose decoded images closely approximate the distribution of the original data. This is formalized through the control of Maximum Mean Discrepancy (MMD) distances measured at key points in the compression pipeline.

1. Motivation and Foundations

Traditional distribution compression methods (e.g., kernel herding, greedy MMD minimization, ambient coreset selection) operate within the full ambient space and generally scale poorly when both the number of samples (n) and feature dimension (d) are large. Methods such as PCA offer dimensionality reduction, but unless combined with targeted sampling, downstream models may lose critical distributional properties. BDC is designed for the modern regime in which both axes are large and potentially redundant, enabling linear time and memory scaling in n and d via latent-space compression.

2. Central Metrics: RMMD, EMMD, DMMD

BDC introduces three kernel-based distributional distances, each serving a distinct function:

Metric | Purpose                              | Domain
RMMD   | Measure embedding fidelity           | Ambient/latent
EMMD   | Quantify compressed set alignment    | Latent
DMMD   | Evaluate decoded distribution error  | Ambient

  • Reconstruction MMD (RMMD):

\mathrm{RMMD}(P_X, P_{\varphi(\psi(X))})

measures the distance between the original distribution P_X and its reconstruction \varphi(\psi(X)) by the encoder \psi and decoder \varphi. In the linear case (projection V subject to V^\top V = I), minimizing RMMD induces a subspace akin to principal components under quadratic kernels.

  • Encoded MMD (EMMD):

\mathrm{EMMD}(P_{\psi(X)}, P_Z)

quantifies the distributional discrepancy between embedded data and the proposed compressed set Z in latent space.

  • Decoded MMD (DMMD):

\mathrm{DMMD}(P_X, P_{\varphi(Z)})

is the ultimate measure of fidelity between the original and decoded compressed sets, guiding the construction of coresets faithful to the empirical or target distribution.

A key theoretical result is:

\mathrm{DMMD}(P_X, P_{\varphi(Z)}) \leq \mathrm{RMMD}(P_X, P_{\varphi(\psi(X))}) + \mathrm{EMMD}(P_{\psi(X)}, P_Z)

which ensures that controlling RMMD and EMMD suffices for overall fidelity.
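This triangle-type bound can be checked numerically. The sketch below is illustrative only: it uses a biased Gaussian-kernel MMD estimator, a random orthonormal linear map as a stand-in for a trained encoder/decoder pair, and a random subsample as the compressed set (none of these choices come from the paper):

```python
import numpy as np

def mmd(X, Y, gamma=0.5):
    # Biased empirical MMD: the RKHS norm ||mu_X - mu_Y||_H under a Gaussian kernel.
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    val = k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
    return float(np.sqrt(max(val, 0.0)))

rng = np.random.default_rng(0)
n, d, p, m = 200, 10, 3, 20
X = rng.standard_normal((n, d))

# Hypothetical linear encoder/decoder: a random orthonormal projection V (V^T V = I).
V, _ = np.linalg.qr(rng.standard_normal((d, p)))
encode = lambda A: A @ V
decode = lambda B: B @ V.T

# A naive compressed set: a random subsample of the encoded data.
Z = encode(X)[rng.choice(n, size=m, replace=False)]

rmmd = mmd(X, decode(encode(X)))          # reconstruction fidelity
emmd = mmd(decode(encode(X)), decode(Z))  # EMMD under the pulled-back kernel
dmmd = mmd(X, decode(Z))                  # decoded distribution error

assert dmmd <= rmmd + emmd + 1e-9         # the certified bound holds
```

With the latent kernel pulled back through the decoder, the bound is exactly the triangle inequality on kernel mean embeddings, so it holds for any encoder/decoder pair, not just trained ones.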

3. Two-Stage Compression Procedure

Stage 1: Distributional Autoencoder Training

  • An encoder \psi: \mathbb{R}^d \rightarrow \mathbb{R}^p and decoder \varphi: \mathbb{R}^p \rightarrow \mathbb{R}^d are trained jointly on X to minimize RMMD (possibly in convex combination with mean squared reconstruction error, MSRE).
  • For linear models, optimization is performed on the Stiefel manifold; for nonlinear cases (BDC-NL), standard neural architectures are used under bottleneck constraints.
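For intuition on the linear case: under the constraint V^\top V = I, the reconstruction-optimal subspace is the principal subspace, obtainable directly from an SVD rather than manifold optimization. A small sketch with made-up data and dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 500, 5, 2
# Data with approximately 2-D linear structure embedded in 5 dimensions.
X = rng.standard_normal((n, p)) @ np.array([[3.0, 0.0, 0.0, 0.0, 0.0],
                                            [0.0, 1.0, 0.0, 0.0, 0.0]])
X += 0.1 * rng.standard_normal((n, d))    # small off-subspace noise
X -= X.mean(axis=0)

# MSE-optimal orthonormal encoder: top-p right singular vectors (PCA subspace).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:p].T                              # d x p, with V^T V = I
recon = X @ V @ V.T                       # decode(encode(X)) for the linear pair

rel_err = np.linalg.norm(X - recon) / np.linalg.norm(X)  # small: data is near-planar
```

Since most of the variance lies in the planted 2-D subspace, the relative reconstruction error stays small, mirroring the RMMD-optimal linear embedding under a quadratic kernel.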

Stage 2: Latent Set Compression

  • Encoded representations Z_{data} = \psi(X) are computed.
  • A compressed set Z = \{z_1, \ldots, z_m\} \subset \mathbb{R}^p (m \ll n) is initialized (e.g., by sampling latent codes) and optimized to minimize EMMD w.r.t. Z_{data}; the latent kernel h(z, z') may be pulled back from the ambient kernel k(\varphi(z), \varphi(z')).
  • The decoded set \varphi(Z) constitutes the final compressed dataset.
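A minimal stand-in for Stage 2 is sketched below, using greedy herding-style selection over the encoded points instead of the gradient-based EMMD minimization described above (the kernel choice and bandwidth are illustrative assumptions):

```python
import numpy as np

def gauss(A, B, gamma=0.5):
    # Gaussian kernel matrix between row sets A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def greedy_compress(Zdata, m, gamma=0.5):
    """Greedily grow a size-m latent set minimizing the (biased) squared MMD
    to Zdata -- a herding-style stand-in for gradient-based EMMD descent."""
    chosen = []
    for _ in range(m):
        best_i, best_val = None, np.inf
        for i in range(len(Zdata)):
            if i in chosen:
                continue
            S = Zdata[chosen + [i]]
            # Squared MMD up to a constant (the Zdata-Zdata term is fixed).
            val = gauss(S, S, gamma).mean() - 2.0 * gauss(S, Zdata, gamma).mean()
            if val < best_val:
                best_i, best_val = i, val
        chosen.append(best_i)
    return Zdata[chosen]

rng = np.random.default_rng(1)
Zdata = rng.standard_normal((100, 3))   # stand-in for psi(X)
Z = greedy_compress(Zdata, m=10)        # compressed latent set, m << n
```

Selecting from the encoded points keeps the sketch simple; the method itself optimizes free latent locations, which generally achieves lower EMMD than any subset.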

4. Complexity, Performance, and Practical Advantages

The BDC pipeline offers linear scaling (\mathcal{O}(nd)) in both sample size and dimension. Empirical studies across synthetic manifolds, images (MNIST, CT-Slice), and real-world regression/classification settings demonstrate that BDC matches or outperforms ambient-space compression (ADC) and kernel herding, particularly in preserving distributional features (i.e., DMMD scores).

Key empirical findings:

  • BDC maintains distributional fidelity even with aggressive reduction in both n and d.
  • Under a quadratic kernel, the linear autoencoder (BDC-L) recovers PCA's principal subspaces; the nonlinear autoencoder (BDC-NL) achieves finer manifold alignment.
  • In supervised extensions (via tensor-product kernels), joint features and responses are compressed (RJMMD, EJMMD, DJMMD), facilitating label-preserving coresets.
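The joint (feature, label) discrepancies mentioned above are built from a tensor-product kernel. A hedged sketch of the construction, where the Gaussian factor kernels and bandwidths are assumptions for illustration:

```python
import numpy as np

def gauss(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def joint_mmd2(X1, Y1, X2, Y2, gx=0.5, gy=0.5):
    """Biased squared MMD under the tensor-product kernel
    k((x, y), (x', y')) = k_X(x, x') * k_Y(y, y')."""
    k11 = gauss(X1, X1, gx) * gauss(Y1, Y1, gy)
    k22 = gauss(X2, X2, gx) * gauss(Y2, Y2, gy)
    k12 = gauss(X1, X2, gx) * gauss(Y1, Y2, gy)
    return k11.mean() + k22.mean() - 2.0 * k12.mean()

rng = np.random.default_rng(2)
X, Y = rng.standard_normal((80, 4)), rng.standard_normal((80, 1))
# The joint discrepancy of a sample against itself is (numerically) zero.
zero = joint_mmd2(X, Y, X, Y)
```

Because the product kernel characterizes joint distributions (when its factors are characteristic), driving this quantity down preserves feature-label dependence, which is what makes label-preserving coresets possible.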

5. Theoretical Guarantees

BDC is supported by theoretical certification:

  • If both RMMD and EMMD are sufficiently small (ideally, tending to zero), DMMD is also small, ensuring the decoded set is distributionally close to the source.
  • The bound

\mathrm{DMMD}(P_X, P_{\varphi(Z)}) \leq \mathrm{RMMD}(P_X, P_{\varphi(\psi(X))}) + \mathrm{EMMD}(P_{\psi(X)}, P_Z)

serves as the main guarantee, motivating the two-stage minimization.

6. Extensions, Limitations, and Application Domains

  • Supervised, Semi-supervised, or Conditional Compression: By employing tensor-product kernels, BDC can be adapted to preserve joint feature-label distributions.
  • Manifold assumption: BDC is most effective when data lies approximately on a low-dimensional manifold. When manifold structure is absent, the effectiveness of feature compression may degrade.
  • Architecture: Linear BDC (BDC-L) suffices and is preferred when manifold structure is simple, while nonlinear BDC (BDC-NL) targets more complex geometries but may introduce training variability.

BDC generalizes and subsumes several paradigms:

  • Ambient Distribution Compression (ADC): Operates on full-dimensional data; BDC is designed to outperform ADC when manifold structure is present.
  • Distributional autoencoding: BDC’s RMMD loss aligns with recent trends in distribution-preserving compression and Bayesian data summarization (Harth-Kitzerow et al., 2020).
  • Distributed Compression and Distributed Detection: In two-node detection systems, BDC-style separation of compression axes aligns with rate-error-distortion tradeoffs (Katz et al., 2016), adaptive entropy bottlenecks (Ulhaq et al., 18 Jun 2024), and bi-directional channel models in federated learning (Egger et al., 31 Jan 2025).

7. Conclusion

Bilateral Distribution Compression operationalizes simultaneous sample and dimensionality reduction subject to rigorous control of decoded distributional error. By minimizing MMD-based discrepancies at embedding, compression, and decoding points, BDC yields coresets that are compact yet highly representative of the original dataset. These properties make BDC suitable for scalable machine learning pipelines, distributional modeling, and robust downstream analysis in both unsupervised and supervised scenarios (Broadbent et al., 22 Sep 2025).
