CompVAE: Compositional Variational AutoEncoder
- CompVAE is a generative model that represents data as sets of elements, ensuring invariance to element order and flexible set sizes.
- It uses a GNN-based inference network to produce per-part Gaussian latents, together with a global latent code that captures interactions among parts.
- The architecture supports programmable latent operations like addition and removal, demonstrating robust synthetic reconstructions.
CompVAE (Compositional Variational AutoEncoder) is a generative model designed for data exhibiting a multi-ensemblist structure—datasets where each instance consists of a set or aggregation of elements ("parts") rather than a single vectorial observation. The model is derived from Bayesian variational principles and enables explicit compositionality: it allows for the representation, generation, and manipulation of wholes based on arbitrary combinations of their constituent parts, exhibiting invariance to both the order and the number of elements. CompVAE achieves this by factorizing the generative process and inference so as to support programmable operations—such as addition and removal of elements—in the learned latent space (Berger et al., 2020).
1. Model Structure and Generative Process
The core of CompVAE is a generative model in which each observed datapoint $x$ is described by a (possibly variable-size) set of elements $\{l_1, \dots, l_n\}$, where each $l_i$ is a categorical or symbolic label drawn from a finite set $\mathcal{L}$. The generative process is specified as follows:
- Each part $l_i$ is associated with a latent variable $z_i \sim p(z_i \mid l_i)$.
- All part-latents are aggregated via an order-invariant operation, the sum, to obtain an intermediate latent $\bar{z} = \sum_i z_i$.
- A global latent code $c$ is sampled conditionally on $\bar{z}$, capturing interactions or global dependencies not explained by the sum of per-part latents.
- The observation $x$ is sampled from a distribution parameterized by both $\bar{z}$ and $c$.
The model factorization is given by:

$$p(x, c, z_{1:n} \mid l_{1:n}) = p(x \mid c, \bar{z})\; p(c \mid \bar{z}) \prod_{i=1}^{n} p(z_i \mid l_i), \qquad \text{where } \bar{z} = \sum_{i=1}^{n} z_i.$$
This design ensures that the generation of is invariant to the order of the elements and robust to the number or composition of the set.
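The following PyTorch sketch illustrates this generative path. It is a minimal illustration rather than the reference implementation: the module names (`CompVAEDecoder`, `c_net`, `x_net`), layer sizes, and the choice of a Gaussian observation model for $x$ are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class CompVAEDecoder(nn.Module):
    """Illustrative sketch of the CompVAE generative path (not the reference implementation)."""

    def __init__(self, num_labels, z_dim=8, c_dim=8, x_dim=32, hidden=64):
        super().__init__()
        # Per-label prior p(z_i | l_i): one (mean, log-variance) pair per symbolic label.
        self.prior_mu = nn.Embedding(num_labels, z_dim)
        self.prior_logvar = nn.Embedding(num_labels, z_dim)
        # Conditional prior p(c | z_bar) over the global latent.
        self.c_net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 2 * c_dim))
        # Observation model p(x | c, z_bar); a Gaussian likelihood is assumed here.
        self.x_net = nn.Sequential(nn.Linear(z_dim + c_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, x_dim))

    def forward(self, labels):
        # labels: LongTensor of shape (n_parts,) listing the parts of one instance.
        z = Normal(self.prior_mu(labels),
                   (0.5 * self.prior_logvar(labels)).exp()).rsample()   # one z_i per part
        z_bar = z.sum(dim=0)                                            # order-invariant aggregation
        c_mu, c_logvar = self.c_net(z_bar).chunk(2, dim=-1)
        c = Normal(c_mu, (0.5 * c_logvar).exp()).rsample()              # global latent c | z_bar
        x_mean = self.x_net(torch.cat([z_bar, c], dim=-1))              # mean of p(x | c, z_bar)
        return x_mean
```

Sampling with labels `[0, 2, 2]`, for instance, composes a whole from one part of type 0 and two parts of type 2; permuting the labels leaves the distribution of $x$ unchanged because only the sum $\bar{z}$ enters the decoder.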
2. Variational Inference and Training
Learning in CompVAE proceeds by maximizing the Evidence Lower Bound (ELBO) via variational inference:

$$\mathcal{L}(x, l_{1:n}) = \mathbb{E}_{q(c,\, z_{1:n} \mid x,\, l_{1:n})}\big[\log p(x \mid c, \bar{z})\big] - \mathrm{KL}\big(q(c, z_{1:n} \mid x, l_{1:n}) \,\big\|\, p(c, z_{1:n} \mid l_{1:n})\big).$$
The inference model factorizes as

$$q(c, z_{1:n} \mid x, l_{1:n}) = q(c \mid x, \bar{z})\; q(z_{1:n} \mid x, l_{1:n}),$$

with the per-part inference over $z_{1:n}$ modeled by a fully-connected Graph Neural Network (GNN) architecture, ensuring permutation invariance and capturing correlations among part-latents. This structure allows the encoding of sets of arbitrary size and flexible correlation modeling.
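A minimal sketch of such an encoder is given below, assuming a few rounds of message passing over the complete graph of parts; the message and update networks, their sizes, and the GRU-style node update are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class PartEncoderGNN(nn.Module):
    """Sketch of a fully-connected message-passing encoder for q(z_i | x, l_1..l_n).
    The specific message/update networks and number of rounds are assumptions."""

    def __init__(self, num_labels, x_dim=32, z_dim=8, hidden=64, rounds=2):
        super().__init__()
        self.rounds = rounds
        self.embed = nn.Embedding(num_labels, hidden)
        self.x_proj = nn.Linear(x_dim, hidden)
        self.message = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.update = nn.GRUCell(hidden, hidden)
        self.readout = nn.Linear(hidden, 2 * z_dim)   # (mean_i, log-variance_i) per part

    def forward(self, x, labels):
        # x: (x_dim,) observation; labels: (n_parts,) symbolic labels of its parts.
        h = self.embed(labels) + self.x_proj(x)       # initial node states, one per part
        n = h.shape[0]
        for _ in range(self.rounds):
            # Complete graph: every part exchanges messages with every other part.
            pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                               h.unsqueeze(0).expand(n, n, -1)], dim=-1)
            msgs = self.message(pairs).sum(dim=1)     # sum-aggregation keeps permutation symmetry
            h = self.update(msgs, h)
        mu, logvar = self.readout(h).chunk(2, dim=-1)
        return mu, logvar                             # parameters of the per-part Gaussians
```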
The per-part multivariate Gaussian posteriors for the $z_i$ are parameterized such that the variance of their sum $\bar{z}$ can be precisely controlled, which is crucial for stable and robust reconstruction. The KL-divergence term in the ELBO is analytic because both the factorized priors and the posteriors are Gaussian.
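The sketch below assembles a single-sample ELBO estimate from the two hypothetical modules above (`CompVAEDecoder` and `PartEncoderGNN`), using the analytic Gaussian KL for the per-part terms. The Gaussian likelihood with a fixed scale, and the shortcut of reusing $p(c \mid \bar{z})$ in place of a separate posterior over $c$ (which makes its KL term vanish), are simplifying assumptions of this illustration.

```python
import torch
from torch.distributions import Normal, kl_divergence

def elbo_sketch(x, labels, encoder, decoder, obs_scale=0.1):
    """Single-sample ELBO estimate; the Gaussian likelihood and its fixed scale are assumptions."""
    # Per-part posteriors q(z_i | x, l_1..l_n) from the GNN encoder.
    q_mu, q_logvar = encoder(x, labels)
    q_z = Normal(q_mu, (0.5 * q_logvar).exp())
    z = q_z.rsample()
    z_bar = z.sum(dim=0)

    # Label-conditioned priors p(z_i | l_i); the KL is analytic since both sides are Gaussian.
    p_z = Normal(decoder.prior_mu(labels), (0.5 * decoder.prior_logvar(labels)).exp())
    kl_parts = kl_divergence(q_z, p_z).sum()

    # Global latent: a full model would use a posterior q(c | x, z_bar); here the conditional
    # prior p(c | z_bar) is reused purely to keep the sketch self-contained.
    c_mu, c_logvar = decoder.c_net(z_bar).chunk(2, dim=-1)
    c = Normal(c_mu, (0.5 * c_logvar).exp()).rsample()

    # Reconstruction term log p(x | c, z_bar).
    x_mean = decoder.x_net(torch.cat([z_bar, c], dim=-1))
    log_px = Normal(x_mean, obs_scale).log_prob(x).sum()

    return log_px - kl_parts   # the KL term for c is zero under the shortcut above
```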
3. Compositional Latent Operations and Invariance
The compositional property of CompVAE arises from the sum-aggregation of part-latents. For generation:
- Addition of elements: To include a new part $l_{n+1}$, compute its latent $z_{n+1}$ and add it into the sum $\bar{z}$.
- Removal of elements: To remove a part, simply omit the corresponding $z_i$ from the sum.
This affords post-training programmability: operations on the set of elements translate into interpretable manipulations in data space, as sketched below. The model's generative mechanism is invariant to element order by construction and flexible with respect to set size, supporting compositions (subsets or supersets) not seen during training.
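A minimal sketch of such latent-space edits, assuming the hypothetical `CompVAEDecoder` introduced in Section 1, is:

```python
import torch
from torch.distributions import Normal

@torch.no_grad()
def recompose(decoder, z_parts, keep=None, extra_labels=None):
    """Edit a whole by dropping or appending part-latents; this API is illustrative only."""
    z = z_parts if keep is None else z_parts[keep]            # removal: drop rows from the set
    if extra_labels is not None:                              # addition: sample new z_i from p(z_i | l_i)
        mu = decoder.prior_mu(extra_labels)
        std = (0.5 * decoder.prior_logvar(extra_labels)).exp()
        z = torch.cat([z, Normal(mu, std).sample()], dim=0)
    z_bar = z.sum(dim=0)                                      # re-aggregate; element order never matters
    c_mu, c_logvar = decoder.c_net(z_bar).chunk(2, dim=-1)
    c = Normal(c_mu, (0.5 * c_logvar).exp()).sample()
    return decoder.x_net(torch.cat([z_bar, c], dim=-1))
```

For example, `recompose(decoder, z_parts, keep=[0, 2])` drops the second part before decoding, while `recompose(decoder, z_parts, extra_labels=torch.tensor([3]))` appends a freshly sampled part of type 3.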
4. Experimental Validation
CompVAE was empirically validated on synthetic benchmarks designed to test its compositionality and invariance properties:
- 1D Synthetic Problem (Nonlinear Sine Aggregation): Each part is parameterized by a frequency, amplitude, and phase; parts are combined nonlinearly by summing their sine curves and applying a scaled nonlinearity (a toy generator in this spirit is sketched below). The per-part latents capture nearly all of the relevant information while the global latent is used only minimally, yielding smooth reconstructions under element addition/removal and generalization to set sizes beyond those observed during training.
- 2D Synthetic Problem (Colored Spots): Here, each part is a colored point placed in an image. CompVAE enables coherent generation under arbitrary addition or removal of spots, with latent representations robust to set size and composition.
In all cases, CompVAE achieves smooth, plausible reconstructions and supports operations such as incremental part addition, showing a clear partition of information between the per-part latents and the global interaction code. The model remains robust when the number of elements at generation time exceeds the range seen during training.
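For concreteness, a toy generator in the spirit of the 1D benchmark is sketched below; the parameter ranges and the `tanh` squashing are stand-ins chosen here, not the paper's exact settings.

```python
import numpy as np

def sample_sine_instance(n_parts, n_points=128, rng=None):
    """Toy 1D generator: each part is a sine curve with its own frequency, amplitude, and
    phase; the curves are summed and passed through a scaled nonlinearity. Parameter
    ranges and the tanh squashing are illustrative stand-ins."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 1.0, n_points)
    freq = rng.uniform(1.0, 5.0, size=n_parts)       # per-part frequency
    amp = rng.uniform(0.5, 1.5, size=n_parts)        # per-part amplitude
    phase = rng.uniform(0.0, 2 * np.pi, size=n_parts)
    parts = amp[:, None] * np.sin(2 * np.pi * freq[:, None] * t[None, :] + phase[:, None])
    return np.tanh(parts.sum(axis=0) / n_parts)      # scaled nonlinearity applied to the sum
```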
5. Mathematical Formulations and Theoretical Properties
The key mathematical components underpinning CompVAE are:
- Generative factorization: explicit partitioning of information into local (per-part) and global (interaction) components.
- ELBO loss: the KL terms are analytically tractable thanks to the Gaussian parameterization of the priors and posteriors.
- Aggregation invariance: the sum operation confers permutation invariance and generalization to varying set sizes.
- GNN-based Inference: Message-passing enables modeling of inter-part relationships while maintaining order- and size-invariance.
These structures collectively support the expressive, compositional, and reprogrammable properties desired in applications involving complex set- or group-structured data.
6. Significance and Use Cases
CompVAE provides a compositional generative approach especially suited for domains where instances are built from sets of elements that can be flexibly combined, such as:
- Simulation and control: where scenarios can be parametrically constructed from collections of abstract items.
- Vision and graphics: for object-based composition or decomposition.
- Energy management and similar aggregate modeling: e.g., aggregating curves for households with variable compositions.
Two major features distinguish CompVAE: robust invariance to the order and number of components, and the capacity for controlled, interpretable generation by latent modifications. These properties are empirically validated in the synthetic experiments reported.
7. Summary Table: Key Properties of CompVAE
| Property | CompVAE Implementation |
|---|---|
| Data structure handled | Sets of elements (multi-ensemblist) |
| Compositionality | Yes (add/remove parts in latent) |
| Order invariance | Yes (sum-aggregation) |
| Size robustness | Generalizes to unseen set sizes |
| Division of information | Per-part and global latent codes |
| Inference architecture | Permutation-invariant GNN |
CompVAE thus constitutes a theoretically principled and practically validated VAE extension for compositional, programmable, and invariant generative modeling of set-structured data, enabling novel forms of controllable synthesis and simulation beyond those accessible to standard VAEs (Berger et al., 2020).