
Deep Sets Theorem

Updated 17 January 2026
  • The Deep Sets theorem is a rigorous characterization of permutation-invariant set functions in terms of sum pooling and continuous mappings.
  • The theorem decomposes any invariant function into per-element embeddings aggregated by summation, establishing both uniqueness of the pooled signature and universality.
  • It also characterizes permutation-equivariant functions, motivating parameter-sharing schemes in modern deep learning architectures.

In contemporary machine learning, numerous tasks involve inputs that are most naturally represented as finite sets rather than as ordered, fixed-length vectors. Such domains include point-cloud classification, estimation of population statistics, set expansion, outlier detection, and related scientific and industrial applications. The defining requirement for any function applied to such sets is permutation invariance: its output must not depend on the order of elements within the set. The Deep Sets theorem, introduced by Zaheer et al. (2017), offers a full characterization of permutation-invariant set functions and prescribes universal neural network architectures that exactly respect this symmetry. In parallel, the theorem delineates the structure of permutation-equivariant functions, which produce per-element outputs that transform compatibly under input re-ordering.

1. Permutation-Invariant Set Functions

Let $\mathfrak{X}$ denote the ground set or universe of possible elements (e.g., $\mathfrak{X} = \mathbb{R}^d$), and let $X = \{x_1, \ldots, x_m\} \subset \mathfrak{X}$ be a finite input set. Define $2^{\mathfrak{X}}$ as the collection of all finite subsets of $\mathfrak{X}$. A function $f : 2^{\mathfrak{X}} \to \mathcal{Y}$ is called permutation invariant if for every finite set $X$ and every permutation $\pi$ of $\{1, \ldots, m\}$, it holds that

$$f(\{x_1, \ldots, x_m\}) = f(\{x_{\pi(1)}, \ldots, x_{\pi(m)}\}).$$

The Deep Sets theorem states that any continuous permutation-invariant function $f : [0,1]^M \to \mathbb{R}$ can be decomposed as

$$f(\{x_1, \ldots, x_M\}) = \rho\left( \sum_{m=1}^{M} \phi(x_m) \right),$$

where $\phi : \mathbb{R} \to \mathbb{R}^{M+1}$ and $\rho : \mathbb{R}^{M+1} \to \mathbb{R}$ are continuous mappings. In the more general countable-universe case, this representation is exact for all set functions $f : 2^{\mathfrak{X}} \to \mathbb{R}$ (without requiring continuity), with suitable choices of $\phi$ and $\rho$.
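The decomposition can be sanity-checked numerically. The sketch below uses an arbitrary illustrative choice of $\phi$ and $\rho$ (random features and a squared norm, not the constructions from the proof) and confirms that $\rho\left(\sum_m \phi(x_m)\right)$ is unchanged under reordering:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))  # illustrative fixed embedding weights

def phi(x):
    # Per-element embedding phi : R^2 -> R^3 (arbitrary choice for the demo).
    return np.tanh(W @ x)

def rho(z):
    # Readout on the pooled representation (arbitrary choice for the demo).
    return float(np.sum(z ** 2))

def f(X):
    # f(X) = rho(sum_m phi(x_m)): permutation invariant by construction,
    # since summation ignores the order of its terms.
    return rho(sum(phi(x) for x in X))

X = [rng.standard_normal(2) for _ in range(5)]
perm = rng.permutation(5)
assert np.isclose(f(X), f([X[i] for i in perm]))
```

Any reordering of the five elements yields the same output up to floating-point summation order.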

2. Proof Structure and Key Ingredients

The proof comprises two critical components:

A. Uniqueness of the 'sum-of-embeddings' signature:

— For countable universes, one can construct $\phi(x)$ as, e.g., $4^{-c(x)}$ for a fixed enumeration $c : \mathfrak{X} \to \mathbb{N}$. The sum $\sum_{x \in X} \phi(x)$ uniquely encodes the set $X$, allowing $\rho$ to reconstruct $f(X)$.

— For fixed-size sets over continuous domains, the embedding is $\phi(x) = [1, x, x^2, \ldots, x^M] \in \mathbb{R}^{M+1}$. The pooled vector $E(X)$,

$$E(X) := \sum_{m=1}^{M} \phi(x_m) = \left[M, \sum_m x_m, \sum_m x_m^2, \ldots, \sum_m x_m^M\right],$$

uniquely determines the multiset $\{x_1, \ldots, x_M\}$ up to permutation by the Newton–Girard identities and, restricted to sorted inputs, is a continuous bijection with continuous inverse.
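The countable-universe construction above works because $\sum_{x \in X} 4^{-c(x)}$ acts as a base-4 indicator code: each element toggles one "digit", so no two subsets collide. The following sketch (toy universe and names are illustrative) verifies this exhaustively for a four-element universe:

```python
from itertools import combinations

universe = ["a", "b", "c", "d"]                  # toy countable universe
c = {x: k + 1 for k, x in enumerate(universe)}   # fixed enumeration c : X -> N

def signature(S):
    # phi(x) = 4^{-c(x)}; the pooled signature is the sum over the set.
    return sum(4.0 ** (-c[x]) for x in S)

subsets = [frozenset(s) for r in range(len(universe) + 1)
           for s in combinations(universe, r)]
sigs = [signature(S) for S in subsets]
# All 2^4 = 16 subsets receive distinct signatures, so rho can in
# principle recover the set (and hence f(X)) from the sum alone.
assert len(set(sigs)) == len(subsets)
```

The base 4 (rather than 2) leaves slack so the argument extends to multisets of bounded multiplicity; any base strictly greater than the maximum multiplicity plus one works.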

B. Construction of $\rho$:

With a continuous bijection $E : \mathrm{SortedSets} \to \mathbb{R}^{M+1}$, define $\rho(z) := f(E^{-1}(z))$, which is automatically continuous, yielding $f(X) = \rho\left(\sum_m \phi(x_m)\right)$ for any set $X$.
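The inverse $E^{-1}$ can be computed explicitly: the Newton–Girard recursion converts the power sums in $E(X)$ into elementary symmetric polynomials, whose monic polynomial has the original elements as roots. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def pooled_embedding(xs):
    # E(X) = [M, sum x, sum x^2, ..., sum x^M] for M = len(xs).
    M = len(xs)
    return np.array([sum(x ** k for x in xs) for k in range(M + 1)])

def recover_multiset(E):
    # Invert E: power sums -> elementary symmetric polynomials (Newton-Girard)
    # -> roots of the monic polynomial they define.
    M = int(round(E[0]))
    p = E[1:]              # p[k-1] = sum_m x_m^k
    e = [1.0]              # e_0 = 1
    for k in range(1, M + 1):
        # Newton-Girard: k e_k = sum_{i=1}^{k} (-1)^{i-1} e_{k-i} p_i
        s = sum((-1) ** (i - 1) * e[k - i] * p[i - 1] for i in range(1, k + 1))
        e.append(s / k)
    # prod_m (x - x_m) = sum_k (-1)^k e_k x^{M-k}, coefficients highest first.
    coeffs = [(-1) ** k * e[k] for k in range(M + 1)]
    return np.sort(np.roots(coeffs).real)

xs = [0.3, -1.1, 0.5, 2.0]
assert np.allclose(recover_multiset(pooled_embedding(xs)), np.sort(xs), atol=1e-6)
```

Given this inverse, $\rho$ is simply $f$ composed with `recover_multiset`; root-finding is numerically delicate for large $M$, which is why practical Deep Sets models learn $\rho$ rather than construct it.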

3. Permutation-Equivariant Functions

For mappings $f : \mathfrak{X}^M \to \mathcal{Y}^M$ that commute with input permutations, i.e.,

$$f([x_{\pi(1)}, \ldots, x_{\pi(M)}]) = [f(x)_{\pi(1)}, \ldots, f(x)_{\pi(M)}],$$

the necessary and sufficient condition (for a single layer of the form $f_\Theta(x) = \sigma(\Theta x)$, with $\sigma$ an elementwise activation and $x \in \mathbb{R}^M$) is that $\Theta \in \mathbb{R}^{M \times M}$ commute with all permutation matrices. By group theory, this restricts $\Theta$ to the form

$$\Theta = \lambda I + \gamma \mathbf{1}\mathbf{1}^{\mathsf{T}},$$

where $\lambda, \gamma \in \mathbb{R}$, $I$ is the identity, and $\mathbf{1}\mathbf{1}^{\mathsf{T}}$ is the all-ones matrix. This yields

$$f(x)_i = \sigma\left(\lambda x_i + \gamma \sum_{j=1}^{M} x_j\right).$$

For general $x \in \mathbb{R}^{M \times D}$ and output in $\mathbb{R}^{M \times D'}$, the layer generalizes to

$$f(X)_m = \sigma\left( x_m \Lambda + \Big(\sum_{j=1}^{M} x_j\Big) \Gamma \right),$$

where $\Lambda, \Gamma \in \mathbb{R}^{D \times D'}$.
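The general layer is a few lines of NumPy; the sketch below (random weights, $\sigma = \tanh$ as an arbitrary choice) checks that permuting the input rows permutes the output rows identically:

```python
import numpy as np

rng = np.random.default_rng(1)
M, D, Dp = 6, 3, 4
Lam = rng.standard_normal((D, Dp))   # Lambda in R^{D x D'}
Gam = rng.standard_normal((D, Dp))   # Gamma  in R^{D x D'}

def equivariant_layer(X):
    # f(X)_m = sigma(x_m Lambda + (sum_j x_j) Gamma); sigma = tanh here.
    return np.tanh(X @ Lam + X.sum(axis=0, keepdims=True) @ Gam)

X = rng.standard_normal((M, D))
perm = rng.permutation(M)
# Equivariance: permuting rows of the input permutes rows of the output.
assert np.allclose(equivariant_layer(X[perm]), equivariant_layer(X)[perm])
```

The sum over rows is the only interaction between set elements, which is exactly what the $\lambda I + \gamma \mathbf{1}\mathbf{1}^{\mathsf{T}}$ structure permits.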

4. Neural Architectures for Sets

Invariant models:

— Apply a shared feedforward ($\phi$-)network, $\phi : \mathfrak{X} \to \mathbb{R}^k$, to each set element.

— Aggregate via sum-pooling, $s = \sum_{m=1}^{M} \phi(x_m)$.

— A second ($\rho$-)network, $\rho : \mathbb{R}^k \to \mathcal{Y}$, processes the pooled vector.

Given sufficient expressivity of $\phi$ and $\rho$ (e.g., multilayer MLPs), the architecture can approximate any continuous invariant function.
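The three steps above assemble into a compact model. The following sketch (untrained random-weight MLPs, sizes chosen arbitrarily for illustration) wires up $\phi$, sum-pooling, and $\rho$, and checks invariance:

```python
import numpy as np

rng = np.random.default_rng(2)

def mlp(sizes):
    # Illustrative helper: a random-weight MLP with tanh hidden layers.
    Ws = [rng.standard_normal((a, b)) / np.sqrt(a)
          for a, b in zip(sizes, sizes[1:])]
    def apply(x):
        for W in Ws[:-1]:
            x = np.tanh(x @ W)
        return x @ Ws[-1]
    return apply

phi = mlp([2, 32, 8])   # shared per-element network phi : R^2 -> R^8
rho = mlp([8, 32, 1])   # readout network rho : R^8 -> R

def deep_sets(X):
    # X has shape (M, 2); the row order of X does not affect the output.
    return rho(phi(X).sum(axis=0))

X = rng.standard_normal((5, 2))
assert np.allclose(deep_sets(X), deep_sets(X[::-1]))
```

In a real model `phi` and `rho` would be trained end-to-end; the symmetry holds regardless of the weights, which is the architectural point.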

Equivariant models:

— Use layers of the form

$$h_m^{(\ell+1)} = \sigma\left(\lambda^{(\ell)} h_m^{(\ell)} + \gamma^{(\ell)} \sum_j h_j^{(\ell)} + b^{(\ell)}\right),$$

where $\lambda, \gamma, b$ are scalars or small matrices; equivariance is preserved across stacked layers. While mean and max pooling are also permutation-invariant and used in practice, only sum pooling guarantees universality per the main theorem.
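Because each layer is equivariant, a stack of them is too, and a final sum-pool turns the whole network into an invariant model. A minimal sketch with scalar parameters (random values, $\sigma = \tanh$, all illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_layer(params):
    lam, gam, b = params
    def apply(H):
        # h_m <- sigma(lam * h_m + gam * sum_j h_j + b); sigma = tanh.
        return np.tanh(lam * H + gam * H.sum(axis=0, keepdims=True) + b)
    return apply

layers = [make_layer(rng.standard_normal(3)) for _ in range(3)]

def stack(H):
    for layer in layers:
        H = layer(H)
    return H

H = rng.standard_normal((7, 4))
perm = rng.permutation(7)
# Equivariance survives composition ...
assert np.allclose(stack(H[perm]), stack(H)[perm])
# ... and a final sum-pool makes the full network permutation invariant.
assert np.allclose(stack(H[perm]).sum(axis=0), stack(H).sum(axis=0))
```

Swapping the final `sum` for `mean` or `max` would keep invariance but, per the discussion above, forfeit the universality guarantee.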

5. Universality and Architectural Necessity

The theorem uniquely identifies sum-pooling over elementwise embeddings as the only mechanism—modulo mild technicalities—for universal function approximation subject to permutation symmetry. Approaches such as RNN processing, arbitrary post-pooling with non-shared weights, or applying fully connected layers before pooling either violate permutation invariance or fail to be universal. The parameter-sharing structure $\lambda I + \gamma \mathbf{1}\mathbf{1}^{\mathsf{T}}$ for equivariant outputs underpins not only Deep Sets architectures but has also motivated subsequent frameworks such as graph neural networks (which aggregate via neighbor summations) and point-cloud networks (e.g., PointNet).

In summary, the Deep Sets theorem establishes that permutation symmetry is a necessary and sufficient condition for universal set processing, with immediate consequences for deep network design in set-based learning tasks (Zaheer et al., 2017).

References

1. Zaheer, M., Kottur, S., Ravanbakhsh, S., Póczos, B., Salakhutdinov, R., and Smola, A. J. (2017). Deep Sets. Advances in Neural Information Processing Systems 30.
