InterTransformer: Universal Measure-to-Measure Model

Updated 13 December 2025

InterTransformer is defined as a Transformer-style architecture that evolves arbitrary probability measures on the unit sphere via time-dependent self-attention and nonlinear interactions.
It employs a continuity equation with explicit parameterization to jointly match multiple source and target measures in a mathematically rigorous framework.
The architecture integrates disentanglement, clustering, and interpolation phases to ensure universal approximation and flexible adaptation to diverse input distributions.

The InterTransformer is a Transformer-style architecture formalizing neural networks as measure-to-measure maps, generalizing the standard point-to-point paradigm and providing explicit universal approximation capabilities for flows between probability measures defined on the unit sphere. Unlike classical architectures, which operate on fixed-size vector inputs, the InterTransformer evolves empirical or arbitrary measures via a continuity equation with a time-dependent vector field driven by self-attention and nonlinear interactions. The explicit construction realizes joint matching of any finite collection of source and target measures by parameterizing the flow to implement ensemble transport under minimal structural assumptions.

1. Mathematical Definition and Continuity Flow

Operating on the unit sphere $S = \mathbb{S}^{d-1}$ , the InterTransformer is parameterized by a time-dependent control

$\theta = (A(t), B(t), W(t), C(t), b(t))_{t \in [0,T]} \in L^\infty([0,T]; (\mathbb{R}^{d \times d})^4 \times \mathbb{R}^d).$

Given an initial measure $\mu_0 \in \mathcal{P}(S)$ , the flow $\mu(t)$ is governed by the continuity equation:

$\partial_t \mu(t) + \nabla \cdot \left( \mu(t) \, v[\mu(t)] \right) = 0, \quad \mu(0) = \mu_0,$

where the vector field $v[\mu](t,x)$ at $x \in S$ is specified as

$v[\mu](t,x) = P_x \left( A(t)\mathscr{A}[\mu](t,x) + W(t)(B(t)x + b(t))_+ \right),$

with $P_x = I - xx^\top$ the projection onto $T_x S$ and the self-attention map

$\mathscr{A}[\mu](t,x) = \frac{\int_S e^{\langle A(t)x, x' \rangle}x' \mu(dx')}{\int_S e^{\langle A(t)x, x' \rangle}\mu(dx')}.$

The solution map at time $T$ is denoted $\Phi^T_\theta : \mathcal{P}(S) \to \mathcal{P}(S)$ , with $\Phi^T_\theta(\mu_0) = \mu(T)$ .

2. Particle-Based Realization and Empirical Measures

For discrete empirical measures,

$\mu_0 = \frac{1}{n}\sum_{i=1}^n \delta_{x_i(0)}, \quad x_i(0) \in S,$

the InterTransformer realizes the standard Transformer layer as the dynamical system:

$\dot{x}_i(t) = v\left[ \mu(t) \right]\left(t, x_i(t)\right), \quad \mu(t) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i(t)}.$

As $n \to \infty$ , law of large numbers yields the mean-field PDE description above, demonstrating that Transformers admit a natural formulation as maps $\mu_0 \mapsto \mu(T)$ between measures. This framework encompasses both empirical (token-based) and arbitrary input measures, unifying practical implementations and theoretical analysis.

3. Explicit Measure-to-Measure Matching Construction

Given $N$ input-output measure pairs $\{ (\mu_0^i, \mu_1^i) \}_{i=1}^N \subset \mathcal{P}(S) \times \mathcal{P}(S)$ , suppose each pair admits a measurable transport map $T^i : S \to S$ with $T^i_\# \mu_0^i = \mu_1^i$ . Let $d \geq 3$ , and require that all inputs and outputs are pairwise distinguishable. The Universal Measure-to-Measure Matching Theorem states:

For any $T > 0$ and any $\varepsilon > 0$ , there exists a piecewise-constant parameter $\theta \in L^\infty([0,T]; (\mathbb{R}^{d \times d})^4 \times \mathbb{R}^d)$ so that the unique solutions $\mu^i(t)$ to the continuity equation with $\mu^i(0) = \mu_0^i$ obey

$\mathcal{W}_2 \left( \mu^i(T), \mu_1^i \right) < \varepsilon, \qquad i = 1, ..., N.$

The construction proceeds in three explicit phases:

Phase	Parameters Activated	Function
Disentanglement	$A(t)$ (self-attention)	Drives means of all $N$ source measures to disjoint regions
Clustering	$B(t), W(t), b(t)$ (perceptron)	Aggregates mass in each disentangled region to atoms
Interpolation	$B(t), W(t), b(t)$ (perceptron)	Transports each atomic cluster exactly to the desired targets

Each stage uses piecewise-constant-in-time parameters, collectively forming a constructive scheme for simultaneous matching.

4. Nonlinear Expressivity and Theoretical Significance

Crucial to the InterTransformer’s universality is the nonlinear dependence on $\mu$ in the self-attention module $\mathscr{A}[\mu]$ . Linear dynamics cannot disentangle overlapping input clouds. The self-attention term mediates non-linear separation and clustering of distributions, while the ReLU perceptron component ensures universal approximation of maps on the sphere. These mechanisms jointly provide the ability to achieve arbitrary measure-to-measure transport, conditional on the existence of a Monge map for each pair.

The necessity of the minimal transport map assumption reflects the nature of Lipschitz flows; without a pushforward map $T^i$ from $\mu_0^i$ to $\mu_1^i$ , exact matching cannot be achieved.

5. Variants, Adaptations, and Extensions

The underlying construction generalizes to settings beyond Wasserstein-2 distance: Kullback–Leibler matching is attainable using Pinsker’s inequality and total variation approximations, and ensemble transport with unequal numbers of atomic elements is approximable through discrete-discrete matching. Architectural variants—such as multi-head attention, alternative nonlinearities, or modifications to feed-forward widths—are compatible with the analytic framework, indicating flexibility in model design while preserving universal measure-matching capabilities; modifications may require trading network depth for width.

6. Schematic Illustration and Ensemble Flow

The methodology is illustrated by evolving each empirical source measure through three chronological phases:

Disentanglement: Self-attention moves each $\mu_0^i$ into separated caps around points $z_i$ on $S$ .
Clustering: Each cap collapses to an approximate point mass via perceptron flow.
Interpolation: Each cluster moves to the corresponding $\mu_1^i$ .

In the general case with multiple input-output pairs, the construction executes all three phases in parallel. Careful parameter reuse and synchronization yield a single time-dependent InterTransformer that matches all $N$ measure pairs concurrently. This demonstrates the capacity of Transformer-style flows to perform ensemble-level de-mixing, clustering, and targeted interpolation, substantiating their role as universal measure-to-measure operators rather than only pointwise function approximators (Geshkovski et al., 2024).

Markdown Report Issue Upgrade to Chat

References (1)

Measure-to-measure interpolation using Transformers (2024)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to InterTransformer.