
InterTransformer: Universal Measure-to-Measure Model

Updated 13 December 2025
  • InterTransformer is defined as a Transformer-style architecture that evolves arbitrary probability measures on the unit sphere via time-dependent self-attention and nonlinear interactions.
  • It employs a continuity equation with explicit parameterization to jointly match multiple source and target measures in a mathematically rigorous framework.
  • The architecture integrates disentanglement, clustering, and interpolation phases to ensure universal approximation and flexible adaptation to diverse input distributions.

The InterTransformer is a Transformer-style architecture formalizing neural networks as measure-to-measure maps, generalizing the standard point-to-point paradigm and providing explicit universal approximation capabilities for flows between probability measures defined on the unit sphere. Unlike classical architectures, which operate on fixed-size vector inputs, the InterTransformer evolves empirical or arbitrary measures via a continuity equation with a time-dependent vector field driven by self-attention and nonlinear interactions. The explicit construction realizes joint matching of any finite collection of source and target measures by parameterizing the flow to implement ensemble transport under minimal structural assumptions.

1. Mathematical Definition and Continuity Flow

Operating on the unit sphere $S = \mathbb{S}^{d-1}$, the InterTransformer is parameterized by a time-dependent control

$$\theta = (A(t), B(t), W(t), C(t), b(t))_{t \in [0,T]} \in L^\infty\!\left([0,T]; (\mathbb{R}^{d \times d})^4 \times \mathbb{R}^d\right).$$

Given an initial measure $\mu_0 \in \mathcal{P}(S)$, the flow $\mu(t)$ is governed by the continuity equation

$$\partial_t \mu(t) + \nabla \cdot \left( \mu(t) \, v[\mu(t)] \right) = 0, \quad \mu(0) = \mu_0,$$

where the vector field $v[\mu](t,x)$ at $x \in S$ is specified as

$$v[\mu](t,x) = P_x \left( A(t)\,\mathscr{A}[\mu](t,x) + W(t)\,(B(t)x + b(t))_+ \right),$$

with $P_x = I - xx^\top$ the orthogonal projection onto the tangent space $T_x S$ and the self-attention map

$$\mathscr{A}[\mu](t,x) = \frac{\int_S e^{\langle A(t)x, x' \rangle}\, x' \, \mu(dx')}{\int_S e^{\langle A(t)x, x' \rangle}\, \mu(dx')}.$$

The solution map at time $T$ is denoted $\Phi^T_\theta : \mathcal{P}(S) \to \mathcal{P}(S)$, with $\Phi^T_\theta(\mu_0) = \mu(T)$.
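
As a concrete numerical sketch, the self-attention map $\mathscr{A}[\mu]$ for an empirical measure reduces to a softmax-weighted average of the atoms. The function below is illustrative only (the names and the identity choice of $A$ are assumptions, not from the paper):

```python
import numpy as np

def attention(X, A):
    """Self-attention map A[mu](x_i) for an empirical measure
    mu = (1/n) * sum_i delta_{x_i}, with atoms stored as the rows of X.
    Returns the softmax-weighted averages sum_j w_ij x_j, where
    w_ij is proportional to exp(<A x_i, x_j>)."""
    logits = (X @ A.T) @ X.T                      # logits[i, j] = <A x_i, x_j>
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)             # row-stochastic weights
    return W @ X                                  # attended points, shape (n, d)

# Atoms sampled on the unit sphere S^{d-1} (illustrative data)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = attention(X, np.eye(3))
print(Y.shape)  # (5, 3)
```

Since each output row is a convex combination of unit vectors, the attended points lie inside the unit ball, which is why the tangential projection $P_x$ is still needed to keep the dynamics on the sphere.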

2. Particle-Based Realization and Empirical Measures

For discrete empirical measures,

$$\mu_0 = \frac{1}{n}\sum_{i=1}^n \delta_{x_i(0)}, \qquad x_i(0) \in S,$$

the InterTransformer realizes the standard Transformer layer as the dynamical system:

$$\dot{x}_i(t) = v\left[ \mu(t) \right]\left(t, x_i(t)\right), \qquad \mu(t) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i(t)}.$$

As $n \to \infty$, the law of large numbers yields the mean-field PDE description above, demonstrating that Transformers admit a natural formulation as maps $\mu_0 \mapsto \mu(T)$ between measures. This framework encompasses both empirical (token-based) and arbitrary input measures, unifying practical implementations with theoretical analysis.
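
A minimal particle-level sketch of this dynamical system, assuming an explicit-Euler discretization with renormalization back onto the sphere after each step (the step scheme and parameter choices are illustrative, not the paper's):

```python
import numpy as np

def vector_field(X, A, B, W, b):
    """v[mu](x_i) = P_{x_i}( A * Attn(x_i) + W * relu(B x_i + b) ),
    with P_x = I - x x^T the projection onto the tangent space."""
    logits = (X @ A.T) @ X.T                     # <A x_i, x_j>
    logits -= logits.max(axis=1, keepdims=True)
    S = np.exp(logits)
    S /= S.sum(axis=1, keepdims=True)
    attn = S @ X                                 # self-attention averages
    V = attn @ A.T + np.maximum(X @ B.T + b, 0.0) @ W.T
    radial = np.sum(V * X, axis=1, keepdims=True)
    return V - radial * X                        # tangential projection

def flow(X, params, T=1.0, steps=200):
    """Explicit-Euler integration of the particle ODE, renormalizing
    the particles back onto the sphere after every step."""
    dt = T / steps
    for _ in range(steps):
        X = X + dt * vector_field(X, *params)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X

rng = np.random.default_rng(1)
X0 = rng.normal(size=(8, 3))
X0 /= np.linalg.norm(X0, axis=1, keepdims=True)
params = tuple(0.3 * rng.normal(size=(3, 3)) for _ in range(3)) + (np.zeros(3),)
XT = flow(X0, params)
print(np.allclose(np.linalg.norm(XT, axis=1), 1.0))  # True
```

Here the constant-in-time parameters stand in for one piecewise-constant segment of the control $\theta$; concatenating such segments gives the full time-dependent flow.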

3. Explicit Measure-to-Measure Matching Construction

Given $N$ input-output measure pairs $\{ (\mu_0^i, \mu_1^i) \}_{i=1}^N \subset \mathcal{P}(S) \times \mathcal{P}(S)$, suppose each pair admits a measurable transport map $T^i : S \to S$ with $T^i_\# \mu_0^i = \mu_1^i$. Let $d \geq 3$, and require that all inputs and outputs are pairwise distinguishable. The Universal Measure-to-Measure Matching Theorem states:

For any $T > 0$ and any $\varepsilon > 0$, there exists a piecewise-constant parameter $\theta \in L^\infty([0,T]; (\mathbb{R}^{d \times d})^4 \times \mathbb{R}^d)$ such that the unique solutions $\mu^i(t)$ to the continuity equation with $\mu^i(0) = \mu_0^i$ obey

$$\mathcal{W}_2 \left( \mu^i(T), \mu_1^i \right) < \varepsilon, \qquad i = 1, \dots, N.$$
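
The matching criterion can be checked numerically: for uniform empirical measures with equal numbers of atoms, $\mathcal{W}_2$ reduces to an optimal assignment over permutations. A brute-force sketch (adequate for small $n$; a practical implementation would use the Hungarian algorithm or an optimal-transport library):

```python
import itertools
import numpy as np

def w2_empirical(X, Y):
    """Wasserstein-2 distance between two uniform empirical measures with
    the same number n of atoms (rows of X and Y). For uniform weights the
    optimal coupling is a permutation, found here by brute force."""
    n = len(X)
    C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # squared costs
    best = min(sum(C[i, p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))
    return np.sqrt(best / n)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[0.0, 1.0], [1.0, 0.0]])
print(w2_empirical(X, Y))  # 0.0  (same measure, atoms listed in a different order)
```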

The construction proceeds in three explicit phases:

  • Disentanglement: activates the self-attention parameter $A(t)$; drives the means of all $N$ source measures into disjoint regions.
  • Clustering: activates the perceptron parameters $B(t), W(t), b(t)$; aggregates the mass in each disentangled region into atoms.
  • Interpolation: activates the perceptron parameters $B(t), W(t), b(t)$; transports each atomic cluster to the desired target.

Each stage uses piecewise-constant-in-time parameters, collectively forming a constructive scheme for simultaneous matching.
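
One way to see how the perceptron phases can operate: with $B = 0$ and a nonnegative bias, $W(B(t)x + b(t))_+ = Wb$ is a constant vector $c$, and the projected field $P_x c$ is the spherical gradient of $\langle c, x \rangle$, so every particle flows toward the pole $c/\lVert c \rVert$. This is a hypothetical parameter choice illustrating the clustering/interpolation mechanism, not the paper's exact schedule:

```python
import numpy as np

def flow_to_pole(X, c, dt=0.05, steps=400):
    """Flow particles (rows of X, on the unit sphere) under the constant
    field c projected onto the tangent space: P_x c = c - <c, x> x.
    Every particle not starting at -c/|c| converges to the pole c/|c|."""
    for _ in range(steps):
        V = c - (X @ c)[:, None] * X          # tangential projection of c
        X = X + dt * V
        X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
z = np.array([0.0, 0.0, 1.0])
XT = flow_to_pole(X, z)
print(np.min(XT @ z))   # close to 1: all particles clustered near the pole z
```

Running such constant-field segments in sequence, with a different target pole per disentangled region, is the intuition behind collapsing each region to an atom and then transporting it.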

4. Nonlinear Expressivity and Theoretical Significance

Crucial to the InterTransformer’s universality is the nonlinear dependence on $\mu$ in the self-attention module $\mathscr{A}[\mu]$. Linear dynamics cannot disentangle overlapping input clouds. The self-attention term mediates nonlinear separation and clustering of distributions, while the ReLU perceptron component ensures universal approximation of maps on the sphere. These mechanisms jointly provide the ability to achieve arbitrary measure-to-measure transport, conditional on the existence of a Monge map for each pair.

The necessity of the transport map assumption reflects the nature of Lipschitz flows: the flow itself pushes each $\mu_0^i$ forward through a map, so if no pushforward map $T^i$ from $\mu_0^i$ to $\mu_1^i$ exists, exact matching cannot be achieved.

5. Variants, Adaptations, and Extensions

The underlying construction generalizes to settings beyond Wasserstein-2 distance: Kullback–Leibler matching is attainable using Pinsker’s inequality and total variation approximations, and ensemble transport with unequal numbers of atomic elements is approximable through discrete-discrete matching. Architectural variants—such as multi-head attention, alternative nonlinearities, or modifications to feed-forward widths—are compatible with the analytic framework, indicating flexibility in model design while preserving universal measure-matching capabilities; modifications may require trading network depth for width.
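
A quick numeric check of the Pinsker route mentioned above, which bounds total variation by KL divergence ($\mathrm{TV} \le \sqrt{\mathrm{KL}/2}$) for discrete distributions (the example distributions are arbitrary):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats, for strictly
    positive discrete distributions p and q."""
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):
    """Total variation distance between discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.5, 0.25, 0.25])
# Pinsker's inequality: TV(p, q) <= sqrt(KL(p || q) / 2)
print(tv(p, q) <= np.sqrt(kl(p, q) / 2))  # True
```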

6. Schematic Illustration and Ensemble Flow

The methodology is illustrated by evolving each empirical source measure through three chronological phases:

  • Disentanglement: Self-attention moves each $\mu_0^i$ into separated caps around points $z_i$ on $S$.
  • Clustering: Each cap collapses to an approximate point mass via perceptron flow.
  • Interpolation: Each cluster moves to the corresponding $\mu_1^i$.

In the general case with multiple input-output pairs, the construction executes all three phases in parallel. Careful parameter reuse and synchronization yield a single time-dependent InterTransformer that matches all $N$ measure pairs concurrently. This demonstrates the capacity of Transformer-style flows to perform ensemble-level de-mixing, clustering, and targeted interpolation, substantiating their role as universal measure-to-measure operators rather than only pointwise function approximators (Geshkovski et al., 2024).
