InterTransformer: Universal Measure-to-Measure Model
- InterTransformer is defined as a Transformer-style architecture that evolves arbitrary probability measures on the unit sphere via time-dependent self-attention and nonlinear interactions.
- It employs a continuity equation with explicit parameterization to jointly match multiple source and target measures in a mathematically rigorous framework.
- The architecture integrates disentanglement, clustering, and interpolation phases to ensure universal approximation and flexible adaptation to diverse input distributions.
The InterTransformer is a Transformer-style architecture formalizing neural networks as measure-to-measure maps, generalizing the standard point-to-point paradigm and providing explicit universal approximation capabilities for flows between probability measures defined on the unit sphere. Unlike classical architectures, which operate on fixed-size vector inputs, the InterTransformer evolves empirical or arbitrary measures via a continuity equation with a time-dependent vector field driven by self-attention and nonlinear interactions. The explicit construction realizes joint matching of any finite collection of source and target measures by parameterizing the flow to implement ensemble transport under minimal structural assumptions.
1. Mathematical Definition and Continuity Flow
Operating on the unit sphere $\mathbb{S}^{d-1} \subset \mathbb{R}^d$, the InterTransformer is parameterized by a time-dependent control $\theta(t) = \big(Q(t), K(t), V(t), W(t), A(t), b(t)\big)$.
Given an initial measure $\mu_0 \in \mathcal{P}(\mathbb{S}^{d-1})$, the flow is governed by the continuity equation:
$$\partial_t \mu_t + \operatorname{div}\big(\mathcal{X}_t[\mu_t]\,\mu_t\big) = 0, \qquad t \in (0, T),$$
where the vector field at $x \in \mathbb{S}^{d-1}$ is specified as
$$\mathcal{X}_t[\mu](x) = \mathsf{P}_x\Big(\Gamma_t[\mu](x) + W(t)\,\mathrm{ReLU}\big(A(t)x + b(t)\big)\Big),$$
with $\mathsf{P}_x = I_d - xx^\top$ the projection onto the tangent space $T_x\mathbb{S}^{d-1}$ and the self-attention map
$$\Gamma_t[\mu](x) = \frac{\int e^{\langle Q(t)x,\,K(t)y\rangle}\,V(t)y\,\mathrm{d}\mu(y)}{\int e^{\langle Q(t)x,\,K(t)y\rangle}\,\mathrm{d}\mu(y)}.$$
The solution map at time $t$ is denoted $\Phi^\theta_t$, with $\mu_t = (\Phi^\theta_t)_\# \mu_0$.
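For concreteness, the vector field evaluated on an empirical measure can be sketched in NumPy. This is an illustrative sketch, not a reference implementation: it assumes a control of the form $\theta = (Q, K, V, W, A, b)$ combining a softmax self-attention term with a ReLU perceptron, each projected onto the tangent space of the sphere.

```python
import numpy as np

def tangent_project(x, v):
    """Project v onto the tangent space at x: P_x v = v - <x, v> x (for ||x|| = 1)."""
    return v - np.dot(x, v) * x

def attention_field(X, Q, K, V):
    """Self-attention drift for the empirical measure (1/n) sum_j delta_{x_j}:
    each row i is a softmax-weighted average of V x_j with logits <Q x_i, K x_j>."""
    logits = (X @ Q.T) @ (X @ K.T).T
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ (X @ V.T)

def perceptron_field(X, W, A, b):
    """ReLU perceptron component W ReLU(A x + b), applied row-wise."""
    return np.maximum(X @ A.T + b, 0.0) @ W.T

def vector_field(X, Q, K, V, W, A, b):
    """Full field: attention plus perceptron, projected to the sphere at each particle."""
    F = attention_field(X, Q, K, V) + perceptron_field(X, W, A, b)
    return np.stack([tangent_project(x, f) for x, f in zip(X, F)])
```

By construction every output row is tangent to the sphere at the corresponding particle, which is what keeps the flow on $\mathbb{S}^{d-1}$.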
2. Particle-Based Realization and Empirical Measures
For discrete empirical measures,
$$\mu_0 = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i(0)},$$
the InterTransformer realizes the standard Transformer layer as the dynamical system:
$$\dot{x}_i(t) = \mathcal{X}_t[\mu_t]\big(x_i(t)\big), \qquad \mu_t = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i(t)}, \qquad i = 1, \dots, n.$$
As $n \to \infty$, the law of large numbers yields the mean-field PDE description above, demonstrating that Transformers admit a natural formulation as maps between measures. This framework encompasses both empirical (token-based) and arbitrary input measures, unifying practical implementations with theoretical analysis.
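The particle realization can be sketched with forward-Euler steps retracted to the sphere. The sketch below uses the bare self-attention dynamics with $Q = K = V = I_d$ and illustrative values of $\beta$ and the step size; it exhibits the well-known clustering behavior of these dynamics for particles in a spherical cap.

```python
import numpy as np

def transformer_step(X, beta=1.0, dt=0.1):
    """One forward-Euler step of the self-attention particle system
    (Q = K = V = I for simplicity), retracted back onto the unit sphere."""
    logits = beta * (X @ X.T)                               # <x_i, x_j>
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                       # row-wise softmax
    drift = w @ X                                           # attention average
    drift -= np.sum(drift * X, axis=1, keepdims=True) * X   # tangent projection P_x
    Xn = X + dt * drift
    return Xn / np.linalg.norm(Xn, axis=1, keepdims=True)   # retraction to the sphere

# usage: particles initialized in a spherical cap contract toward a common point
rng = np.random.default_rng(0)
X = np.array([1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=(20, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
spread_before = float(np.ptp(X @ X.T))   # 1 - smallest pairwise inner product
for _ in range(200):
    X = transformer_step(X)
spread_after = float(np.ptp(X @ X.T))
```

The shrinking spread of pairwise inner products is the discrete analogue of the measure $\mu_t$ concentrating toward an atom.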
3. Explicit Measure-to-Measure Matching Construction
Given input-output measure pairs $(\mu^{(k)}_0, \nu^{(k)})$, $k = 1, \dots, N$, suppose each pair admits a measurable transport map $T^{(k)}$ with $T^{(k)}_\# \mu^{(k)}_0 = \nu^{(k)}$. Let $d \ge 2$, and require that all inputs and outputs are pairwise distinguishable (the source measures are mutually distinct, as are the targets). The Universal Measure-to-Measure Matching Theorem states:
For any $\varepsilon > 0$ and any $T > 0$, there exists a piecewise-constant parameter $\theta$ so that the unique solutions $\mu^{(k)}_t$ to the continuity equation with $\mu^{(k)}_{t=0} = \mu^{(k)}_0$ obey
$$W_2\big(\mu^{(k)}_T, \nu^{(k)}\big) \le \varepsilon \qquad \text{for all } k = 1, \dots, N.$$
The construction proceeds in three explicit phases:
| Phase | Parameters Activated | Function |
|---|---|---|
| Disentanglement | $Q, K, V$ (self-attention) | Drives means of all source measures to disjoint regions |
| Clustering | $W, A, b$ (perceptron) | Aggregates mass in each disentangled region into atoms |
| Interpolation | $W, A, b$ (perceptron) | Transports each atomic cluster to the desired targets |
Each stage uses piecewise-constant-in-time parameters, collectively forming a constructive scheme for simultaneous matching.
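The piecewise-constant scheme can be sketched as a schedule of (duration, vector field) phases integrated one after another. The two toy fields below, a collapse toward the north pole standing in for clustering and a drift to a target direction standing in for interpolation, are illustrative stand-ins, not the construction's actual attention and perceptron fields.

```python
import numpy as np

def run_schedule(X, schedule, dt=0.01):
    """Integrate particles on the sphere through piecewise-constant-in-time
    phases: each phase holds its vector field fixed for a given duration."""
    for duration, field in schedule:
        for _ in range(int(round(duration / dt))):
            V = field(X)
            V -= np.sum(V * X, axis=1, keepdims=True) * X   # tangent projection
            X = X + dt * V
            X /= np.linalg.norm(X, axis=1, keepdims=True)   # retract to sphere
    return X

# toy two-phase schedule: collapse toward the north pole ("clustering"),
# then drift the resulting near-atom toward a target ("interpolation");
# each field is constant in time within its phase
north = np.array([0.0, 0.0, 1.0])
target = np.array([0.0, 1.0, 0.0])
schedule = [
    (3.0, lambda X: np.tile(north, (len(X), 1))),
    (3.0, lambda X: np.tile(target, (len(X), 1))),
]

rng = np.random.default_rng(0)
X0 = np.array([1.0, 0.0, 0.0]) + 0.3 * rng.normal(size=(10, 3))
X0 /= np.linalg.norm(X0, axis=1, keepdims=True)
X1 = run_schedule(X0, schedule)
```

Concatenating phases this way is exactly what "piecewise-constant-in-time parameters" means: the control switches at phase boundaries and is frozen in between.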
4. Nonlinear Expressivity and Theoretical Significance
Crucial to the InterTransformer’s universality is the nonlinear dependence on the underlying measure $\mu$ in the self-attention module. Linear, measure-independent dynamics cannot disentangle overlapping input clouds: particles at the same location follow the same trajectory regardless of which measure they belong to. The self-attention term mediates nonlinear separation and clustering of distributions, while the ReLU perceptron component ensures universal approximation of vector fields on the sphere. Together these mechanisms achieve arbitrary measure-to-measure transport, conditional on the existence of a Monge map for each pair.
The necessity of the minimal transport-map assumption reflects the nature of Lipschitz flows: without a measurable pushforward map from $\mu^{(k)}_0$ to $\nu^{(k)}$, exact matching cannot be achieved. For instance, a flow map is a function, so it can never split a single atom into two.
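The role of measure dependence can be checked directly on a toy example: any field that ignores the measure assigns a point one velocity, so two clouds sharing a point stay entangled there, whereas the attention drift felt by that same point differs across clouds. The setup below (with $Q = K = V = I$ and an arbitrary $\beta$) is purely illustrative.

```python
import numpy as np

def attention_drift(X, i, beta=5.0):
    """Attention drift felt by particle i inside the empirical measure X:
    a softmax-weighted average of the particles (Q = K = V = I)."""
    w = np.exp(beta * (X @ X[i]))
    w /= w.sum()
    return w @ X

def random_cloud(rng, n=5):
    """n uniform-ish points on the unit sphere (normalized Gaussians)."""
    C = rng.normal(size=(n, 3))
    return C / np.linalg.norm(C, axis=1, keepdims=True)

rng = np.random.default_rng(0)
shared = np.array([0.0, 0.0, 1.0])   # a point in the support of both clouds
cloud_a = np.vstack([shared, random_cloud(rng)])
cloud_b = np.vstack([shared, random_cloud(rng)])

# a measure-independent field would give the shared point identical velocities
# in both clouds; the attention drift depends on the rest of the cloud instead
drift_a = attention_drift(cloud_a, 0)
drift_b = attention_drift(cloud_b, 0)
```

Because `drift_a` and `drift_b` differ, the two measures can be steered apart at the shared point, which is precisely what the disentanglement phase exploits.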
5. Variants, Adaptations, and Extensions
The underlying construction generalizes to settings beyond Wasserstein-2 distance: Kullback–Leibler matching is attainable using Pinsker’s inequality and total variation approximations, and ensemble transport with unequal numbers of atomic elements is approximable through discrete-discrete matching. Architectural variants—such as multi-head attention, alternative nonlinearities, or modifications to feed-forward widths—are compatible with the analytic framework, indicating flexibility in model design while preserving universal measure-matching capabilities; modifications may require trading network depth for width.
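The Kullback–Leibler route rests on Pinsker’s inequality, $\|\mu - \nu\|_{TV} \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(\mu \,\|\, \nu)}$, which a quick discrete sanity check illustrates (the distributions below are arbitrary random examples):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def tv(p, q):
    """Total variation distance, i.e. half the L1 distance."""
    return 0.5 * float(np.abs(p - q).sum())

rng = np.random.default_rng(0)
p = rng.random(8); p /= p.sum()
q = rng.random(8); q /= q.sum()

# Pinsker's inequality: TV(p, q) <= sqrt(KL(p || q) / 2)
bound = float(np.sqrt(kl(p, q) / 2.0))
```

Controlling total variation therefore suffices to control KL-based matching objectives up to this square-root relationship.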
6. Schematic Illustration and Ensemble Flow
The methodology is illustrated by evolving each empirical source measure through three chronological phases:
- Disentanglement: Self-attention moves each source measure $\mu^{(k)}_0$ into a separated cap around a point on $\mathbb{S}^{d-1}$.
- Clustering: Each cap collapses to an approximate point mass via the perceptron flow.
- Interpolation: Each cluster moves to the corresponding target $\nu^{(k)}$.
In the general case with multiple input-output pairs, the construction executes all three phases in parallel. Careful parameter reuse and synchronization yield a single time-dependent InterTransformer that matches all measure pairs concurrently. This demonstrates the capacity of Transformer-style flows to perform ensemble-level de-mixing, clustering, and targeted interpolation, substantiating their role as universal measure-to-measure operators rather than only pointwise function approximators (Geshkovski et al., 2024).