Generalization Bounds and Statistical Guarantees for Multi-Task and Multiple Operator Learning with MNO Networks

Published 2 Apr 2026 in cs.LG | (2604.01961v1)

Abstract: Multiple operator learning concerns learning operator families $\{G[\alpha]:U\to V\}_{\alpha\in W}$ indexed by an operator descriptor $\alpha$. Training data are collected hierarchically by sampling operator instances $\alpha$, then input functions $u$ per instance, and finally evaluation points $x$ per input, yielding noisy observations of $G[\alpha]u$. While recent work has developed expressive multi-task and multiple operator learning architectures and approximation-theoretic scaling laws, quantitative statistical generalization guarantees remain limited. We provide a covering-number-based generalization analysis for separable models, focusing on the Multiple Neural Operator (MNO) architecture: we first derive explicit metric-entropy bounds for hypothesis classes given by linear combinations of products of deep ReLU subnetworks, and then combine these complexity bounds with approximation guarantees for MNO to obtain an explicit approximation-estimation tradeoff for the expected test error on new (unseen) triples $(\alpha,u,x)$. The resulting bound makes the dependence on the hierarchical sampling budgets $(n_\alpha,n_u,n_x)$ transparent and yields an explicit learning-rate statement in the operator-sampling budget $n_\alpha$, providing a sample-complexity characterization for generalization across operator instances. The structure and architecture can also be viewed as a general purpose solver or an example of a "small" PDE foundation model, where the triples are one form of multi-modality.

Authors (2)

Summary

The paper establishes the first generalization error bounds for MNO networks, rigorously decomposing approximation and estimation errors with hierarchical sampling.
It develops explicit covering number bounds for separable deep architectures and quantifies how operator instance sampling ($n_\alpha$) critically influences learning rates.
The results guide optimal data acquisition strategies and emphasize network expressivity trade-offs, key for multi-task operator learning applications.

Introduction and Problem Setting

This paper addresses the statistical generalization behavior of multi-task and multiple operator learning, specifically for neural operator architectures of the Multiple Neural Operator (MNO) class. The central object is a family of operators $\{G[\alpha] : U \to V\}_{\alpha \in W}$, indexed by a descriptor $\alpha \in W$, each mapping input functions $u \in U$ to output functions in $V$, where $U$, $V$, and $W$ are Banach or function spaces. Data are generated hierarchically: operator instances $\alpha$ are sampled first, then input functions $u$ for each $\alpha$, and finally output evaluations are queried at specific points $x$, possibly with noise. This paradigm encompasses parametric integral operators, solution operators for parameterized PDEs, and more general cases where the descriptor $\alpha$ encodes symbolic or textual task information.

MNOs instantiate operator families as neural networks with a separable product structure. For generalization analysis, the challenge is to derive explicit expected error bounds for predictions on new, unseen triples $(\alpha,u,x)$, quantifying the transfer to new operators, new functions, and new query locations.

Main Theoretical Contributions

Covering Numbers and Metric Entropy for Separable Architectures

The first technical layer establishes explicit upper bounds on the metric entropy (in the covering-number sense) of hypothesis spaces of MNO/separable form, that is, linear combinations of products of subnetworks in $\alpha$, $u$, and $x$, where each subnetwork is a deep ReLU network with specified width, depth, sparsity, and parameter bounds. The analysis shows that the covering number can be controlled explicitly in terms of these network parameters, the number of summands, and the admissible parameter range.
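To make the objects concrete, the following is a schematic sketch rather than the paper's statement. The symbols $K$, $c_k$, $a_k$, $b_k$, $\tau_k$, $P$, $B$, and $L$ are illustrative assumptions, and the exact indexing of the MNO expansion in the paper may differ. The first line shows a generic separable hypothesis; the second is the standard covering bound for any class whose $P$ parameters lie in $[-B,B]$ and whose parameter-to-function map is $L$-Lipschitz from $\ell_\infty$ to the sup norm.

```latex
% Schematic separable hypothesis (illustrative notation, not the paper's):
f_\theta(\alpha, u, x) \;=\; \sum_{k=1}^{K} c_k \, a_k(\alpha)\, b_k(u)\, \tau_k(x)

% Generic parametric covering bound under the stated Lipschitz assumption:
\log N\bigl(\delta, \mathcal{F}, \|\cdot\|_\infty\bigr)
\;\le\; P \,\log \Bigl\lceil \frac{B L}{\delta} \Bigr\rceil
```

The second line follows by covering the cube $[-B,B]^P$ with $\ell_\infty$-balls of radius $\delta/L$. Bounds of this kind grow with the parameter count and only logarithmically in $1/\delta$, which is the qualitative behavior the explicit network-dependent bounds refine.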

Generalization Bound and Explicit Learning Rate in $n_\alpha$

The main result is an explicit generalization error bound for empirical risk minimizers over MNO hypothesis classes. The key theorem states that, for any prescribed target accuracy and covering scale, the expected test error decomposes into:

an approximation term controlled by MNO expressivity,
estimation terms governed by the metric entropy of the hypothesis class,
explicit dependence on the data sample sizes $(n_\alpha, n_u, n_x)$.

The dominant learning rate in the operator-sampling regime is stated explicitly in terms of $n_\alpha$ and the effective (intrinsic) dimension of the operator-descriptor space $W$. This result points to a "bottleneck" caused by the complexity of generalizing to unseen operator instances $\alpha$.
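For orientation, covering-number arguments for least-squares empirical risk minimization with bounded outputs typically produce oracle inequalities of the following template shape. This is a generic sketch and not the paper's theorem: $n$ stands for a generic sample size, $G$ for the target, $\mathcal{F}$ for the hypothesis class, and constants and logarithmic factors are omitted. In the hierarchical setting the single $n$ is replaced by terms in $(n_\alpha, n_u, n_x)$.

```latex
\mathbb{E}\bigl[\text{test error}\bigr]
\;\lesssim\;
\underbrace{\inf_{f \in \mathcal{F}} \|f - G\|^2}_{\text{approximation}}
\;+\;
\underbrace{\frac{\log N(\delta, \mathcal{F})}{n}}_{\text{estimation}}
\;+\;
\underbrace{\delta}_{\text{covering scale}}
```

Enlarging the network shrinks the approximation term but inflates the metric entropy, so a rate is obtained by balancing the two against the available sample size.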

Sample-Complexity Interpretation

The covering-number-based bound exposes a transparent tradeoff between model approximation accuracy, network capacity, and the sample complexity at each hierarchical layer. Notably, the dominant sample-complexity term is in $n_\alpha$, i.e., the number of distinct operator instances, with parallel amortization over $n_u$ and $n_x$ for each $\alpha$. Excessive sampling in $u$ or $x$ cannot offset a paucity of operator instances.
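A classical analogy may help with intuition here. It is not the paper's bound: it is the textbook variance of a grand mean in a two-level random-effects model, where `var_alpha` and `var_u` are hypothetical between-group and within-group variances. The first term depends only on the number of groups, so it sets a floor that more within-group samples cannot lower.

```python
def two_level_variance(n_alpha, n_u, var_alpha=1.0, var_u=1.0):
    """Variance of the grand mean with n_alpha groups and n_u samples per group."""
    # Between-group term: shrinks only with the number of groups.
    # Within-group term: amortized over all n_alpha * n_u samples.
    return var_alpha / n_alpha + var_u / (n_alpha * n_u)


# Few operator instances, very many inputs per instance.
few_ops_many_inputs = two_level_variance(n_alpha=10, n_u=10_000)

# Many operator instances, few inputs per instance (fewer samples in total).
many_ops_few_inputs = two_level_variance(n_alpha=1_000, n_u=10)

print(few_ops_many_inputs, many_ops_few_inputs)
```

In this toy model the second configuration has the smaller variance despite using fewer total samples, mirroring the qualitative message about operator instances.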

Architectural and Analytical Framework

Separable Expansions and Network Structure

The theoretical framework analyzes hypothesis classes parameterized as linear combinations of products of independent deep subnetworks over $\alpha$, $u$, and $x$. The analysis is done in a "clipped" regime, enforcing bounded output to control metric entropy, with the approximation theory showing these clipped MNOs retain universal approximation properties for Lipschitz operator families.
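A minimal NumPy sketch of such a clipped separable predictor is given below. Everything in it is an illustrative assumption: the helper names (`init_mlp`, `relu_mlp`, `separable_predict`), the layer sizes, the single summation index, and the representation of $u$ by its values at `m_sensors` sensor points. The actual MNO indexing and discretization in the paper may differ.

```python
import numpy as np


def init_mlp(rng, sizes, bound=1.0):
    """Draw weights uniformly in [-bound, bound], mimicking a bounded-parameter class."""
    return [(rng.uniform(-bound, bound, (m, n)), rng.uniform(-bound, bound, n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def relu_mlp(params, z):
    """Apply a ReLU MLP given as a list of (W, b) pairs; the last layer is linear."""
    for W, b in params[:-1]:
        z = np.maximum(z @ W + b, 0.0)
    W, b = params[-1]
    return z @ W + b


def separable_predict(nets, coeffs, alpha, u_sensors, x, clip=1.0):
    """Clipped sum over k of c_k * a_k(alpha) * b_k(u) * t_k(x)."""
    a = relu_mlp(nets["alpha"], alpha)    # shape (K,)
    b = relu_mlp(nets["u"], u_sensors)    # shape (K,)
    t = relu_mlp(nets["x"], x)            # shape (K,)
    raw = np.sum(coeffs * a * b * t)
    # Clipping enforces the bounded output assumed in the entropy analysis.
    return float(np.clip(raw, -clip, clip))


rng = np.random.default_rng(0)
K, d_alpha, m_sensors, d_x = 4, 3, 16, 2   # hypothetical sizes
nets = {
    "alpha": init_mlp(rng, [d_alpha, 8, K]),
    "u": init_mlp(rng, [m_sensors, 8, K]),
    "x": init_mlp(rng, [d_x, 8, K]),
}
coeffs = rng.uniform(-1.0, 1.0, K)
y_hat = separable_predict(nets, coeffs,
                          rng.normal(size=d_alpha),
                          rng.normal(size=m_sensors),
                          rng.uniform(size=d_x))
print(y_hat)
```

Keeping the three subnetworks independent is what makes the class separable: each factor sees only one of $\alpha$, $u$, or $x$, and they interact only through the final product and sum.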

Hierarchical Sampling and Its Statistical Impact

The analysis formalizes the multi-level sampling procedure: operator instances $\alpha$ (possibly task-discrete or continuous), input functions $u$, and spatial points $x$. The bound decomposes the estimation error into contributions from each sampling layer, giving practitioners concrete guidance on which layer's data budget limits the error.
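The three-level sampling procedure can be sketched as a nested loop. The example below is a toy under stated assumptions: the scaling operator $G[\alpha]u = \alpha\,u$, the sinusoidal input family, the 32-point sensor grid, and the function name `sample_hierarchical` are all hypothetical choices made only to keep the sketch self-contained.

```python
import numpy as np


def sample_hierarchical(n_alpha, n_u, n_x, noise_std=0.01, seed=0):
    """Return a list of (alpha, u_on_grid, x, noisy_y) records."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 32)      # sensor grid used to discretize u
    records = []
    for _ in range(n_alpha):              # level 1: operator instances
        alpha = rng.uniform(0.5, 2.0)
        for _ in range(n_u):              # level 2: input functions per instance
            freq = rng.integers(1, 4)
            amp = rng.normal()
            u_vals = amp * np.sin(np.pi * freq * grid)
            for _ in range(n_x):          # level 3: query points per input
                x = rng.uniform(0.0, 1.0)
                # Toy operator: (G[alpha] u)(x) = alpha * u(x), observed with noise.
                clean = alpha * amp * np.sin(np.pi * freq * x)
                y = clean + noise_std * rng.normal()
                records.append((alpha, u_vals, x, y))
    return records


data = sample_hierarchical(n_alpha=3, n_u=4, n_x=5)
print(len(data))
```

The dataset has `n_alpha * n_u * n_x` records, but only `n_alpha` distinct operator instances, which is why samples sharing an $\alpha$ are not independent draws of the full triple.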

The framework directly generalizes single-operator learning bounds for DeepONet, FNO, and similar architectures, extending them to the multi-task regime. Comparison with works such as "A Deep Learning Framework for Multi-Operator Learning: Architectures and Approximation Theory" (Weihs et al., 29 Oct 2025) and those on PDE foundation models [sun2025foundation, liu2024prosefd] highlights the added complexity of the MNO hierarchy, while the rigorous bounds derived here support the empirical findings of those works.

Universal approximation and expressivity scaling, as previously developed for DeepONet and its kin [liu2024neuralscalinglawsdeep, Kovachki2021], are shown to be compatible with the statistical regime considered here. The results yield rates that can be contrasted with the minimax rates for nonparametric function estimation but are adapted to the compositional structure and multi-level nature of operator families.

Strong Claims and Contrasts

The paper makes several explicit, non-trivial claims:

First generalization bound for multiple-operator learning with deep neural operators in terms of covering numbers.
Explicit separation of approximation and estimation errors, with dominant sample complexity dictated by the operator descriptor space.
For MNOs, increasing $n_\alpha$ is essential and cannot be compensated for asymptotically by increasing $n_u$ or $n_x$.

These claims highlight fundamental differences from approaches that do not exploit operator-conditional structure (e.g., independent per-task training).

Implications for Theory and Practice

Practical Considerations

The results deliver sharp sample-complexity guidance for practitioners developing foundation models for scientific simulation, inverse problems, or multimodal operator learning. In particular:

Hierarchical multi-task training should emphasize coverage/diversity in operator space; training "wider" on tasks is more statistically valuable than "deeper" per-task.
MNO and similar architectures may be viewed as a class of "small foundation models" for PDEs, bridging classical numerical analysis (e.g., Green’s functions, kernel methods) and modern representation learning.

Theoretical Outlook

The analysis suggests several theoretical directions:

Refinement via localized or data-dependent complexity measures may improve on the presented global metric entropy control.
Extension of the covering number/metric entropy framework to attention-based or transformer-based operator learners could corroborate or challenge these findings for non-separable architectures.
The learning rate being (poly-)logarithmic in $n_\alpha$ is sub-optimal (as compared to minimax rates for finite-dimensional parametric families), but is intrinsic to the high-dimensional and function-valued nature of the operator learning problem.

Prospects for AI and Operator Learning

As multi-task and foundation models for physical simulation proliferate, theoretical underpinnings such as those provided in this work are essential for understanding the limits of generalization, rather than solely empirical performance. The formal sample-complexity expressions derived here can inform both architectural design and data acquisition prioritization strategies in scientific machine learning pipelines.

A further implication concerns promptable or conditional PDE solvers: explicit conditioning on interpretable operator descriptors, in a way consistent with the multi-level metric entropy bounds, is essential to ensure non-degenerate out-of-distribution generalization in operator space.

Conclusion

This paper provides a rigorous statistical learning theory for multiple-operator neural networks, notably MNOs, in hierarchically sampled multi-task regimes. It derives explicit estimation/approximation tradeoffs, covering number bounds for separable deep architectures, and dominant learning rates for generalization to unseen operators. The analysis clarifies both the promise and limitations of current architectures for large-scale, multi-task scientific foundation modeling, providing a baseline for future architectural and theoretical improvements in operator-based machine learning (2604.01961).
