
Learning Without Training

Published 20 Feb 2026 in cs.LG and stat.ML | (2602.17985v1)

Abstract: Machine learning is at the heart of managing the real-world problems associated with massive data. With the success of neural networks on such large-scale problems, more research in machine learning is being conducted now than ever before. This dissertation focuses on three different projects rooted in mathematical theory for machine learning applications. The first project deals with supervised learning and manifold learning. In theory, one of the main problems in supervised learning is that of function approximation: that is, given some data set $\mathcal{D}=\{(x_j,f(x_j))\}_{j=1}^{M}$, can one build a model $F\approx f$? We introduce a method which aims to remedy several of the theoretical shortcomings of the current paradigm for supervised learning. The second project deals with transfer learning, which is the study of how an approximation process or model learned on one domain can be leveraged to improve the approximation on another domain. We study such liftings of functions when the data is assumed to be known only on a part of the whole domain. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related. The third project is concerned with the classification task in machine learning, particularly in the active learning paradigm. Classification has often been treated as an approximation problem as well, but we propose an alternative approach leveraging techniques originally introduced for signal separation problems. We introduce theory to unify signal separation with classification and a new algorithm which yields competitive accuracy to other recent active learning algorithms while providing results much faster.

Summary

  • The paper presents a unified approximation-theoretic framework that bypasses traditional ERM by constructing direct data-driven approximations.
  • It introduces a novel constructive method on unknown manifolds using spherical harmonics to yield localized error bounds without iterative optimization.
  • The work extends its methodology to transfer learning and classification via support estimation, achieving competitive accuracy with reduced labeling costs.

Learning Without Training: A Theoretical and Algorithmic Paradigm Shift in Machine Learning

Introduction

The dissertation "Learning Without Training" (2602.17985) introduces a unified approximation-theoretic framework for a spectrum of machine learning problems—supervised learning, transfer learning, and classification. The author posits that contemporary ML practice, while empirically successful, is stunted by its reliance on existential (non-constructive) approximation guarantees and empirical risk minimization (ERM) procedures that obscure the connection between mathematical smoothness, data geometry, and algorithmic performance. Through a sequence of mathematically rigorous constructions, the work proposes and analyzes algorithms that “learn without training”—circumventing optimization by constructing direct, explicit approximations from data.

Critique of Current Supervised Learning Paradigm

A central thesis of the dissertation is a comprehensive critique of ERM as an operationalization of supervised learning. Standard approaches select a hypothesis space $V_n$, perform empirical risk minimization relative to a global loss functional (e.g., MSE), and fit the finite training data—typically via iterative optimization. Theoretical justifications often rest on universal approximation and degree of approximation results (see, e.g., Barron and Sobolev space rates), but these results are existential and assume knowledge of the data domain and the smoothness of $f$. Moreover, performing optimization in high-dimensional, noisy settings faces issues of convergence, instability, and sensitivity to initialization, and global loss minimization is insensitive to local artifacts of the target function.

Figure 1: A depiction of the standard supervised learning paradigm. The universe of discourse $\mathcal{X}$ is assumed to contain a target function $f$, and hypothesis spaces $V_n$ are judiciously chosen based on the algorithm of choice. $P^\#$ denotes the empirical risk minimizer, $\tilde{P}$ denotes the minimizer of the generalization error, and $P^*$ denotes the best approximation.

The author demonstrates that explicit, data-dependent construction can yield practical rates of approximation that are fundamentally more informative than global, existential bounds. For example, degree-of-approximation results relying on unavailable information such as best-approximation coefficients can mislead practical model design.
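
For contrast, the ERM pipeline described above can be sketched in a few lines. This is an illustrative toy (the linear hypothesis space and noiseless target are simplifying assumptions, not from the dissertation), meant only to show the iterative training loop that the constructive methods below avoid:

```python
import numpy as np

# Toy ERM pipeline: choose a hypothesis space, then minimize the empirical
# MSE by gradient descent (illustrative; not the dissertation's code).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 + 2.0 * x                                # noiseless target f(x) = 1 + 2x

Phi = np.column_stack([np.ones_like(x), x])      # hypothesis space V_n: lines
w = np.zeros(2)                                  # initialization
lr = 0.1                                         # learning rate (step size)
for _ in range(1000):                            # the iterative training loop
    grad = 2.0 * Phi.T @ (Phi @ w - y) / len(y)  # gradient of the empirical MSE
    w -= lr * grad

mse = np.mean((Phi @ w - y) ** 2)                # empirical risk after training
```

Even in this benign setting, the answer arrives only through many small parameter updates; the constructions below replace this loop with a single explicit formula.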

Constructive Approximation on Manifolds Without Optimization

The dissertation’s first technical core is a constructive approximation method on unknown manifolds. Building from multivariate trigonometric and spherical harmonics theory, the method projects (possibly noisy) data from an unknown $q$-dimensional submanifold of $\mathbb{R}^Q$ (or, equivalently, $\mathbb{S}^Q$) onto the sphere, then forms a kernel-based interpolant directly from samples without optimization or manifold learning:

$$F_{n}(\mathcal{D}; x) = \frac{1}{M}\sum_{j=1}^{M} z_j \Phi_{n,q}(x \cdot y_j)$$

where $\Phi_{n,q}$ is a highly localized, polynomial kernel based on spherical harmonics and $n$ is a degree parameter. The key theoretical result gives explicit, high-probability, non-asymptotic error bounds in terms of the data manifold's dimension and the local smoothness of $f$:

$$\|F_{n}(\mathcal{D}; \cdot) - f\|_{\mathbb{X}} \lesssim \left( \|z\| + \|f\|_{W_\gamma} \right) \left( \frac{\log M}{M} \right)^{\gamma/(q+2\gamma)}$$

for $f \in W_\gamma(\mathbb{X})$. Notably, only the dimension $q$ is required as prior information; no manifold learning or tangent space estimation is performed. The method automatically produces locally adaptive error in regions of diverse smoothness.
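
The "construct, don't train" principle behind $F_n$ can be illustrated in one dimension. In the sketch below, the Fejér kernel on the circle stands in for the localized spherical-harmonic kernel $\Phi_{n,q}$; the kernel choice, the equispaced sampling, and the noiseless labels are simplifying assumptions, not the dissertation's construction:

```python
import numpy as np

# Training-free estimator on the circle: F(x) = (1/M) sum_j z_j K_n(x - theta_j),
# with the Fejer kernel K_n as a stand-in for the localized kernel Phi_{n,q}.
def fejer_kernel(t, n):
    """K_n(t) = sum_{|k|<n} (1 - |k|/n) e^{ikt}, written as a cosine sum."""
    k = np.arange(1, n)
    return 1.0 + 2.0 * np.sum((1.0 - k / n) * np.cos(np.outer(t, k)), axis=1)

def construct_estimator(theta_samples, z_samples, n):
    """Return F directly from the data -- no loss, no optimizer, no training."""
    def F(x):
        x = np.atleast_1d(x)
        K = np.array([fejer_kernel(x - tj, n) for tj in theta_samples])  # (M, |x|)
        return (K * z_samples[:, None]).mean(axis=0)
    return F

M, n = 200, 20
theta = 2.0 * np.pi * np.arange(M) / M   # equispaced samples (an assumption)
z = np.sin(theta)                        # noiseless labels z_j = f(theta_j)
F = construct_estimator(theta, z, n)
```

Accuracy is governed entirely by the degree parameter n and the sample size M; for this target the average reproduces (1 - 1/n) sin x exactly, so the uniform error is 1/n.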

Figure 2: A depiction of a new machine learning paradigm, where one constructs an approximation $\sigma_n$ in the space $V_n$ directly from the data. This is done in such a way that one can also measure a direct reconstruction error from the approximation to the target function.

Numerical experiments showcase sharp localization of errors to function singularities and superior percentiles of pointwise error compared to RBF and Nadaraya-Watson estimators, even when global RMS errors are comparable (see Figure 3).


Figure 3: Error comparison between our method, the Nadaraya-Watson estimator, and an interpolatory RBF network. (Left) Comparison of absolute errors between the methods, with the target function plotted on the right y-axis for the benefit of the viewer. The error from the RBF method is scaled by $10^{-3}$.

Localized Transfer Learning via Joint Data Spaces

The second major contribution extends the approximation-theoretic paradigm to a principled treatment of transfer learning. Here, transfer is modeled as the explicit lifting of a function from a source (base) manifold to a target manifold, potentially via known correspondences (e.g., landmark points, operator SVDs). A joint kernel is constructed respecting the geometry and spectral properties of both spaces:

$$\Phi_n(H, \Xi_1, \Xi_2; x_1, x_2) = \sum_{j,k} H\left(\frac{\ell_{j,k}}{n}\right) A_{j,k}\, \phi_{1,j}(x_1)\, \phi_{2,k}(x_2)$$

where $(\lambda_{i,j}, \phi_{i,j})$ are spectral triples and $A_{j,k}$ are connection coefficients. The main theorem specifies on which regions of the target space the lifted function is defined from data available only in a region of the source space, and quantifies how source smoothness $\gamma$ and geometric compatibilities control target-space smoothness $\gamma - Q + q_2$.

This framework is illustrated through a detailed analysis of Jacobi polynomial expansions and their transplants, linking ML transfer learning to inverse problems such as limited angle tomography and Radon inversion.
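
A toy version of the joint kernel can be written down directly. In the sketch below, cosine bases on two circles stand in for the spectral systems $\Xi_1, \Xi_2$, and the joint degrees $\ell_{j,k}$, the connection coefficients $A_{j,k}$ (taken diagonal), and the low-pass filter $H$ are all illustrative assumptions rather than the dissertation's choices:

```python
import numpy as np

def H(t):
    """Smooth low-pass filter: 1 on [0, 1/2], 0 beyond 1, tapered in between."""
    s = np.clip(2.0 * np.asarray(t, dtype=float) - 1.0, 0.0, 1.0)
    return 1.0 - s**2 * (3.0 - 2.0 * s)           # smoothstep taper

def joint_kernel(x1, x2, n):
    """Phi_n(x1, x2) = sum_{j,k} H(l_{jk}/n) A_{jk} phi_{1j}(x1) phi_{2k}(x2)."""
    total = 0.0
    for j in range(n + 1):
        for k in range(n + 1):
            l_jk = max(j, k)                      # assumed joint degree of (j, k)
            a_jk = 1.0 if j == k else 0.0         # assumed (diagonal) connection
            total += H(l_jk / n) * a_jk * np.cos(j * x1) * np.cos(k * x2)
    return total
```

With diagonal connection coefficients the kernel concentrates where the two coordinates agree, which is the mechanism by which data on the source space can inform values on the target space.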

Classification as Measure Support Estimation: The MASC Algorithm

The final part of the dissertation reconceptualizes classification as the problem of nonparametric support estimation for a mixture measure $\mu = \sum_k a_k \mu_k$, where each $\mu_k$ represents a (possibly overlapping) class-conditional distribution. Rather than approximating conditional expectations, the method seeks to partition the data into $K_\eta$ clusters via multiscale, kernel-based density estimation:

$$F_n(x) = \frac{1}{M}\sum_{j=1}^{M} \Psi_n(x, x_j)$$

with $\Psi_n(x,y) = \Phi_n(\rho(x, y))^2$ and localized trigonometric $\Phi_n$. Theoretical results guarantee that for proper threshold $\Theta$ and kernel scale $n$, the set

$$\mathcal{G}_n(\Theta) = \left\{ x : F_n(x) \geq \Theta \cdot \max_j F_n(x_j) \right\}$$

recovers a tight neighborhood of the true support $\mathbb{X}$, and, under a quantitative fine-structure property, yields clusters of minimal separation $\eta$ with vanishing overlap. This leads to competitive or superior F-scores in the limit. The accompanying Multiscale Active Super-resolution Classification (MASC) algorithm actively queries a minimal number of points to label clusters in a multiscale fashion, followed by nearest-neighbor extension.
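
The thresholding step can be sketched in one dimension. Below, a Gaussian kernel is a stand-in for the squared localized kernel $\Psi_n$, and the threshold $\Theta$ and bandwidth are illustrative choices:

```python
import numpy as np

# Support recovery by thresholded kernel density: G = {x : F(x) >= Theta * max F}.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.0, 0.1, 150), rng.normal(2.0, 0.1, 150)])

def density(x, samples, scale):
    """F(x) = (1/M) sum_j Psi(x, x_j), with a Gaussian proxy for Psi_n."""
    return np.exp(-((x[:, None] - samples[None, :]) / scale) ** 2).mean(axis=1)

grid = np.linspace(-4.0, 4.0, 801)
Fvals = density(grid, X, scale=0.2)
Theta = 0.5
G = grid[Fvals >= Theta * Fvals.max()]           # thresholded support estimate

# Clusters = maximal runs of retained grid points separated by a genuine gap.
step = grid[1] - grid[0]
gaps = np.where(np.diff(G) > 2.0 * step)[0]
num_clusters = len(gaps) + 1
```

The retained set G hugs the two true support components near x = -2 and x = 2; one label query per recovered cluster could then name it, mirroring MASC's active-labeling step.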

Empirical results (Figures 15–25) on synthetic data, document classification, and hyperspectral imaging demonstrate that MASC achieves competitive accuracy with markedly reduced labeling cost compared to existing active learning algorithms (LAND, LEND), and provides robust support recovery even in the presence of significant overlap.


Figure 4: This figure illustrates the result of applying MASC to a synthetic circle and ellipse data set. On the left are true labels of the given data, and on the right is the estimation attained by MASC.


Figure 5: Plots indicating the accuracy of MASC, LAND, and LEND for different query budgets, for both Salinas (left) and Indian Pines (right).

Numerical and Implementation Details

The constructive methods support fast, matrix-based implementations (kernel evaluations via Clenshaw’s algorithm) and scale linearly in the number of samples, with polynomial hyperparameters controlling locality and adaptiveness. Hyperparameters such as $n$ (kernel degree), $q$ (manifold dimension), and density thresholds are shown to directly trade off bias, variance, and computational cost. Select experiments highlight the substantial practical impact, both for regression and classification, and the strong out-of-sample extension properties.
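
Clenshaw's algorithm, mentioned above for kernel evaluation, computes a Chebyshev expansion S(x) = sum_k a_k T_k(x) with a backward recurrence in O(N) operations, never forming the polynomials T_k explicitly. A generic sketch (not the dissertation's code):

```python
import numpy as np

def clenshaw(a, x):
    """Evaluate S(x) = sum_{k=0}^{N} a[k] * T_k(x) by Clenshaw's recurrence."""
    b1 = b2 = 0.0
    for k in range(len(a) - 1, 0, -1):     # k = N, ..., 1
        b1, b2 = 2.0 * x * b1 - b2 + a[k], b1
    return a[0] + x * b1 - b2              # S(x) = a_0 + x*b_1 - b_2

def chebyshev_direct(a, x):
    """Reference evaluation via T_k(x) = cos(k * arccos(x)) for |x| <= 1."""
    return sum(ak * np.cos(k * np.arccos(x)) for k, ak in enumerate(a))
```

The recurrence touches each coefficient once, which is what makes repeated kernel evaluations cheap at prediction time.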

Implications and Theoretical Advances

This work rigorously bridges modern harmonic analysis and approximation theory with practical algorithmic designs for learning on high-dimensional, structured, or unknown domains. It provides a systematic inversion of the standard narrative: rather than using learning theory to motivate approximation, it uses constructive approximation theory to derive novel, efficient, and robust machine learning algorithms—thereby achieving learning without training.

Several claims in the dissertation run counter to prevailing practice:

  • Existence-based universal approximation is insufficient: Constructive, data-driven approximation yields explicit rates and local adaptivity that are not accessible via generic universality theorems.
  • Optimization (training) is not inevitable: For wide classes of high-dimensional regression and classification, explicit kernel-based constructions can attain optimal rates without iterative optimization, regularization, or model selection.
  • Separation of manifold learning and function approximation is not necessary: The presented methods bypass the need for manifold estimation or eigen-decomposition, requiring only the manifold’s dimension.
  • Classification need not rely on function approximation: Direct support recovery and measure partitioning can outperform standard likelihood- or regression-based approaches, especially in the presence of non-disjoint class supports.

Future Directions and Outlook

The dissertation opens multiple research avenues, including:

  • Further generalizations to arbitrary compact metric spaces and settings with unknown or variable regularity.
  • Operator approximation via representations induced by these encodings (see manifold operator learning).
  • Joint feature and function learning, extending the approach to automatic or adaptive feature discovery.
  • Practical deployment in large-scale or resource-constrained scenarios, enabling efficient on-device learning via direct approximation.

The approach delineated fundamentally reconfigures the theoretical and practical interface between approximation theory and machine learning, offering promising new methodologies for interpretable, robust, and efficient AI systems.


Clear, Simple Summary of “Learning Without Training”

What is this paper about?

This dissertation explores how to make machines learn from data without the usual long, complicated “training” process. Instead of relying on trial-and-error methods like gradient descent (which can be slow, unstable, and get stuck), it uses direct mathematical recipes to build accurate models quickly. It focuses on three areas:

  • Supervised learning on complicated, high‑dimensional data
  • Transfer learning (reusing what you learned in one place to help in another)
  • Fast, accurate classification with active learning (smartly choosing which labels to ask for)

What questions does it try to answer?

In simple terms, the paper asks:

  • Can we build good prediction models directly from data without heavy training?
  • How can we reuse what we learned on one kind of data to help with another (transfer learning)?
  • Can we classify data well by asking for only a few labels, and do it very fast?

Key Ideas and Approach (Explained Simply)

1) Learning without training: a direct recipe

Think of trying to predict something (like weather) from many inputs. The usual way is to pick a model (like a neural network) and “train” it by slowly tweaking numbers to reduce error. That can take a long time and may fail.

This work offers a different path: constructive approximation. Instead of training, it uses a ready‑made formula that averages nearby data points using a carefully designed “smoothing stencil” (called a kernel). It’s like taking a blurry picture on purpose—but with a smart blur that keeps important details and sharp edges where they matter. The recipe depends on how smooth the true function is near each point, so it adapts locally rather than using one global error score that can hide local problems.

Key takeaway: you can often compute a high‑quality predictor directly from the data using a special averaging formula—no iterative training loop required.

2) Beating the “curse of dimensionality” with manifolds

High‑dimensional data (many features) is hard to handle because you need tons of samples to cover all possibilities. But in real life, data often lies on a much simpler, curved surface inside that big space, called a manifold. For example, photos of a rotating object live on a low‑dimensional “surface” of possibilities.

This work uses the manifold idea to reduce complexity. It borrows tools from geometry and physics—especially the idea of heat flow on a surface. Imagine placing heat at a point on a surface and watching it spread out. That pattern tells you about the shape of the surface. Using related math (graph Laplacians and heat kernels), the author builds localized, shape‑aware averaging tools that work on the manifold the data actually lives on. This gives better approximations with fewer data points.
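
For the curious, the graph Laplacian mentioned above can be built in a few lines. This small sketch (an illustration, not the paper's code) forms an affinity matrix from pairwise distances between points on a circle and subtracts it from its row sums; the result behaves like a discrete Laplacian on the surface the data lives on:

```python
import numpy as np

# Build an (unnormalized) graph Laplacian from points sampled on a circle.
rng = np.random.default_rng(2)
t = rng.uniform(0.0, 2.0 * np.pi, 100)
pts = np.column_stack([np.cos(t), np.sin(t)])    # data on a circle in R^2

d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-d2 / 0.1)                            # heat-kernel-style affinity
L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian L = D - W

eigvals = np.linalg.eigvalsh(L)                  # ascending; smallest is ~0
```

The smallest eigenvalue is (numerically) zero, with the constant vector as its eigenvector, just as the true Laplacian of a surface annihilates constant functions.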

3) Transfer learning as “lifting” between surfaces

Transfer learning asks: if I know something on one surface (manifold), can I use it to help on another? The paper treats this as lifting a function from one surface to another, like mapping a weather pattern from one map to a new map with a different projection. It studies:

  • Where on the target surface this lifting is possible if you only see part of the data
  • How the smoothness (how “wiggly” the function is) changes under the lift

It even connects this to famous inverse problems (like reconstructing an image from its X‑ray scans), showing that transfer learning and these inverse tasks share the same underlying math.

4) Fast classification via “signal separation” and active learning

Imagine a music track with two instruments mixed together. Signal separation tries to pull them apart. The paper treats classification similarly: each class is like an instrument, and data from each class comes from its own “region” in space. Using this viewpoint, the author designs a method for active learning: instead of labeling everything, you smartly ask for a few labels that teach you the most. The result is a new algorithm that reaches accuracy similar to other modern methods but much faster.

What did the paper find?

  • Direct, training‑free formulas can be just as accurate as trained models in many cases.
    • They come with guarantees: the error shrinks at a known rate that depends on how smooth the true function is and how many data points you have.
    • They are “local”: they keep sharp features where needed instead of blurring everything to please a global error score.
    • They avoid common training problems like getting stuck in bad solutions, picking learning rates, or deciding when to stop.
  • On manifolds, these methods handle high‑dimensional data more efficiently.
    • Using heat‑kernel and Laplacian ideas, they build approximations that respect the data’s true shape.
    • This helps dodge the curse of dimensionality and needs fewer samples for good accuracy.
  • For transfer learning, the paper pinpoints where and how functions can be carried from one surface to another.
    • It clarifies the relationship between the function’s smoothness on the source and target.
    • It links transfer learning to classic inverse problems, opening new ways to reuse solutions.
  • For active learning in classification, the new signal‑separation‑inspired algorithm is competitive in accuracy and significantly faster than many recent methods.

Why is this important?

  • Speed and stability: Skipping long training makes models faster to build and less fragile.
  • Better use of data: Local, geometry‑aware methods can capture fine details and work well with fewer samples.
  • Smarter reuse: The transfer learning viewpoint helps move knowledge between tasks more reliably, especially when data is incomplete.
  • Cheaper labeling: The active learning approach reduces how many labels you need without losing accuracy.

Bottom line

This dissertation shows that you can often “learn without training” by using smart mathematical constructions. By respecting the local smoothness of data and the low‑dimensional surfaces it lives on, you can build accurate, fast, and reliable predictors; transfer knowledge between tasks; and classify data with far fewer labels—all while avoiding many pitfalls of standard training.

Knowledge Gaps

The following list highlights what remains missing, uncertain, or unexplored based on the provided dissertation text. Each point is framed to be concrete and actionable for future work.

  • The constructive trigonometric approximation framework assumes periodic domains (T^d) and uniform sampling (marginal distribution equal to the Lebesgue probability measure on T^d); how to extend the theory and guarantees (e.g., Theorem 3.5 and bound ||\tilde{\sigma}_n - f||_\infty \lesssim n^{-\gamma}) to non-periodic domains, unknown supports, manifolds with boundaries, and non-uniform sampling distributions.
  • The discretized reconstruction operator \tilde{\sigma}_n(x) = (1/M) \sum_{j=1}^M y_j \Phi_n(x - x_j) is analyzed under noiseless labels; rigorous high-probability error bounds under label noise y_j = f(x_j) + \epsilon_j (including sub-Gaussian and heavy-tailed noise) are not provided, nor are noise-robust variants of the operator.
  • The dependence of sample complexity M \gtrsim n^{d+2\gamma} \log n on dimension d and smoothness \gamma is stated without explicit constants or practical guidance; data-driven procedures to select n (bandwidth) and M (sample size) adaptively to unknown local smoothness and heterogeneous sampling densities are missing.
  • The choice of the smoothing function h in the kernel \Phi_n and its impact on approximation quality, stability, localization, and computational complexity is not characterized; criteria or automated selection strategies for h are not developed.
  • The “good approximation” bound E_n(f) \le ||f - \sigma_n(f)|| \lesssim E_{n/2}(f) is stated globally, but a formal local-approximation theory (spatially adaptive rates that depend on the local smoothness or singularity structure of f) is not established beyond the qualitative example with f(\theta) = |\cos \theta|^{1/4}.
  • Robustness of the constructive approach to misspecification of the sampling distribution (e.g., clustered or highly non-uniform x_j), outliers, and adversarial noise is not analyzed; weighted or preconditioned versions of \tilde{\sigma}_n to correct for sampling biases remain an open design question.
  • The integral identity used to represent trigonometric expansions as neural networks (via e^{i k \cdot x} expressed with activation \phi) is a theoretical bridge, but finite, discrete, training-free network constructions (depth/width bounds, parameter quantization, and hardware-friendly architectures) implementing \tilde{\sigma}_n are not specified.
  • There is no empirical evaluation comparing the constructive approximation method against standard ERM-trained neural networks across real datasets (accuracy, runtime, memory), nor ablations that isolate the contribution of periodicity, kernel choice, and sampling distribution.
  • The dimension-independent existence bound on spheres (e.g., ||f - \sum_{k=1}^N a_k |x \cdot y_k||| \lesssim N^{-(d+3)/(2d)}) contrasts with dimension-dependent constructive bounds (\lesssim N^{-2/d}), but a pathway to close this gap with constructive, data-driven methods achieving dimension-independent rates is not articulated.
  • For spherical constructions requiring quadrature nodes exact for certain polynomial degrees, algorithms to obtain such nodes from scattered data and conditions under which approximate quadratures suffice (with quantified degradation in rates) are not provided.
  • The manifold learning discussion highlights sensitivity of the two-step pipeline (manifold estimation → function approximation) to parameters and noise, but does not provide a training-free function approximation method that bypasses explicit manifold estimation while preserving guarantees, nor guidelines to select diffusion scales, neighborhood radii, or eigen-truncation levels with provable risk control.
  • Assumptions such as Gaussian upper bounds on heat kernels and finite speed of propagation are invoked for localized kernels; practical verification procedures for these assumptions on unknown manifolds from finite samples (including curved, non-compact, and boundary manifolds) are not given.
  • The transfer learning project (lifting functions between manifolds, with partial data) lacks concrete algorithms and conditions for when the lifting is well-defined (e.g., identifiability and invertibility of the lifting operator), stability bounds under sampling and label noise, and precise characterizations of target subsets where lifting is possible.
  • Relationships between local smoothness of f and its lifted counterpart (regularity transfer laws) are described qualitatively; precise theorems, rates, and counterexamples for different manifold geometries and sampling regimes are not provided.
  • For inverse problems (e.g., inverse Radon transform) linked to transfer learning, there are no explicit sample complexity results, noise stability bounds, or guarantees under partial angular/radial coverage; algorithms bridging classical inversion formulas with data-driven lifting operators need development.
  • The classification via active learning and signal separation analogy (supports of class distributions as “point sources”) is introduced, but the assumptions required (e.g., separability conditions, mixture models, support geometry) and formal performance guarantees (sample/query complexity, label efficiency, consistency) are not specified.
  • The active learning query strategy (how to select points to query the oracle f for maximal information) is not detailed; stopping criteria with theoretical guarantees, robustness to label noise, extension to multi-class and imbalanced settings, and scalability in high dimensions remain open.
  • The critique of ERM and global loss functionals (insensitivity to local artifacts) motivates local methods, but a principled local risk functional, optimization-free estimator, or hybrid approach with explicit generalization bounds is not developed.
  • The analysis of gradient descent shortcomings (local minima, dead-on-arrival, false stabilization) lacks proposed remedies integrated with the dissertation’s constructive approach (e.g., initialization schemes, training-free alternatives, certified stopping criteria) and formal convergence guarantees for specific architectures and losses.
  • The nonlinear width lower bounds underscore the curse of dimensionality, but the dissertation does not characterize function classes (e.g., compositional, sparse, low-rank, or manifold-plus-sparsity) where training-free constructive methods can achieve dimension-free or improved widths with explicit algorithms and guarantees.
  • Methods to estimate or learn smoothness parameters (e.g., \gamma in W_\gamma) from data and to adapt reconstruction bandwidths n locally (spatial bandwidth selection) are not provided; designing adaptive estimators with oracle inequalities remains an open problem.
  • Computational aspects (time/space complexity, parallelization, GPU/TPU implementation) for the constructive operators and manifold kernels are not analyzed; practical pipelines and engineering considerations to make “learning without training” scalable to modern datasets are not addressed.

Glossary

  • Active Learning: A learning paradigm where the algorithm can query an oracle for labels to maximize information gain. "Active learning incorporates ideas from both unsupervised and supervised learning."
  • Atlas: A collection of local coordinate charts that cover a manifold, used to perform computations locally. "One approach is to estimate an atlas of the manifold, which thereby allows function approximation to be conducted via local coordinate charts."
  • Banach space: A complete normed vector space often used as the setting for function approximation. "We assume that f belongs to some class of functions called the universe of discourse X (typically a Banach space),"
  • Barron Space: A function space characterized by integrability of the Fourier transform with polynomial weight, used in neural network approximation theory. "We say that f: \mathbb{R}^d\to \mathbb{R} belongs to the Barron Space with parameter s>0, denoted by B_s, if it satisfies the following norm condition"
  • Best approximation: The element of a hypothesis space closest to the target function in a chosen norm. "The best approximation, P^*=\argmin_{P\in V_n}E_n(f), is the model from V_n that minimizes the degree of approximation."
  • Chebyshev expansion: A series representation of a function in terms of Chebyshev polynomials, used for approximation. "For example, a shifted average of the partial sums of the Chebyshev expansion of f can be used in the uniform approximation case (p=\infty)."
  • Classification: A task where the target function outputs discrete class labels. "In classification problems, the function f is discrete, taking on only some finite set of values called class labels."
  • Curse of dimensionality: The phenomenon where the complexity or data requirements grow exponentially with dimension. "First, we examine a phenomenon known as the curse of dimensionality."
  • Dead on arrival: An initialization pathology where a neural network outputs a constant due to poor parameter initialization. "The phenomenon has become known as dead on arrival."
  • Degree of approximation: The minimal distance between a function and a hypothesis space under a norm. "The degree of approximation is defined to be the least possible distance from V_n to f."
  • Diffusion geometry: A framework using diffusion processes (e.g., heat kernels) to analyze geometric structure in data. "The special issue \cite{achaspissue} of Applied and Computational Harmonic Analysis (2006) provides a great introduction on diffusion geometry."
  • Diffusion maps (Dmaps): A manifold learning method based on diffusion processes to embed high-dimensional data. "diffusion maps (Dmaps) \cite{coifmanlafondiffusion}"
  • Empirical risk: A loss computed on the available dataset that approximates the expected loss. "Instead, one typically seeks to find a minimizer of the empirical risk, which is a discretized version of the generalization error based on the data."
  • Empirical risk minimization: The paradigm of training models by minimizing empirical loss over the dataset. "Both questions are essential to the performance of machine learning algorithms trained by empirical risk minimization"
  • Eigendecomposition: Decomposition of an operator into eigenvalues and eigenfunctions/vectors, used for manifold analysis. "It has been shown that the so-called graph Laplacian (and the corresponding eigendecomposition) constructed from data points converges to the manifold Laplacian and its eigendecomposition"
  • Exclusive-OR function: A nonlinearly separable boolean function used to test universality of models. "it is known that if \phi(t)=t, then there is no network which can reproduce even the exclusive-OR function."
  • False stabilization: Apparent convergence of an optimization process that later continues changing. "this runs into the issue of false stabilization, or reaching a point where the iterations seem to converge to a point but in actuality will continue changing given enough iterations."
  • Fourier coefficients: Integrals defining the frequency components of a function on the torus. "The Fourier coefficients of a function f\in L^1(\mathbb{T}^d) are defined by"
  • Fourier projection: The projection of a function onto trigonometric polynomials via its Fourier coefficients. "The best approximation in the sense of the global L^2 norm is given by the Fourier projection, defined by"
  • Gaussian upper bounds: Bounds on the heat kernel exhibiting Gaussian decay, linked to wave propagation properties. "equivalent to the so called Gaussian upper bounds on the heat kernels."
  • Generalization error: The expected loss over the true data distribution, not just the training set. "The primary approach to select a model from V_n is to introduce the notion of a generalization error, which is given as a loss functional"
  • Graph Laplacian: A discrete Laplacian constructed from data points that approximates the manifold Laplacian. "It has been shown that the so-called graph Laplacian (and the corresponding eigendecomposition) constructed from data points converges to the manifold Laplacian and its eigendecomposition"
  • Gradient descent: An iterative optimization method that updates parameters in the negative gradient direction. "we will limit our discussion to a commonly used method known as gradient (or steepest) descent."
  • Heat kernel: The fundamental solution to the heat equation on a manifold, used for approximation and embeddings. "Another important tool is the theory of localized kernels based on the eigen-decomposition of the heat kernel."
  • Hessian locally linear embedding (HLLE): A variant of LLE that uses Hessian-based constraints for dimensionality reduction. "Hessian locally linear embedding (HLLE) \cite{david2003hessian}"
  • Hypothesis spaces: Families of functions chosen to model the target function. "then decide on some hypothesis spaces V_n of functions to model f by"
  • Isomaps: A nonlinear dimensionality reduction method preserving geodesic distances. "including Isomaps \cite{tenenbaum2000global}"
  • Kolmogorov function: An activation function that yields universal approximation with shallow networks. "Any activation function which yields a family of universal approximator neural networks is called a Kolmogorov function."
  • Laplace-Beltrami operator: The intrinsic Laplacian on a Riemannian manifold. "heat kernel corresponding to the Laplace-Beltrami operator on the manifold."
  • Laplacian eigenmaps (Leigs): A spectral embedding method using the graph Laplacian’s eigenvectors. "Laplacian eigenmaps (Leigs) \cite{belkinlaplacian}"
  • Learning rate: The step size parameter in iterative optimization updates. "where \eta is called the learning rate, or step size."
  • Local tangent space alignment (LTSA): A manifold learning technique that aligns local tangent spaces. "local tangent space alignment (LTSA) \cite{zhang2004principal}"
  • Locally linear embedding (LLE): A manifold learning algorithm preserving local linear relationships. "locally linear embedding (LLE) \cite{roweis2000nonlinear}"
  • Localized kernels: Kernels concentrated around points used for local approximation and multi-resolution analysis. "the theory of localized kernels based on the eigen-decomposition of the heat kernel."
  • Manifold assumption: The hypothesis that data lie near a low-dimensional manifold in high-dimensional space. "This has become known as the manifold assumption."
  • Manifold learning: Techniques that infer manifold structure from data and use it for tasks like approximation. "The purpose of this section is to introduce a relatively new paradigm of manifold learning."
  • Marginal distribution: The distribution of a subset of variables (e.g., inputs) derived from a joint distribution. "Let the marginal distribution of the points {x_j} be \mu_d*."
  • Mean-squared-error (MSE): A common loss function measuring squared differences between predictions and targets. "Many choices can be used for such a loss functional, but perhaps the most common example is mean-squared-error (MSE), which is given as"
  • Maximum variance unfolding (MVU): A dimensionality reduction method equivalent to a semidefinite program. "maximum variance unfolding (MVU) which is also known as semidefinite programming (SDP)"
  • Moving least-squares: A local regression technique used for approximation on manifolds. "Approximations utilizing estimated coordinate charts have been implemented, for example, via deep learning \cite{cloninger-net,coifman_deep_learn_2015bigeometric,schmidt2019deep}, moving least-squares"
  • Nonlinear width: A measure of the best possible approximation error achievable by any continuous parameterization and reconstruction. "The nonlinear Lp width is defined by"
  • Oracle: A ground-truth function that can be queried at a cost to obtain labels in active learning. "we are also given f called an oracle in this context."
  • Quadrature formula: A numerical integration rule exact for polynomials up to a given degree. "which admit a quadrature formula exact for integrating spherical polynomials of a certain degree"
  • Radial basis function (RBF) networks: Networks using radially symmetric basis functions for approximation. "when one considers approximation by radial basis function (RBF) networks, it is observed in many papers (e.g., \cite{eignet})"
  • Radon transform: An integral transform mapping a function to its integrals over hyperplanes, appearing in inverse problems. "such as the inverse Radon transform"
  • Rectified linear unit (ReLU): A piecewise-linear activation function defined as max(0, x). "Another example is the popular choice of activation function known as the rectified linear unit (ReLU), defined by"
  • Regression: A task where the target function outputs continuous values, possibly with noise. "In regression problems the function f may take on any value on a continuum"
  • Remez algorithm: An iterative method to compute near-best polynomial approximations in the uniform norm. "the process may be aided by methods such as the Remez algorithm."
  • Semidefinite programming (SDP): Convex optimization over positive semidefinite matrices, used in MVU. "maximum variance unfolding (MVU) which is also known as semidefinite programming (SDP) \cite{weinberger2005nonlinear}"
  • Shallow neural network: A single-hidden-layer neural network used for function approximation. "A shallow neural network is a function approximation model typically taking the form"
  • Sigmoidal activation function: An S-shaped nonlinearity used in neural networks. "the function \phi(x)=\tanh(x) is a sigmoidal activation function"
  • Sobolev space: A function space with integrable derivatives up to a certain order. "The Sobolev space with parameters r,p on a set \mathbb{X}\subseteq \mathbb{R}d is defined as Wd_{r,p}(\mathbb{X})="
  • Stone-Weierstrass theorem: A theorem guaranteeing uniform approximation of continuous functions on a compact set by polynomials. "From the Stone-Weierstrass theorem, we know that if f\in \mathcal{X}, then for any \epsilon>0 there exists n and P\in V_n such that"
  • Supervised Learning: A paradigm where models learn from labeled data to predict outputs for unseen inputs. "Supervised Learning: The main goal of supervised learning is to generate a model to approximate a function f on unseen data points."
  • Transfer learning: Leveraging knowledge from one domain to improve performance in another domain. "The second project deals with transfer learning, which is the study of how an approximation process or model learned on one domain can be leveraged to improve the approximation on another domain."
  • Trigonometric polynomials: Finite sums of complex exponentials used to approximate periodic functions. "We introduce the approximation of multivariate 2\pi-periodic functions by trigonometric polynomials."
  • Universal approximation property: The ability of a hypothesis family (e.g., networks) to approximate any continuous function on compact sets. "We say that a sequence of hypothesis spaces, {V_n}, satisfies a universal approximation property if"
  • Universe of discourse: The function space within which the target function is assumed to reside. "We assume that f belongs to some class of functions called the universe of discourse \mathcal{X}"
  • Variational modulus of smoothness: A quantity measuring function smoothness used in approximation bounds for multilayer networks. "In \cite{mhaskar-multilayer}, degree of approximation results were shown for multilayer neural networks in terms of a variational modulus of smoothness."
  • Wave propagation: The physical phenomenon whose finite speed property relates to kernel localization on manifolds. "finite speed of wave propagation."

Practical Applications

Immediate Applications

Below is a concise list of near-term, deployable uses that build on the paper’s constructive approximation, manifold-based methods, and fast active learning insights. Each item includes the sector(s), a brief description of how it would work in practice, and key assumptions/dependencies.

  • Training-free regression for periodic/time-angle signals
    • Sectors: signal processing, robotics, geoscience, telecom, energy
    • What: Use the constructive “good approximation” operator (σₙ) with localized trigonometric kernels to fit periodic or angular signals (e.g., phase, direction, cyclical time features) directly from sampled data—no gradient-based training.
    • Tools/workflow: Precompute kernel Φₙ; compute σ̃ₙ(x)=M⁻¹∑ⱼyⱼΦₙ(x−xⱼ); optionally wrap as a fixed-weight “network” for deployment; integrate into Python/R pipelines for fast, training-free regression.
    • Assumptions/dependencies: Target is (approximately) periodic or defined on tori/angles; mild smoothness on f; sample budget M scales with frequency budget n and effective dimension d via M ≳ n^{d+2γ} log n; data roughly cover the domain (no large holes).
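As a minimal sketch of this workflow, the snippet below implements the sampled operator σ̃ₙ(x) = M⁻¹∑ⱼ yⱼ Φₙ(x − xⱼ) on the circle. We substitute the classical Fejér kernel for Φₙ purely for illustration (the paper's localized kernels decay faster); the helper names, the bandwidth n, and the equispaced sample grid are our assumptions, not the paper's.

```python
import numpy as np

def fejer_kernel(t, n):
    # Fejer kernel F_n(t) = (1/(n+1)) * (sin((n+1)t/2) / sin(t/2))^2,
    # with the limiting value n+1 at t = 0 (mod 2*pi).
    t = np.asarray(t, dtype=float)
    num = np.sin((n + 1) * t / 2.0) ** 2
    den = np.sin(t / 2.0) ** 2
    return np.where(den < 1e-15, float(n + 1),
                    num / np.maximum(den, 1e-15) / (n + 1))

def sigma_tilde(x_eval, x_samples, y_samples, n):
    # Training-free estimator: sigma_n(x) = (1/M) * sum_j y_j * Phi_n(x - x_j).
    diffs = x_eval[:, None] - x_samples[None, :]
    return (fejer_kernel(diffs, n) * y_samples[None, :]).mean(axis=1)

# Example: recover cos(x) from 256 equispaced samples; the Fejer mean
# damps the first harmonic by the factor 1 - 1/(n+1) = 10/11 for n = 10.
x_s = 2 * np.pi * np.arange(256) / 256
y_hat = sigma_tilde(np.array([0.0, np.pi]), x_s, np.cos(x_s), 10)
```

With equispaced nodes the sample average is an exact quadrature for low-degree trigonometric polynomials, so `y_hat` reproduces the Fejér mean (10/11)·cos(x) at the two test points; no optimization loop is run at any stage.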
  • On-device, low-power inference with fixed networks (no training)
    • Sectors: embedded/IoT, mobile, robotics
    • What: Replace trained regressors with fixed-weight networks derived from σ̃ₙ, enabling deterministic, low-latency inference on constrained hardware.
    • Tools/workflow: Offline construction of weights via kernel discretization; deploy with quantization; integrate into microcontroller ML stacks (e.g., TensorFlow Lite Micro) since runtime is just linear ops.
    • Assumptions/dependencies: Domain structure is known or approximable (e.g., periodic coordinates or graph-based coordinates); sufficient coverage in the calibration dataset.
  • Fast active learning for classification via signal-separation principles
    • Sectors: remote sensing (hyperspectral), document classification, cybersecurity (anomaly/malware), manufacturing QA
    • What: Use the paper’s active learning algorithm that treats class supports analogously to signal sources, prioritizing queries that separate supports; competitive accuracy with markedly lower compute time.
    • Tools/workflow: Plug-in acquisition strategy for pool-based active learning loops (e.g., scikit-learn, PyTorch Lightning); deploy to reduce labeling costs in high-volume pipelines.
    • Assumptions/dependencies: Classes correspond to distinguishable supports in feature space; an oracle for labels is available; feature scaling/metric selection amplifies support separation.
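To make the pool-based loop concrete, here is a generic sketch. The acquisition rule (query the point whose two nearest labeled class "supports" are least separated) is a simplified stand-in of our own devising, not the paper's signal-separation algorithm; all function names are hypothetical.

```python
import numpy as np

def active_learning_loop(X_pool, oracle, init_idx, n_queries=20):
    # Pool-based active learning: start from a few labeled seeds, then
    # repeatedly query the oracle at the most ambiguous unlabeled point.
    labels = {int(i): oracle(X_pool[i]) for i in init_idx}
    for _ in range(n_queries):
        unlabeled = [i for i in range(len(X_pool)) if i not in labels]
        scores = []
        for i in unlabeled:
            # distance from candidate i to the nearest labeled point of each class
            d = {}
            for j, y in labels.items():
                d[y] = min(d.get(y, np.inf), np.linalg.norm(X_pool[i] - X_pool[j]))
            ds = sorted(d.values())
            scores.append(ds[1] - ds[0] if len(ds) > 1 else np.inf)
        q = unlabeled[int(np.argmin(scores))]  # smallest margin = most ambiguous
        labels[q] = oracle(X_pool[q])
    # classify the whole pool by nearest labeled neighbour
    idx = np.array(list(labels))
    dists = np.linalg.norm(X_pool[:, None] - X_pool[idx][None], axis=2)
    preds = np.array([labels[i] for i in idx])[np.argmin(dists, axis=1)]
    return labels, preds

# Toy usage: two well-separated 2-D blobs, oracle labels by sign of x-coordinate.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-3, 0], 0.4, (100, 2)), rng.normal([3, 0], 0.4, (100, 2))])
X = X[rng.permutation(200)]
oracle = lambda x: int(x[0] > 0)
init = [int(np.argmin(X[:, 0])), int(np.argmax(X[:, 0]))]
labels, preds = active_learning_loop(X, oracle, init, n_queries=15)
```

The loop touches the oracle only 17 times yet classifies all 200 pool points; swapping in a support-separation acquisition score changes one function, not the pipeline.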
  • Semi-supervised learning on graphs/manifolds using localized kernels
    • Sectors: healthcare (risk stratification using patient similarity graphs), recommender systems, social networks, e-commerce
    • What: Build label propagation/regression on graph Laplacians using heat-kernel-based localized frames; avoids heavy global loss minimization and adapts to manifold structure.
    • Tools/workflow: Construct kNN graph → compute (approximate) Laplacian eigensystem or heat kernel → apply localized kernel synthesis for interpolation; use libraries for spectral graph methods.
    • Assumptions/dependencies: Graph captures manifold geometry (e.g., meaningful affinity kernel); sufficient sampling and connectivity; Gaussian upper bounds (or practical proxies) hold approximately.
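The kNN-graph → heat-kernel → interpolation pipeline above can be sketched densely in a few lines. This is a toy version under stated assumptions (dense matrices, a Gaussian affinity with median-distance bandwidth, an exact eigendecomposition); production use would need sparse or Nyström-type spectral approximations, and the helper name is ours.

```python
import numpy as np

def heat_kernel_propagation(X, labels_known, t=5.0, k=8):
    # Propagate labels with the graph heat kernel exp(-t L).
    # X: (n, d) points; labels_known: dict {index: class id}.
    n = len(X)
    D2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    knn = np.argsort(D2, axis=1)[:, 1:k + 1]        # k nearest neighbours, self excluded
    sigma2 = np.median(D2[np.arange(n)[:, None], knn])   # median-kNN bandwidth (squared)
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, knn.ravel()] = np.exp(-D2[rows, knn.ravel()] / sigma2)
    W = np.maximum(W, W.T)                          # symmetrize the kNN graph
    L = np.diag(W.sum(1)) - W                       # combinatorial graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    H = (vecs * np.exp(-t * vals)) @ vecs.T         # heat kernel exp(-t L)
    classes = sorted(set(labels_known.values()))
    Y0 = np.zeros((n, len(classes)))                # one-hot seeds, zero elsewhere
    for i, c in labels_known.items():
        Y0[i, classes.index(c)] = 1.0
    return np.array(classes)[np.argmax(H @ Y0, axis=1)]

# Toy usage: two blobs, two labeled seeds per blob.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-4, 0], 0.5, (60, 2)), rng.normal([4, 0], 0.5, (60, 2))])
preds = heat_kernel_propagation(X, {0: 0, 1: 0, 60: 1, 61: 1})
```

Because exp(-tL) is entrywise nonnegative and block-diagonal across graph components, label mass diffuses only within each cluster, which is what makes the scheme adapt to manifold structure.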
  • Local-error-aware model diagnostics and post-processing
    • Sectors: software/MLOps, regulated industries (finance, healthcare)
    • What: Apply σₙ-based local approximations to assess and correct global-model artifacts (e.g., ringing near singularities or sharp transitions); improves reliability without retraining.
    • Tools/workflow: Post-hoc local smoothing/denoising with constructive kernels; targeted refinement near detected singularities; CI/CD hooks for model health checks.
    • Assumptions/dependencies: Access to residuals or validation data; local smoothness varies across domain; computational budget for local reconstructions.
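A minimal sketch of the post-hoc correction idea, assuming access to validation residuals: smooth the residuals with a localized kernel and apply the correction only where the estimated local error exceeds a threshold. The Nadaraya-Watson smoother, the bandwidth h, the threshold tau, and the helper name are all illustrative assumptions, not the paper's construction.

```python
import numpy as np

def local_residual_correction(x_eval, x_val, residuals, h=0.2, tau=0.1):
    # Kernel-smooth validation residuals (true - predicted) and return a
    # correction that is nonzero only where the local error is large.
    w = np.exp(-((x_eval[:, None] - x_val[None]) ** 2) / (2 * h * h))
    corr = (w * residuals[None]).sum(1) / np.maximum(w.sum(1), 1e-12)
    return np.where(np.abs(corr) > tau, corr, 0.0)

# Toy usage: a model that misses a step at x = 0.5 gets corrected near the
# step region and left untouched far from it.
x_val = np.linspace(0.0, 1.0, 50)
residuals = (x_val > 0.5).astype(float)          # model under-predicts by 1 for x > 0.5
out = local_residual_correction(np.array([0.95, 0.0]), x_val, residuals)
```

The base model is never retrained; the correction is a separate, locally supported additive term, which keeps the diagnostic auditable.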
  • Rapid baselines and AutoML components for high-dimensional problems
    • Sectors: enterprise AI platforms, consulting/analytics
    • What: Add “learning without training” baselines to AutoML for quick feasibility checks and as strong non-optimized benchmarks (especially where domain geometry is known or estimable).
    • Tools/workflow: Auto-select periodic/angular embeddings or construct graphs; choose n via held-out error; deploy as a baseline or ensemble component.
    • Assumptions/dependencies: Reasonable manifold/periodicity proxies; moderate sample sizes to estimate kernels reliably.
  • Limited-angle or sparse-view tomographic reconstruction (partial transfer)
    • Sectors: medical imaging (CT), industrial NDT, geophysics
    • What: Use the transfer-learning-as-lifting viewpoint to identify regions in the image domain where reliable inversion from partial Radon data is possible; guide reconstruction and uncertainty maps.
    • Tools/workflow: Map sinogram subsets → reconstruct only guaranteed regions; overlay confidence; integrate with existing iterative solvers to prioritize where physics permits stable lifting.
    • Assumptions/dependencies: Known forward operator (Radon), partial data patterns compatible with theoretical lifting domains; acceptance of region-restricted reconstructions in practice.


Long-Term Applications

These opportunities require additional research, scaling, or integration before mainstream deployment, but they flow naturally from the dissertation’s theoretical advances.

  • General-purpose “learning without training” libraries for arbitrary manifolds
    • Sectors: software tooling, scientific ML
    • What: A production-grade library that constructs localized kernels on unknown manifolds from data (via graph Laplacians/heat kernels) and performs training-free regression/classification with theoretical error guarantees.
    • Tools/products: Open-source package (e.g., “NoTrain”), automated graph construction, scalable spectral approximations (Nyström, randomized eigensolvers), hyperparameter selection for n, γ, and sample budgets.
    • Assumptions/dependencies: Robust, scalable estimation of manifold geometry under noise; validated Gaussian bounds or substitutes; guidelines for sample complexity in diverse domains.
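Of the scalability pieces listed above, the Nyström step is easy to illustrate: approximate the top eigenpairs of a large kernel matrix from a subset of landmark columns. The sketch below uses a Gaussian kernel and our own helper names; when the landmarks are the full point set it reduces exactly to the dense eigendecomposition.

```python
import numpy as np

def gaussian_kernel(A, B, s=1.0):
    # Gaussian affinity kernel between two point sets.
    D2 = ((A[:, None] - B[None]) ** 2).sum(-1)
    return np.exp(-D2 / (2 * s * s))

def nystrom_eigs(X, landmarks, r, s=1.0):
    # Nystrom approximation of the top-r eigenpairs of the full n x n kernel
    # matrix using only the n x m block of landmark columns.
    n, m = len(X), len(landmarks)
    Kmm = gaussian_kernel(X[landmarks], X[landmarks], s)   # m x m landmark block
    Knm = gaussian_kernel(X, X[landmarks], s)              # n x m cross block
    vals, U = np.linalg.eigh(Kmm)
    vals, U = vals[::-1][:r], U[:, ::-1][:, :r]            # top-r, descending
    approx_vals = (n / m) * vals                           # rescaled eigenvalues
    approx_vecs = np.sqrt(m / n) * Knm @ U / vals          # extended eigenvectors
    return approx_vals, approx_vecs

# Sanity check: with all points as landmarks the approximation is exact.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
approx_vals, _ = nystrom_eigs(X, np.arange(60), 3)
```

In practice m ≪ n landmarks (uniform or leverage-score sampled) give the cost reduction; the sanity check above only verifies that the formulas are consistent in the exact limit.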
  • Safe and certifiable transfer learning via manifold lifting
    • Sectors: healthcare (cross-site model adaptation), autonomous systems, finance (cross-market transfer)
    • What: Formalize and implement transfer procedures that only act on target subdomains where lifting is provably defined, with local smoothness-preserving guarantees—reducing negative transfer.
    • Tools/products: “Transfer eligibility maps” for target data; conservative adaptation workflows in MLOps; explainable transfer reports for regulators.
    • Assumptions/dependencies: Existence and identifiability of lifting maps; methods to estimate local smoothness and overlap between source/target manifolds; labeled or structured partial data.
  • Physics-guided inverse problem solvers with partial-data guarantees
    • Sectors: medical imaging, materials science, seismology
    • What: Embed lifting-based constraints and local smoothness relations into solvers for limited/contaminated measurements (e.g., limited-angle CT, sparse MRI), focusing reconstruction where theory guarantees stability.
    • Tools/products: Hybrid iterative solvers with lifting constraints; region-specific confidence intervals; operators integrated with hospital PACS/industrial QA systems.
    • Assumptions/dependencies: Accurate forward operators and noise models; validated mappings from measurement space to reconstruction subdomains.
  • Low-power, training-free edge AI co-processors
    • Sectors: semiconductors, IoT, wearables
    • What: Hardware accelerators specialized to fixed-kernel constructive inference (convolutions with data-derived kernels), enabling real-time edge analytics without cloud training or updates.
    • Tools/products: ISA extensions for kernel synthesis and evaluation; memory-efficient kernel storage; co-design of data collection and kernel construction pipelines.
    • Assumptions/dependencies: Stable kernel derivation offline; standardized representations across deployments; ecosystem and tooling to adopt new hardware capabilities.
  • Active learning for scientific discovery and experiment design
    • Sectors: materials discovery, biology, chemistry
    • What: Use support-separation principles to choose minimal, most-informative experiments (label queries) that delineate regimes or phases, accelerating discovery with fewer trials.
    • Tools/products: Lab-in-the-loop active learning platforms with acquisition strategies grounded in support separation; integration with ELNs and automation.
    • Assumptions/dependencies: Features reflect underlying physics (separable supports); reliable measurement oracles; mechanisms to update supports as new data arrives.
  • Privacy-preserving/federated constructive learning
    • Sectors: healthcare, finance, mobile platforms
    • What: Compute localized kernels and fixed models on-device or on-prem, then share only aggregated kernel summaries; avoids sharing raw data and bypasses training-phase leakage.
    • Tools/products: Federated protocols for sharing kernel moments or quadrature weights; differential privacy add-ons.
    • Assumptions/dependencies: Secure aggregation; theoretical understanding of privacy leakage from shared kernel summaries.
  • Robust ML that avoids global-loss pitfalls
    • Sectors: regulated AI, safety-critical systems
    • What: Replace or augment global-loss minimization with locally adaptive constructive methods that preserve important local features (edges, singularities, rare events).
    • Tools/products: Hybrid pipelines that switch to constructive approximations near detected irregularities; certification suites for local error bounds.
    • Assumptions/dependencies: Reliable detection of non-smooth regions; calibration of local vs global trade-offs; domain-specific validation.
  • Cross-modal and multi-sensor alignment via manifold lifting
    • Sectors: autonomous vehicles, remote sensing, AR/VR
    • What: View cross-modal alignment (e.g., LiDAR↔camera) as lifting between manifolds; design algorithms that identify subspaces of safe transfer and preserve local smoothness for fusion.
    • Tools/products: Sensor fusion modules with lifting eligibility checks; per-region confidence maps; failure-aware fallbacks.
    • Assumptions/dependencies: Sufficient overlapping structure between modality manifolds; synchronization and calibration quality; computationally tractable lifting estimators.
  • Curriculum and pedagogy: algorithmic alternatives to training
    • Sectors: education, workforce upskilling
    • What: Teach constructive approximation and manifold-based learning as practical alternatives to black-box training, highlighting sample complexity and local-error behavior.
    • Tools/products: Course modules, interactive notebooks, and lab assignments; benchmarking kits comparing training-free and trained models.
    • Assumptions/dependencies: Access to datasets with periodic/manifold structure; instructor familiarity with spectral/graph methods.

These applications collectively highlight a path toward faster, more explainable, and resource-efficient machine learning pipelines—particularly in settings where domain geometry can be leveraged and where global empirical-risk optimization is costly or brittle.

