Tensor Programs: Theory and Systems
- Tensor Programs are a unified abstraction that formalizes neural network computation and system-level optimizations for high-dimensional array processing.
- They model operations as sequential matrix multiplications, nonlinear activations, and moment computations to capture both analytic and implementation perspectives.
- This framework underpins advanced compilation, autotuning, and latency modeling, enabling rigorous analysis of wide networks and efficient cross-platform deployment.
Tensor programs are a foundational abstraction in both the mathematical theory of neural networks and the system-level engineering of efficient computation over high-dimensional arrays. Originating as a formal language for expressing both the computation and program transformations of neural architectures, the tensor program paradigm unifies the description of network forward/backward passes, operator scheduling, loop-level implementation, and system-level optimizations across a variety of heterogeneous hardware backends. This abstraction encompasses (1) the high-level algebraic computation as computation graphs or index notation, (2) the low-level imperative code with explicit loops and memory hierarchy, and (3) the formal analysis of training dynamics and infinite-width limits of wide neural networks using probabilistic and random-matrix-theoretic techniques.
1. Mathematical Formalism of Tensor Programs
A tensor program can be viewed as a finite sequence of statements acting on vectors (of length n), matrices (typically n × n, where n is the network width), and scalars, written in a "miniature assembly language" with three core instructions: (i) matrix-vector multiplications (MatMul), (ii) coordinatewise nonlinearities (Nonlin, or Nonlin⁺ supporting additional scalar parameters), and (iii) empirical averages or moments (Moment) (Yang, 2019, Yang, 2020, Yang et al., 2021). Every variable in the computation carries explicit shape information and, under standard neural scaling, weight matrices are initialized with i.i.d. Gaussian entries of variance scaling as 1/n (e.g., N(0, σ_w²/n)). The program expresses, in a compositional and inductive manner, the computation of both forward passes and backward (gradient) propagation in neural nets, including architectures with skip connections, attention, normalization, and beyond.
A typical Tensor Program for a deep network consists of a sequence of MatMul and Nonlin instructions, parameterized by learnable weights and inputs, corresponding directly to the layered computation in multilayer perceptrons, convolutional nets, recurrent nets, attention mechanisms, and compositions of standard modules (Yang, 2019, Yang, 2020).
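To make the instruction set concrete, the following is a minimal interpreter sketch in Python/NumPy. The tuple encoding of statements is a hypothetical illustration of the MatMul/Nonlin/Moment vocabulary, not a reference implementation from the papers.

```python
import numpy as np

def run_tensor_program(program, env):
    """Interpret a tensor program given as a list of instruction tuples.

    Hypothetical encoding: ("MatMul", out, W, x), ("Nonlin", out, phi, *xs),
    ("Moment", out, phi, *xs). `env` maps variable names to arrays/scalars.
    """
    for op, out, *args in program:
        if op == "MatMul":        # out := W @ x, W an n-by-n weight matrix
            env[out] = env[args[0]] @ env[args[1]]
        elif op == "Nonlin":      # out := phi applied coordinatewise
            phi, xs = args[0], [env[a] for a in args[1:]]
            env[out] = phi(*xs)
        elif op == "Moment":      # out := (1/n) sum_i phi(...)_i, a scalar
            phi, xs = args[0], [env[a] for a in args[1:]]
            env[out] = float(np.mean(phi(*xs)))
        else:
            raise ValueError(f"unknown instruction: {op}")
    return env

n = 2048                                            # network width
rng = np.random.default_rng(0)
env = {
    "x":  rng.standard_normal(n),                   # O(1) input coordinates
    "W1": rng.standard_normal((n, n)) / np.sqrt(n), # N(0, 1/n) entries
    "W2": rng.standard_normal((n, n)) / np.sqrt(n),
}
mlp = [                                             # two-layer MLP forward pass
    ("MatMul", "h1", "W1", "x"),
    ("Nonlin", "a1", np.tanh, "h1"),
    ("MatMul", "h2", "W2", "a1"),
    ("Moment", "m", lambda v: v ** 2, "h2"),        # second moment of outputs
]
print(run_tensor_program(mlp, env)["m"])            # concentrates as n grows
```

Running the program at increasing widths shows the Moment output concentrating around a deterministic limit, which is exactly the behavior the Master Theorem (Section 4) characterizes.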
The formalism enables precise characterization of the infinite-width asymptotics of such programs, initially showing convergence to Gaussian processes for randomly initialized networks, then supporting the derivation of Neural Tangent Kernels (NTK) and their universality across architectures (Yang et al., 2021, Yang, 2020).
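For intuition, consider a plain fully connected network with activation φ, input dimension d, and weight/bias variances σ_w², σ_b². One representative consequence of this analysis (stated here as a familiar special case, not the general Master Theorem) is the layerwise NN-GP kernel recursion:

$$
\Sigma^{1}(x, x') = \sigma_w^2\,\frac{x^\top x'}{d} + \sigma_b^2,
\qquad
\Sigma^{\ell+1}(x, x') = \sigma_w^2\,
\mathbb{E}_{(u,v)\sim\mathcal{N}(0,\,\Lambda^{\ell})}\big[\phi(u)\,\phi(v)\big] + \sigma_b^2,
$$

where $\Lambda^{\ell}$ is the 2 × 2 covariance matrix with entries $\Sigma^{\ell}(x,x)$, $\Sigma^{\ell}(x,x')$, and $\Sigma^{\ell}(x',x')$.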
2. System-Level Representation and Optimization
In systems and compilers, "tensor program" refers to the low-level specification and schedule of a computation over tensors, rendered as nested loops, operator DAGs, and explicit memory operations (Chen et al., 2018, Zheng et al., 2020, Wu et al., 16 Apr 2026). Each tensor program captures:
- The sequence of kernel computations (e.g., MatMul, convolution, elementwise ops) with explicit tensor shapes and dependence.
- The memory hierarchy (global, shared, register, or even device-specific memory spaces).
- The mapping from computational indices to parallel program structure (threads, warps, blocks in the case of GPUs).
- Transformation schedule: detailing loop tile sizes, ordering, parallelization, vectorization, data layout, and operator fusion (a minimal tiling sketch follows this list).
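As a concrete illustration of these schedule decisions (hand-written for exposition, not compiler-generated output), the sketch below contrasts a naive matmul loop nest with a tiled variant; the tile sizes TI, TJ, TK stand in for the parameters an autotuner would search over.

```python
import numpy as np

def matmul_naive(A, B):
    """Reference loop nest: C[i, j] += A[i, k] * B[k, j]."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

def matmul_tiled(A, B, TI=32, TJ=32, TK=32):
    """Same computation after a tiling transformation: each loop is split into
    (outer, inner) pairs so the inner block fits in fast memory. TI, TJ, TK
    are the schedule parameters a tuner searches over."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i0 in range(0, M, TI):
        for j0 in range(0, N, TJ):
            for k0 in range(0, K, TK):
                # Inner block: contiguous subarrays improve locality; on a GPU,
                # i0/j0 would map to blocks and the inner loops to threads.
                C[i0:i0+TI, j0:j0+TJ] += (
                    A[i0:i0+TI, k0:k0+TK] @ B[k0:k0+TK, j0:j0+TJ]
                )
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(matmul_tiled(A, B), A @ B)  # schedule changes order, not semantics
```

The assertion makes the key invariant explicit: a schedule transformation reorders the computation without changing its result.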
Compilers and superoptimizers organize this space hierarchically, representing the high-level algebra (computation DAG or AST), schedule-level tessellation and mapping, and hardware instantiation, including explicit mappings to device resources (Ding et al., 2022, Wu et al., 16 Apr 2026, Wu et al., 2024).
3. Optimization Methodologies for Tensor Programs
Optimization of tensor programs occurs at multiple layers:
- Symbolic Superoptimization: Tools such as Prism and Mirage define hierarchical or symbolic graph representations (sGraph, μGraph) that encode families of tensor programs, enable reasoning about algebraic and schedule transformations, and permit structured pruning of provably suboptimal configurations through symbolic dimension matching and algebraic rewriting (e.g., e-graph methods). Equivalence to the source program is verified through rewrite-based or probabilistic algorithms (Wu et al., 16 Apr 2026, Wu et al., 2024); a toy rewrite sketch follows this list.
- Auto-tuning and Cost Modeling: Systems like Ansor and ATiM define joint or hierarchical search spaces over tile sizes, memory mappings, and hardware-level parallelization. These systems fit surrogate cost models (boosted trees, neural nets) to predict runtime, and apply gradient-based or evolutionary search strategies to explore the configuration space subject to functional correctness (Chen et al., 2018, Zheng et al., 2020, Shin et al., 2024); see the autotuning sketch after this list.
- Logical and Physical Query Optimization: Platforms such as Galley abstract sparse tensor computations by first constructing a logical plan (an algebraic sequence of maps and aggregates), then mapping it to a physical plan that fixes concrete loop orders, data formats, and merge protocols, with cost models guiding both the logical and physical levels (Deeds et al., 2024).
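To illustrate the flavor of algebraic rewriting, the toy sketch below (a stand-in for the sGraph/μGraph machinery, not Prism's or Mirage's actual representation) explores matmul associativity on an expression tree and keeps the ordering with the lower FLOP count:

```python
# Toy algebraic rewriter: explores (A @ B) @ C vs A @ (B @ C) and keeps the
# cheaper ordering by FLOP count. Real superoptimizers use far richer rewrite
# sets, e-graphs, and formal equivalence checks.

def flops(expr, shapes):
    """Cost of an expression tree; leaves are names, internal nodes ('mm', l, r)."""
    if isinstance(expr, str):
        return 0, shapes[expr]                     # (cost, (rows, cols))
    _, l, r = expr
    cl, (m, k) = flops(l, shapes)
    cr, (k2, n) = flops(r, shapes)
    assert k == k2, "dimension mismatch"
    return cl + cr + 2 * m * k * n, (m, n)

def rewrite_assoc(expr):
    """Yield expressions reachable by one associativity rewrite at the root."""
    if isinstance(expr, str):
        return
    _, l, r = expr
    if isinstance(l, tuple):                       # (X @ Y) @ Z -> X @ (Y @ Z)
        yield ("mm", l[1], ("mm", l[2], r))
    if isinstance(r, tuple):                       # X @ (Y @ Z) -> (X @ Y) @ Z
        yield ("mm", ("mm", l, r[1]), r[2])

shapes = {"A": (512, 512), "B": (512, 512), "C": (512, 8)}
expr = ("mm", ("mm", "A", "B"), "C")               # (A @ B) @ C
best = min([expr] + list(rewrite_assoc(expr)), key=lambda e: flops(e, shapes)[0])
print(best, flops(best, shapes)[0])                # picks A @ (B @ C) here
```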
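The autotuning loop shared by such systems can be sketched as follows. This is a hypothetical miniature: the "measurement" is a synthetic analytic cost standing in for hardware timings, and the surrogate is 1-nearest-neighbor rather than the boosted trees or neural nets used in practice.

```python
import random

TILES = [4, 8, 16, 32, 64, 128]

def measure(cfg):
    """Synthetic ground-truth cost of a (ti, tj, tk) tile configuration."""
    ti, tj, tk = cfg
    footprint = ti * tk + tk * tj + ti * tj        # working set per tile
    penalty = max(0, footprint - 4096) * 1e-3      # pretend cache holds 4096 words
    return 1.0 / min(ti, tj, tk) + penalty         # tiny tiles underuse parallelism

def surrogate(cfg, history):
    """Predict cost as that of the nearest already-measured configuration."""
    nearest = min(history, key=lambda h: sum((a - b) ** 2 for a, b in zip(h[0], cfg)))
    return nearest[1]

random.seed(0)
space = [(ti, tj, tk) for ti in TILES for tj in TILES for tk in TILES]
history = [(cfg, measure(cfg)) for cfg in random.sample(space, 8)]  # warm-up runs

for _ in range(16):                # explore: rank candidates cheaply, measure the best
    candidates = random.sample(space, 64)
    best_guess = min(candidates, key=lambda c: surrogate(c, history))
    history.append((best_guess, measure(best_guess)))

best_cfg, best_cost = min(history, key=lambda h: h[1])
print("best tiling:", best_cfg, "cost:", round(best_cost, 4))
```

The design point this illustrates: expensive measurements are reserved for candidates the cheap surrogate ranks highly, which is why surrogate fidelity dominates search efficiency.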
4. Theoretical Implications in Neural Network Analysis
The Tensor Programs framework enables rigorous analysis of the behavior of wide neural networks, providing, via the Master Theorem, convergence laws for the empirical distribution of vector entries to random variables determined by program structure (Yang, 2019, Yang et al., 2021, Yang, 2020). Key results include:
- NN-GP Correspondence: Wide networks at random initialization are Gaussian processes regardless of architecture, provided the computation is properly encoded as a Tensor Program (Yang, 2019); a Monte Carlo illustration follows this list.
- NTK Universality: Both the kernel at initialization and the full gradient-training dynamics (in the NTK parametrization) converge to deterministic evolution, with explicit kernel recursion formulas valid for MLPs, CNNs, ResNets, LSTMs, Transformers, and more (Yang, 2020, Yang et al., 2021).
- Free Independence and Neural Matrix Laws: The programmatic composition of weight matrices and nonlinearities yields asymptotic freeness in random matrix theory, enabling predictions of Jacobian spectra, justification for the gradient independence assumption, and exact dynamical isometry analysis (Yang, 2020).
- Extension to Adaptive Optimizers: The framework naturally extends to adaptive methods (e.g., Adam), using a generalized program language (NEXORT) and bra-ket notation to describe nonlinear kernel operators and feature-learning/maximal-update dynamics (Yang et al., 2023).
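As an empirical illustration of the NN-GP correspondence (a sanity check, not a proof), the sketch below samples many wide one-hidden-layer ReLU networks and compares the output covariance against the analytic arc-cosine kernel that the layerwise recursion of Section 1 yields for ReLU:

```python
import numpy as np

# Monte Carlo check of the NN-GP correspondence for a one-hidden-layer ReLU
# network with N(0, 1/fan_in) weights: the empirical covariance of the scalar
# output across random initializations should match the arc-cosine kernel.

rng = np.random.default_rng(0)
d, n, trials = 16, 2048, 4000          # input dim, width, number of random nets
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

def relu_kernel(u, v):
    """Analytic kernel: E[relu(g.u) relu(g.v)] for g ~ N(0, I/d)."""
    k11, k22, k12 = u @ u / d, v @ v / d, u @ v / d
    theta = np.arccos(np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0))
    return np.sqrt(k11 * k22) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

outs = np.empty((trials, 2))
for t in range(trials):
    W = rng.standard_normal((n, d)) / np.sqrt(d)   # first layer: variance 1/d
    v = rng.standard_normal(n) / np.sqrt(n)        # readout:     variance 1/n
    outs[t] = [v @ np.maximum(W @ x1, 0), v @ np.maximum(W @ x2, 0)]

print("empirical cov  :", np.mean(outs[:, 0] * outs[:, 1]))
print("analytic kernel:", relu_kernel(x1, x2))
```

The two printed numbers agree up to Monte Carlo error, and the agreement tightens as width and trial count grow.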
5. Platform Portability and Automated Transcompilation
The proliferation of heterogeneous device backends (NVIDIA CUDA, AMD HIP, Intel VNNI, proprietary ASICs, DRAM-PIM) demands platform-agnostic yet high-performance tensor program representations. QiMeng-Xpiler synthesizes cross-device low-level programs through a neural-symbolic approach, composing LLM-guided code transformation passes (driven by meta-prompts) with SMT-based symbolic repair, and orchestrating auto-tuning via brute-force intra-pass and MCTS-based inter-pass strategies (Dong et al., 4 May 2025). This approach supports "Write Once, Run Anywhere" for tensor programs, robustly translating across programming models with high accuracy and performance.
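The control flow of such a neural-symbolic loop can be summarized by the skeleton below. It is entirely hypothetical: the callable names are stand-ins, not QiMeng-Xpiler's actual API, whose passes, prompts, and SMT encoding are described in Dong et al. (4 May 2025).

```python
# Hypothetical skeleton of a neural-symbolic transcompilation loop: an LLM
# proposes each transformation pass, an SMT-backed checker verifies
# equivalence, and failing candidates are repaired symbolically.

def transcompile(src, passes, llm_transform, smt_equivalent, smt_repair):
    prog = src
    for meta_prompt in passes:                         # e.g. lower memory spaces, map intrinsics
        candidate = llm_transform(prog, meta_prompt)   # neural step: LLM-guided rewrite
        if not smt_equivalent(prog, candidate):        # symbolic step: equivalence check
            candidate = smt_repair(prog, candidate)    # symbolic step: localized repair
        prog = candidate                               # inter-pass tuning (MCTS) would search here
    return prog

# Smoke test with trivial stand-ins for the neural and symbolic components:
out = transcompile("cuda_kernel_src", ["pass1", "pass2"],
                   llm_transform=lambda p, m: p,
                   smt_equivalent=lambda a, b: True,
                   smt_repair=lambda a, b: b)
assert out == "cuda_kernel_src"
```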
6. Device- and Model-Agnostic Latency Modeling
Latency prediction and schedule selection for tensor programs on novel models and hardware platforms depend on efficient representation and adaptation. CDMPP demonstrates that compact AST representations, coupled with transformer-based encoders and domain-adaptive objectives, enable device- and model-invariant learned cost models with error rates of 10–15% across unseen operators and devices, drastically reducing both profiling cost and training time relative to traditional feature-engineering or LSTM-based approaches (Hu et al., 2023).
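A minimal version of such a learned cost model (illustrative only; CDMPP's actual featurization and transformer encoder are far richer) might featurize a program's loop nest into a fixed-length vector and regress log-latency:

```python
import numpy as np

# Illustrative learned cost model: featurize a loop nest into a fixed vector
# (loop extents, bytes moved, FLOPs, arithmetic intensity) and fit a linear
# regressor on log-latency. The profiled samples below are synthetic.

def featurize(loop_extents, bytes_moved, flops):
    return np.log1p(np.array(loop_extents + [bytes_moved, flops, flops / bytes_moved]))

# Hypothetical profiled samples: (loop extents, bytes moved, FLOPs, measured ms)
samples = [
    ([128, 128, 128], 3 * 128 * 128 * 4, 2 * 128 ** 3, 0.21),
    ([256, 256, 256], 3 * 256 * 256 * 4, 2 * 256 ** 3, 1.60),
    ([512, 512, 512], 3 * 512 * 512 * 4, 2 * 512 ** 3, 12.9),
    ([512, 128, 256], (512 * 128 + 128 * 256 + 512 * 256) * 4, 2 * 512 * 128 * 256, 1.7),
]
X = np.stack([featurize(l, b, f) for l, b, f, _ in samples])
y = np.log([t for *_, t in samples])

w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)  # linear fit
pred = np.exp(np.c_[X, np.ones(len(X))] @ w)
print("predicted ms:", np.round(pred, 3))
```

The gap between this linear baseline and CDMPP is precisely what the learned encoder and domain-adaptation objectives buy: transfer to unseen operators and devices without per-target re-profiling.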
7. Future Directions and Open Problems
Several research bottlenecks and future directions include:
- Symbolic Coverage and Rewrite Completeness: Current symbolic superoptimizers such as Prism and Mirage depend on manual incorporation of rewrite axioms. Extending these systems to automatically discover domain-specific rewrites, support nested or more complex control flow, and scale mapping enumeration across multiple loop-nest dimensions remains an open area (Wu et al., 16 Apr 2026, Wu et al., 2024).
- Hybrid Optimization: Combining the strengths of neural surrogate models (Ansor, AutoTVM) and symbolic pruning (Prism) to accelerate search while maintaining global optimality (Chen et al., 2018, Wu et al., 16 Apr 2026).
- Unified Theoretical–System Analysis: Bridging the infinite-width, kernel-method theory of Tensor Programs with practical, finite-width operator fusion, compression, and dynamic dataflow analysis to characterize feature learning and optimization in modern adaptive DNNs (Yang et al., 2023, Sakai et al., 1 Jun 2025).
- Generalization to Graph, Sparse, and Flexible Storage: Enhanced frameworks for representing and optimizing tensor programs under arbitrary sparsity patterns, storage formats, and memory layouts, with performance-portable code generation (Schleich et al., 2022, Deeds et al., 2024).
The tensor programs abstraction thus provides a common foundation for mathematical analysis and systems-level optimization in modern deep learning, unifying probabilistic kernels, operator scheduling, semantic program transformation, and hardware-aware autotuning across the deep learning stack.