Tensor Rematerialization
- Tensor rematerialization is a memory optimization paradigm that recomputes intermediate activations instead of storing them, trading extra computation for reduced memory consumption.
- It employs both offline MILP-based scheduling and dynamic online algorithms such as greedy eviction and sliding-window techniques to efficiently balance compute and memory.
- This approach enables training and inference of larger models on resource-constrained environments, benefiting applications from large-scale neural networks to edge devices.
Tensor rematerialization is a resource management paradigm for deep learning systems in which additional computation is intentionally incurred to reduce memory consumption. By selectively “forgetting” and later recomputing intermediate results—typically activations—rematerialization enables training, inference, or tensor algebra on models and datasets that would otherwise exceed hardware memory limits. The concept generalizes classical “checkpointing” methods and now spans static offline optimizations, greedy online algorithms, and dynamic scheduling techniques, adapting to both static and dynamic computation graphs across heterogeneous and constrained environments.
1. Foundations: Definition, Scope, and Formalization
Tensor rematerialization is formalized as the optimization of the trade-off between peak memory usage and extra computation during neural network training or inference. Rather than storing all intermediate tensors required for backward propagation (gradient computation), certain tensors are deallocated during the forward pass and recomputed when needed. This approach generalizes prior checkpointing strategies by encompassing arbitrary computation graphs (DAGs) with nonuniform computation and memory costs (Jain et al., 2019).
Key formal elements:
- Binary decision variables $S_{t,i}$ indicate retention (“checkpointing”) of the output of operation $i$ at step $t$; variables $R_{t,i}$ indicate recomputation of operation $i$ at step $t$.
- The optimization is commonly cast as a mixed-integer linear program (MILP) or quadratic program (MIQP), with linear constraints encoding data dependencies and device-specific memory budgets.
- In extended settings, additional variables track device, memory paging, or energy costs.
This general framework applies equally to forward and backward traversals and is not restricted to chain-like graphs (which characterized earlier checkpointing heuristics such as the Griewank and Chen methods).
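A minimal sketch of the resulting program in this notation is given below. It is a simplification of the Checkmate-style formulation, assuming per-operation compute costs $C_i$, output sizes $M_i$, a per-step memory budget $B$, and dependency edges $E$ of the computation graph:

```latex
\begin{aligned}
\min_{R,\,S}\quad & \sum_{t=1}^{T}\sum_{i=1}^{T} C_i\, R_{t,i}
  && \text{total (re)computation cost} \\
\text{s.t.}\quad & R_{t,j} \le R_{t,i} + S_{t,i}
  && \forall t,\ (i \to j) \in E \quad \text{(inputs resident when $j$ runs)} \\
& S_{t+1,i} \le R_{t,i} + S_{t,i}
  && \forall t,\ i \quad \text{(only existing tensors can be retained)} \\
& \sum_{i} M_i \bigl(R_{t,i} + S_{t,i}\bigr) \le B
  && \forall t \quad \text{(memory budget at every step)} \\
& R_{t,i},\ S_{t,i} \in \{0,1\}.
\end{aligned}
```

Full formulations such as Checkmate's additionally track when tensors are freed within a stage to obtain tighter memory accounting than the coarse per-step budget used here.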
2. Optimization Methodologies and Algorithmic Design
Tensor rematerialization can be approached through both offline and online scheduling:
Offline (Static) Approaches
- MILP Formulation: Checkmate (Jain et al., 2019) encodes the full execution (unrolled over stages) as an MILP. The objective is to minimize total (re)computation cost $\sum_{t,i} C_i R_{t,i}$ subject to per-stage memory budgets and data-dependency constraints, solvable by standard MILP solvers (e.g., Gurobi, COIN-OR); a toy encoding is sketched just after this list.
- LP Relaxations and Two-Phase Rounding: For scalability, integrality constraints can be relaxed and the fractional results then rounded deterministically or stochastically to produce near-optimal binary schedules, with empirical approximation factors close to 1.
- Constraint Programming with Retention Intervals: Moccasin (Bartan et al., 2023) represents each activation's storage as a retention interval, using only $O(n)$ integer variables versus the $O(n^2)$ binaries of MILP formulations, and solves via cumulative and reservoir constraints in CP solvers; this yields order-of-magnitude faster solve times on large graphs.
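As an illustration of the MILP approach, the sketch below encodes the simplified formulation from Section 1 for a tiny linear chain using PuLP (which calls the bundled COIN-OR CBC solver). The costs, sizes, and budget are made-up toy values, and the "op $t$ runs at stage $t$" convention is a deliberate simplification of Checkmate's frontier-advancing schedule:

```python
import pulp

# Toy linear chain: op i consumes the output of op i-1.
T = 4                      # number of ops / execution stages
C = [1.0, 2.0, 4.0, 1.0]   # hypothetical compute cost of each op
M = [2.0, 4.0, 2.0, 1.0]   # hypothetical memory size of each output
B = 6.0                    # hypothetical memory budget per stage

prob = pulp.LpProblem("rematerialization", pulp.LpMinimize)
# R[t][i] = 1 if op i is (re)computed at stage t; S[t][i] = 1 if its output is kept into stage t.
R = pulp.LpVariable.dicts("R", (range(T), range(T)), cat="Binary")
S = pulp.LpVariable.dicts("S", (range(T), range(T)), cat="Binary")

# Objective: total compute (including recomputation) across all stages.
prob += pulp.lpSum(C[i] * R[t][i] for t in range(T) for i in range(T))

for t in range(T):
    # Stage t must produce op t's output (the forward pass makes progress).
    prob += R[t][t] == 1
    # Memory budget: everything computed or retained at stage t must fit.
    prob += pulp.lpSum(M[i] * (R[t][i] + S[t][i]) for i in range(T)) <= B
    for i in range(T):
        # Dependency: recomputing op i at stage t needs its input (op i-1) resident.
        if i > 0:
            prob += R[t][i] <= R[t][i - 1] + S[t][i - 1]
        # Retention: a tensor can only be carried into stage t+1 if it exists at stage t.
        if t + 1 < T:
            prob += S[t + 1][i] <= R[t][i] + S[t][i]
# Nothing is retained before the first stage.
for i in range(T):
    prob += S[0][i] == 0

prob.solve(pulp.PULP_CBC_CMD(msg=False))
schedule = [[int(R[t][i].value()) for i in range(T)] for t in range(T)]
print("recompute matrix R[t][i]:", schedule)
```

With the 6-unit budget, the solver is forced to drop and later recompute cheap outputs rather than retain all of them, which is exactly the trade-off the objective prices.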
Online (Dynamic) Approaches
- Greedy Cache-Based Eviction: DTR (Kirisame et al., 2020) treats resident tensors as entries in a runtime cache, evicting candidates according to a heuristic that combines staleness $s(t)$, memory footprint $m(t)$, and recomputation cost $c(t)$, e.g. evicting the tensor minimizing $h_{\mathrm{DTR}}(t) = c(t) / \bigl(m(t)\, s(t)\bigr)$ (see the sketch after this list).
- Sliding Window Eviction for Contiguity: Coop (Zhang et al., 2023) prioritizes eviction of contiguous memory blocks via a sliding window over address-sorted tensors, minimizing fragmentation and the total recomputation cost for contiguous allocation.
- Partitioning and In-Place Recomputation: Partitioning tensors by cost density (compute cost per byte of memory) and rematerializing in place where it is safe avoid unnecessary fragmentation and redundant computation in dynamic frameworks.
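A minimal sketch of a DTR-style eviction loop follows. It assumes each tracked tensor records its recomputation cost, memory footprint, and last-access time; the class and field names are hypothetical, and DTR's actual runtime additionally handles recursive recomputation of evicted parents:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TrackedTensor:
    name: str
    compute_cost: float          # seconds to recompute this tensor from its parents
    memory: int                  # bytes occupied while resident
    last_access: float = field(default_factory=time.monotonic)
    resident: bool = True

def dtr_score(t: TrackedTensor, now: float) -> float:
    """DTR heuristic h(t) = c(t) / (m(t) * s(t)): cheap, large, stale tensors score lowest."""
    staleness = max(now - t.last_access, 1e-9)
    return t.compute_cost / (t.memory * staleness)

def evict_until_fits(pool: list[TrackedTensor], budget: int, needed: int) -> list[TrackedTensor]:
    """Greedily evict the lowest-scoring resident tensors until `needed` bytes fit under `budget`."""
    evicted = []
    now = time.monotonic()
    used = sum(t.memory for t in pool if t.resident)
    candidates = sorted((t for t in pool if t.resident), key=lambda t: dtr_score(t, now))
    for t in candidates:
        if used + needed <= budget:
            break
        t.resident = False       # a real runtime frees the buffer here; recompute on next access
        used -= t.memory
        evicted.append(t)
    return evicted

# Toy usage: a 100 MB budget, three cached activations, and a 60 MB allocation request.
pool = [
    TrackedTensor("a", compute_cost=0.002, memory=50_000_000),
    TrackedTensor("b", compute_cost=0.050, memory=30_000_000),
    TrackedTensor("c", compute_cost=0.001, memory=20_000_000),
]
print([t.name for t in evict_until_fits(pool, budget=100_000_000, needed=60_000_000)])
```

Note how the expensive-to-recompute tensor "b" survives while the cheap, large ones are dropped, which is the behavior the heuristic is designed to produce.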
3. Extension to Heterogeneous, Distributed, and Constrained Hardware
Tensor rematerialization is widely adapted to platforms with strict resource limits or heterogeneous resources:
- Device-Aware Scheduling: XEngine (Schuler et al., 2022) formulates operator placement across CPUs and GPUs with binary decision variables indexed by device $d$, timestep $t$, and operator $i$, minimizing joint compute and copy costs through an MIQP. Valid schedules respect per-device memory budgets and account for data-transfer costs so that computation is distributed efficiently.
- Distributed Full-Batch Training: SAR (Mostafa, 2021) applies sequential aggregation and rematerialization to GNNs. Each worker holds its local partition, fetches remote partitions one at a time on demand, and releases their memory immediately after the backward computation, so per-worker memory usage scales inversely with the number of workers (with a constant-factor increase when prefetching is enabled); a toy sketch of this pattern follows this list.
- Edge Device Strategies: POET (Patil et al., 2022) jointly optimizes rematerialization and paging on mobile-class hardware, cast as an MILP where both compute and page-in/page-out energy costs are profiled; constraints maintain correctness and meet global memory/runtime deadlines. Rematerialization is interleaved with DMA-based paging for energy and throughput optimization.
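The numpy sketch below illustrates the sequential-aggregation idea behind SAR: neighbor features are aggregated one remote partition at a time, and each fetched block is discarded immediately instead of holding the full neighborhood in memory (it is re-fetched, i.e. rematerialized, in the backward pass). The fetch function is a hypothetical stand-in for SAR's actual distributed communication:

```python
import numpy as np

def fetch_partition(worker_id: int, feat_dim: int, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical stand-in for pulling a remote worker's node features over the network."""
    return rng.standard_normal((1000, feat_dim))

def sequential_aggregate(adj_blocks: list[np.ndarray], num_workers: int, feat_dim: int) -> np.ndarray:
    """Sum-aggregate neighbor features partition by partition, never holding more than one
    remote block in memory at a time."""
    rng = np.random.default_rng(0)
    out = np.zeros((adj_blocks[0].shape[0], feat_dim))
    for w in range(num_workers):
        remote = fetch_partition(w, feat_dim, rng)      # materialize one remote partition
        out += adj_blocks[w] @ remote                   # (local x remote) @ (remote x feat)
        del remote                                      # release immediately (the SAR idea)
    return out

# Toy usage: 4 workers, 500 local nodes, 1000 nodes per remote partition, 16-dim features.
blocks = [np.random.default_rng(w).random((500, 1000)) for w in range(4)]
print(sequential_aggregate(blocks, num_workers=4, feat_dim=16).shape)
```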
4. Case Studies: Memory-Complexity, Graph Structure, and Application Domains
Empirical studies demonstrate substantial practical benefits:
- Checkmate (Jain et al., 2019): Training with batch sizes up to 73% larger (VGG19), or substantially larger input sizes (U-Net, MobileNet), under fixed memory budgets.
- SAR (Mostafa, 2021): Enables exact, full-batch GNN training for graphs with millions of nodes/edges. Memory per worker scales inversely with cluster size, and fused attention kernels further reduce overhead for attention-based GNNs.
- POET (Patil et al., 2022): On TX2 edge devices, POET decreases energy usage up to 35% relative to DTR and achieves memory reductions enabling BERT and ResNet-18 training on microcontroller-class memory.
- Coop (Zhang et al., 2023): Lowers minimum memory requirement by 25% for large transformer models (2.7B parameter GPT-3) and halves memory fragmentation compared to DTR across eight DNNs.
Additional domain extensions include tensor contraction and algebra:
- Tensor-Train Contraction Products (Kisil et al., 2021) reduce contraction complexity from exponential to linear in the tensor order via TT decompositions and diagrammatic manipulation, applicable to large-scale scientific computing and custom tensor network libraries (a minimal numerical illustration follows).
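As a concrete illustration of why TT formats cut contraction cost, the sketch below computes the inner product of two tensors given in TT form by sweeping over the cores, without ever materializing the full tensors. This is the standard TT inner-product recursion rather than the specific algorithm of Kisil et al.:

```python
import numpy as np

def tt_inner(cores_a, cores_b):
    """Inner product of two TT tensors given as lists of cores of shape
    (rank_prev, mode_size, rank_next); boundary ranks must be 1.
    Cost is linear in the number of modes instead of exponential."""
    m = np.ones((1, 1))
    for a, b in zip(cores_a, cores_b):
        # m[p, q] accumulates the partial contraction over all modes seen so far.
        m = np.einsum("aib,ac,cid->bd", a, m, b)
    return float(m[0, 0])

def random_tt(mode_sizes, rank, rng):
    """Random TT cores with a fixed internal rank and boundary ranks of 1."""
    ranks = [1] + [rank] * (len(mode_sizes) - 1) + [1]
    return [rng.standard_normal((ranks[k], n, ranks[k + 1]))
            for k, n in enumerate(mode_sizes)]

def tt_to_full(cores):
    """Expand a TT tensor to a dense array (only feasible for tiny examples)."""
    full = cores[0]
    for core in cores[1:]:
        full = np.einsum("...a,aib->...ib", full, core)
    return full.squeeze(axis=(0, -1))

rng = np.random.default_rng(0)
a = random_tt([3, 4, 5, 2], rank=2, rng=rng)
b = random_tt([3, 4, 5, 2], rank=3, rng=rng)
# The core sweep matches the brute-force dense inner product on this tiny example.
print(np.isclose(tt_inner(a, b), np.sum(tt_to_full(a) * tt_to_full(b))))
```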
5. Innovations in Dynamic and Symbolic Graph Rematerialization
Optimization in dynamic shape graphs presents unique challenges, addressed by:
- Symbolic Shape Analysis: BladeDISC++ (Yuan et al., 22 Dec 2024) represents unknown tensor dimensions as symbolic identifiers (e.g., $s_0$, $s_1$) and maintains algebraic relationships among them, so that the memory impact of an eviction (a product of symbolic dimensions and element size) can be compared and minimized without exact shape knowledge; for example, knowing $s_2 = s_0 \cdot s_1$ lets an expression written in terms of $s_2$ be simplified and compared against one written in terms of $s_0$ and $s_1$ (a toy illustration follows this list).
- Compilation-Runtime Stratification: Candidate evictions and recomputation branches are annotated in the compiled graph, with runtime selection of tensors for eviction and regeneration when actual memory exceeds the threshold, ensuring adaptability without pre-padding/bucketing.
- Empirically, memory consumption on dynamic-shape workloads matches or improves upon that of methods requiring precise shapes.
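A small sympy sketch of the kind of reasoning this enables is shown below: two candidate activations' memory footprints are compared symbolically, and a known relation between dimensions lets distinct expressions be unified. This is illustrative only; BladeDISC++ implements its own symbolic-dimension analysis inside the compiler rather than using sympy:

```python
import sympy as sp

# Symbolic dimensions with positivity assumptions (e.g. batch*seq and hidden sizes).
s0, s1 = sp.symbols("s0 s1", positive=True)

# Byte sizes of two candidate activations whose exact shapes are unknown at compile time.
bytes_a = 4 * s0 * s1        # fp32 tensor of shape (s0, s1)
bytes_b = 2 * s0 * s1        # fp16 tensor of the same shape

# Even without concrete values, the difference simplifies to a provably positive expression,
# so the compiler can rank the fp32 tensor as the larger (more profitable) eviction candidate.
diff = sp.simplify(bytes_a - bytes_b)
print(diff, diff.is_positive)            # 2*s0*s1 True

# Algebraic relations between dimensions (e.g. from a reshape) unify distinct expressions.
s2 = sp.symbols("s2", positive=True)
bytes_c = 4 * s2
print(sp.simplify(bytes_c.subs(s2, s0 * s1) - bytes_a) == 0)   # True: same footprint
```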
6. Rematerialization in Large Model Inference and Quantized Caching
Extending tensor rematerialization to LLM inference:
- KV Cache Rematerialization: XQuant (Tomar et al., 14 Aug 2025) quantizes and caches the layer-wise input activations $X$, rematerializing Keys and Values on the fly as $K = X W_K$ and $V = X W_V$. This avoids storing the KV cache explicitly, yielding large memory savings over standard KV caching with negligible perplexity degradation (a toy sketch follows this list).
- Cross-Layer Delta Compression: XQuant-CL exploits the similarity of $X$ across adjacent layers, quantizing only the inter-layer difference $\Delta X_{\ell}$ and reconstructing each layer's input as $X_{\ell} = X_{\ell-1} + \Delta X_{\ell}$, giving further memory savings at negligible accuracy cost.
- The approach leverages hardware trends favoring increased compute capability relative to memory bandwidth and is complementary to other low-bit quantization and pruning techniques.
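The numpy sketch below illustrates the rematerialization step itself: instead of caching K and V, a crudely quantized copy of the layer input X is stored and K, V are rebuilt on demand. Per-tensor int8 quantization here is a simplification of XQuant's actual scheme, and the projection matrices are random stand-ins:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization (a deliberately crude stand-in)."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
d_model, seq_len = 64, 128
W_K = rng.standard_normal((d_model, d_model)).astype(np.float32) / np.sqrt(d_model)
W_V = rng.standard_normal((d_model, d_model)).astype(np.float32) / np.sqrt(d_model)
X = rng.standard_normal((seq_len, d_model)).astype(np.float32)

# Instead of caching K and V (2 * seq_len * d_model floats), cache only quantized X.
X_q, X_scale = quantize_int8(X)

# At attention time, rematerialize K and V from the cached input activations.
X_hat = dequantize(X_q, X_scale)
K, V = X_hat @ W_K, X_hat @ W_V
print("relative K error:", np.linalg.norm(K - X @ W_K) / np.linalg.norm(X @ W_K))

# Cross-layer delta variant (the XQuant-CL idea): if layer l+1's input is close to layer l's,
# quantize only the difference and reconstruct X_{l+1} = X_hat_l + dequant(delta).
X_next = X + 0.05 * rng.standard_normal(X.shape).astype(np.float32)
d_q, d_scale = quantize_int8(X_next - X_hat)
X_next_hat = X_hat + dequantize(d_q, d_scale)
print("relative delta-reconstruction error:",
      np.linalg.norm(X_next_hat - X_next) / np.linalg.norm(X_next))
```

Because the delta has a much smaller dynamic range than the raw activations, the same int8 budget yields a noticeably lower reconstruction error, which is the intuition behind the cross-layer variant.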
7. Practical Implications and Future Research Directions
The evolution of tensor rematerialization techniques has driven training and deployment of deep learning models in contexts previously constrained by system memory. Key outcomes include improved memory utilization, reduced compute overhead in dynamic and distributed settings, and enhanced scalability for models spanning edge devices to large-scale clusters.
Ongoing research focuses on:
- Integrating hardware-level memory management and co-design for next-generation accelerators.
- Developing hybrid algorithms combining rematerialization, paging, quantization, and tensor decompositions.
- Extending optimization methodologies (CP, MILP, MIQP, symbolic algebra) to heterogeneous dynamic workloads and adaptive scheduling.
- Reducing fragmentation and further automating in-place recomputation.
- Empirically aligning online policies to approach offline optimality, exploring delta-quantization trade-offs, and unifying symbolic scheduling frameworks for dynamic shape graphs.
Tensor rematerialization thus remains a central theme in memory-efficient deep learning, bridging theory and practice in resource-constrained computation and large-model deployment.