
MONeT: Memory Optimized Network Training

Updated 9 December 2025
  • MONeT is a methodology that minimizes memory consumption during deep neural network training by optimizing write operations and recomputation strategies.
  • It leverages low-rank training (LRT) via low-rank gradient accumulation, together with MILP-based checkpointing and recomputation, to address hardware constraints on both emerging NVM devices and GPUs.
  • Empirical evaluations demonstrate significant memory reductions and improved accuracy, supporting energy-efficient, on-device, and large-scale training.

Memory Optimized Network Training (MONeT) encompasses a class of methodologies designed to minimize memory consumption during the training of deep neural networks, with distinct paradigms developed for both traditional digital accelerators (notably GPUs) and emerging hardware platforms such as resistive non-volatile memory (NVM) devices. MONeT approaches aim to reconcile stringent hardware constraints with the computational demands of network training, primarily by algorithmically reducing the number of memory writes and the total auxiliary memory usage, or by globally optimizing the storage and recomputation of intermediate activations without sacrificing accuracy (Gural et al., 2020, Shah et al., 2020).

1. Problem Statements and Targeted Constraints

Two principal variants of the MONeT paradigm have been proposed in the literature.

a. Emerging NVM-centric MONeT.

In the context of resistive NVM devices, training must address two interrelated hardware-centric constraints:

  • Low Write Density (LWD): In NVMs such as RRAM, each write operation incurs high energy cost ($\approx 10.9$ pJ/bit) and degrades endurance ($\approx 10^6$ cycles maximum). The metric $\rho$ denotes the number of writes per weight cell per sample, with $\rho \ll 1$ as an engineering goal. Batch SGD achieves $\rho = 1/B$. Any strategy must minimize $\rho$ without incurring prohibitive memory overhead.
  • Low Auxiliary Memory (LAM): On-chip NVM offers high density, whereas digital memory (SRAM) is area-constrained. Traditional mini-batch SGD requires $O(B n_o n_i b)$ bits of SRAM for a layer with weights $W \in \mathbb{R}^{n_o \times n_i}$, batch size $B$, and $b$-bit values. The target is $O(r(n_o + n_i)b)$ auxiliary memory with $r \ll \min(B, n_i, n_o)$; a snippet following this list plugs example sizes into these expressions.
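To make these budgets concrete, the snippet below plugs arbitrary example sizes into the two expressions above; the layer shape, batch size, rank, and bit width are all hypothetical.

```python
# Illustrative only: evaluate the memory expressions above for made-up sizes.
n_o, n_i = 256, 512   # layer output / input width (hypothetical)
B, r, b = 32, 4, 16   # batch size, LRT rank, bits per stored value (hypothetical)

minibatch_bits = B * n_o * n_i * b     # conventional accumulation, O(B n_o n_i b)
lrt_bits = r * (n_o + n_i) * b         # LRT auxiliary memory, O(r (n_o + n_i) b)

print(f"mini-batch SGD auxiliary memory: {minibatch_bits / 8 / 2**20:.1f} MiB")
print(f"LRT (r={r}) auxiliary memory:    {lrt_bits / 8 / 2**10:.1f} KiB")
print(f"batch-SGD write density rho = 1/B = {1 / B:.3f}")
```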

b. GPU-centric MONeT for Deep Network Training.

For large-scale models on digital accelerators, the bottleneck arises from the disparity between computation throughput and memory bandwidth/capacity. The objective is to minimize overall memory consumption (under a hard budget $M$) during the combined forward and backward passes, while keeping the increase in total computation within an acceptable overhead.

2. Methodologies

a. Low-Rank Training (LRT) for Emerging Memories

The LRT principle reformulates mini-batch gradient accumulation as a sequence of low-rank updates. For each sample $i$ and layer weight $W$, the rank-1 gradient is $\nabla_W \mathcal{L}^{(i)} = \delta^{(i)} (x^{(i)})^T$. Instead of storing the full sum $\sum_{i=1}^B \delta^{(i)} (x^{(i)})^T$, LRT maintains thin matrices $\tilde{V} \in \mathbb{R}^{n_o \times r}$ and $\tilde{U} \in \mathbb{R}^{n_i \times r}$ such that $\tilde{V}\tilde{U}^T \approx \sum_{i=1}^B \delta^{(i)} (x^{(i)})^T$. Upon each new sample, a rank-$(r+1)$ update is formed and reduced back to rank $r$ via the "Optimal Kronecker Sum" (OK) algorithm, using efficient QR and SVD factorizations. Accumulation continues until a threshold is met or the effective batch size $B$ is reached, at which point the weight update $W \leftarrow W - \eta\, \tilde{V}\tilde{U}^T$ is applied in NVM (Gural et al., 2020).

b. MILP-based Global Memory Optimization

In deep learning frameworks, MONeT formulates the training pipeline as a directed acyclic graph (DAG) $G = (V, E)$ whose nodes are operators, each producing activations and consuming parameters. Training under a peak memory limit $M$ at minimal compute time $T$ is cast as a 0–1 MILP over:

  • Checkpointing decisions ($s_i^k \in \{0,1\}$): whether to store each activation $x_i$ at each phase.
  • Recomputation plans ($r_i^k \in \{0,1\}$): whether to recompute $x_i$ for use in backward phase $k$.
  • Operator selection ($\delta_{i,l}, \hat{\delta}_{k,l}$): which operator implementation to use for each computation, trading workspace memory for speed.

The MILP jointly constrains the peak memory of all phases to $\leq M$, encodes DAG dependencies, and minimizes total forward, backward, and recomputation time. Solvers such as Gurobi compute optimal checkpointing/recomputation schedules and operator assignments (Shah et al., 2020).
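To convey the flavor of this formulation, the following is a deliberately simplified checkpointing MILP written with PuLP; it collapses MONeT's phase-indexed decisions and operator choices into a single store-or-recompute choice per activation, and the activation names, sizes, costs, and budget are invented for the example.

```python
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum, PULP_CBC_CMD

# Hypothetical per-activation statistics (MB to keep, ms to recompute).
acts = ["conv1", "conv2", "conv3", "fc"]
mem = {"conv1": 400, "conv2": 300, "conv3": 200, "fc": 50}
recompute_cost = {"conv1": 5.0, "conv2": 4.0, "conv3": 3.0, "fc": 0.5}
M = 600  # peak-memory budget in MB

prob = LpProblem("toy_checkpointing", LpMinimize)
s = LpVariable.dicts("store", acts, cat=LpBinary)      # keep activation in memory
r = LpVariable.dicts("recompute", acts, cat=LpBinary)  # recompute it in the backward pass

# Objective: minimize total recomputation time.
prob += lpSum(recompute_cost[a] * r[a] for a in acts)

# Every activation needed by the backward pass is either stored or recomputed.
for a in acts:
    prob += s[a] + r[a] >= 1

# Stored activations must fit the memory budget.
prob += lpSum(mem[a] * s[a] for a in acts) <= M

prob.solve(PULP_CBC_CMD(msg=False))
for a in acts:
    print(a, "store" if s[a].value() > 0.5 else "recompute")
```

The real MONeT MILP additionally tracks which tensors are live in each backward phase and which operator implementation executes each node, but the store/recompute trade-off under a memory budget is the same core idea.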

3. Algorithmic Implementations and Workflow Steps

a. LRT: SGD Augmentation and In-place NVM Writes

Key LRT augmentations to SGD include:

  • Maintenance of orthogonal bases $Q_L \in \mathbb{R}^{n_o \times (r+1)}$ and $Q_R \in \mathbb{R}^{n_i \times (r+1)}$ together with an updated weighting vector $x \in \mathbb{R}^{r+1}$.
  • Modified Gram–Schmidt expansion of these bases upon each incoming pair $(x^{(i)}, \delta^{(i)})$.
  • Formation of a small $(r+1) \times (r+1)$ matrix $M$ and its SVD, yielding the optimal unbiased (or biased) rank-$r$ approximation.
  • The net update $Q_L \operatorname{diag}(\sqrt{x})_{[1:r]} \, (Q_R \operatorname{diag}(\sqrt{x})_{[1:r]})^T$ is written to NVM at intervals defined by the effective batch size or threshold, directly controlling write density and SRAM usage; the expand-and-truncate step is sketched below.
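A minimal NumPy sketch of this expand-and-truncate step is given below. It folds each rank-1 gradient into thin factors via thin QR and a small SVD, and applies one low-rank weight write per effective batch; note it uses plain (biased) best rank-$r$ truncation rather than the unbiased Optimal Kronecker-sum mixing, and all sizes are illustrative.

```python
import numpy as np

def lrt_rank1_update(V, U, delta, x, r):
    """Fold one rank-1 gradient delta @ x.T into the rank-r pair (V, U).

    Expand to rank r+1, then keep the best rank-r approximation via a
    small SVD (a biased stand-in for the OK truncation).
    """
    Vp = np.column_stack([V, delta])      # (n_o, r+1) factors of V U^T + delta x^T
    Up = np.column_stack([U, x])          # (n_i, r+1)

    QL, RL = np.linalg.qr(Vp)             # thin QR: the Gram-Schmidt expansion step
    QR_, RR = np.linalg.qr(Up)

    core = RL @ RR.T                      # small (r+1) x (r+1) matrix
    P, sv, Wt = np.linalg.svd(core)

    scale = np.sqrt(sv[:r])               # split sqrt(sigma) between the two factors
    V_new = QL @ P[:, :r] * scale
    U_new = QR_ @ Wt[:r, :].T * scale
    return V_new, U_new

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
n_o, n_i, r, B = 32, 64, 4, 10
V = np.zeros((n_o, r))
U = np.zeros((n_i, r))
W = rng.standard_normal((n_o, n_i))
for _ in range(B):                        # accumulate one effective batch
    x = rng.standard_normal((n_i, 1))
    delta = rng.standard_normal((n_o, 1))
    V, U = lrt_rank1_update(V, U, delta, x, r)
W -= 0.01 * (V @ U.T)                     # single low-rank write to NVM per batch
```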

b. MONeT MILP Pipeline

  • JIT-trace the PyTorch model to assemble the DAG, extract operator outputs, and profile operation-specific time ($\tau_i^l$) and workspace ($c_i^l$); a tracing sketch follows this list.
  • Construct and solve MILP, which produces checkpoint and recomputation decisions, and operator implementations for both forward and backward passes.
  • Emit an “execution plan” that schedules forward-pass storage, aggressive deallocation, and the backward pass (with possible recomputation), achieving the target memory $M$ with minimal compute overhead (Shah et al., 2020).
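As a rough illustration of the first two steps, the sketch below uses standard PyTorch JIT tracing on a torchvision ResNet-50 (an arbitrary example model) to expose the operator nodes that become the MILP's DAG vertices; it times only the whole forward pass, whereas MONeT profiles each operator implementation and its workspace individually.

```python
import time

import torch
import torchvision

# Hypothetical stand-in for MONeT's tracing/profiling step; the real
# per-operator profiler lives in the authors' repository.
model = torchvision.models.resnet50().eval()
example = torch.randn(1, 3, 224, 224)

traced = torch.jit.trace(model, example)   # JIT trace -> static operator graph
graph = traced.graph

# Enumerate operator nodes of the traced graph (the vertices of the DAG).
op_kinds = [node.kind() for node in graph.nodes()]
print(f"{len(op_kinds)} graph nodes, e.g. {op_kinds[:5]}")

# Coarse whole-network timing; MONeT instead profiles each operator
# implementation for its runtime and workspace memory.
with torch.no_grad():
    start = time.perf_counter()
    traced(example)
    print(f"forward pass: {(time.perf_counter() - start) * 1e3:.1f} ms")
```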

4. Quantitative Results and Empirical Evaluation

a. NVM-Targeted LRT

On a 4-layer CNN for MNIST adaptation:

  • Online SGD ($\rho = 1$) yields $\sim 10^4$ writes per cell over 10k samples, with only 1–2% recovery in accuracy.
  • LRT ($r = 4$, $B = 10$ or $100$) reduces per-cell writes by $\sim 10^3$, achieving a 10–15% accuracy boost versus pure inference.
  • Device lifetime and energy per decision improve by $\sim 10^3\times$ due to the reduced writes.

On ImageNet with ResNet-34 head adaptation (Table 1):

  • SGD: $+0.9 \pm 0.2\%$ accuracy at $\eta = 0.1$.
  • Biased LRT ($r = 4$): $+5.2 \pm 0.8\%$ at $\eta = 0.01$.
  • Unbiased LRT ($r = 4$): $+8.0 \pm 1.1\%$ at $\eta = 0.1$.
  • UORO ($r = 1$): $+0.4 \pm 0.3\%$.

LRT delivers 5–8× greater accuracy recovery than SGD at $\sim 1/B$ write density (Gural et al., 2020).

b. Memory Optimization in Deep Learning Frameworks

At a fixed 10% runtime overhead:

Model          PyTorch (GB)   Checkmate (GB)   MONeT (GB)
ResNet-50          15.1            8.2             5.7
GoogleNet          14.9           10.5             6.9
UNet               14.3            9.1             5.2
VGG-16             14.1            9.9             5.5
MobileNet-V2       14.5            5.8             4.8

MONeT achieves a 2–3× reduction in memory over the PyTorch baseline and 1.2–1.8× over Checkmate under matched compute overhead (Shah et al., 2020).

5. Trade-Offs and Hyperparameter Guidance

a. LRT Rank Selection and Bias–Variance

  • A higher LRT rank $r$ yields lower approximation error (governed by singular-value decay), improving convergence at the cost of increased SRAM. Accuracy improves rapidly up to $r \approx 4$–$8$ (see Figure 1 in (Gural et al., 2020)), with diminishing returns afterward.
  • Biased truncation is computationally simpler but introduces bias; it works well when max-norm gradient scaling is used. Unbiased truncation (Optimal Kronecker mixing) is recommended for fully-connected or final dense layers.
  • The LRT learning rate $\eta$ should scale as $\propto \sqrt{B}$ to compensate for the effective batch accumulation.
  • The write threshold $\rho_{\min}$ should be set empirically (e.g., $0.01$) to avoid sub-LSB gradient writes; a small helper combining these defaults is sketched below.
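A small helper collecting these rules of thumb might look as follows; the base learning rate, default rank, and threshold are placeholder values rather than settings taken from the paper.

```python
import math

def lrt_hyperparams(B, eta_base=0.01, rank=4, rho_min=0.01):
    """Placeholder defaults following the guidance above (not paper values)."""
    return {
        "rank": rank,                   # r ~ 4-8 captures most of the benefit
        "lr": eta_base * math.sqrt(B),  # scale learning rate with sqrt(B)
        "write_threshold": rho_min,     # skip sub-LSB gradient writes
    }

print(lrt_hyperparams(B=100))
```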

b. MILP Solver and Optimization Scope

  • MILP-based MONeT scales to $\sim 200$ computational nodes with solve times of a few hours; larger graphs require decomposition or heuristics. The resulting runtime plan is static, but it generalizes to practical deep networks with complex DAG structure.
  • All reported experiments use full-precision FP32; lower-precision implementations can be incorporated as additional operator options.

6. Practical Implementation and Limitations

a. Integration into Training Pipelines

  • MONeT for emerging NVMs can be integrated into edge training loops, reducing NVM writes and auxiliary memory to levels compatible with on-device learning, federated adaptation, and robust drift compensation (Gural et al., 2020).
  • In deep learning frameworks, application involves a model trace, operator profiling, MILP schedule generation, and runtime plan emission within PyTorch, with scripts available at the referenced repository (Shah et al., 2020).

b. Limitations and Recommendations

  • MONeT's MILP assumes a static feed-forward graph; dynamic architectures require partitioning or linearization.
  • Solver times are amortized over repeated usage—once a plan is found, it applies to all subsequent runs with that architecture and memory constraint.
  • For quantized or low-precision memory or for extremely deep networks, further engineering is required to fully exploit MONeT's memory efficiency.

MONeT methodologies unify algorithmic innovations in low-rank gradient accumulation, global checkpointing, operator selection, and memory footprint minimization by leveraging both mathematical optimization and hardware-conscious training schedules. These approaches enable on-device learning in emerging memory technologies and massively reduce the memory footprint for training with negligible impact on model accuracy and moderate computational overhead. MONeT empirically outperforms previous checkpointing and memory optimization frameworks across a range of deep network architectures (Gural et al., 2020, Shah et al., 2020).
