
MONeT: Memory Optimized Network Training

Updated 9 December 2025
  • MONeT is a methodology that minimizes memory consumption during deep neural network training by optimizing write operations and recomputation strategies.
  • It leverages low-rank training (LRT) via low-rank gradient accumulation, together with MILP-based checkpointing and recomputation, to address hardware constraints on both emerging NVM devices and GPUs.
  • Empirical evaluations demonstrate significant memory reductions and improved accuracy, supporting energy-efficient, on-device, and large-scale training.

Memory Optimized Network Training (MONeT) encompasses a class of methodologies designed to minimize memory consumption during the training of deep neural networks, with distinct paradigms developed for both traditional digital accelerators (notably GPUs) and emerging hardware platforms such as resistive non-volatile memory (NVM) devices. MONeT approaches aim to reconcile stringent hardware constraints with the computational demands of network training, primarily by algorithmically reducing the number of memory writes and the total auxiliary memory usage, or by globally optimizing the storage and recomputation of intermediate activations without sacrificing accuracy (Gural et al., 2020, Shah et al., 2020).

1. Problem Statements and Targeted Constraints

Two principal variants of the MONeT paradigm have been proposed in the literature.

a. Emerging NVM-centric MONeT.

In the context of resistive NVM devices, training must address two interrelated hardware-centric constraints:

  • Low Write Density (LWD): In NVMs such as RRAM, each write operation incurs high energy cost ($\approx 10.9$ pJ/bit) and degrades endurance ($\approx 10^6$ cycles maximum). The metric $\rho$ denotes the number of writes per weight cell per sample, with $\rho \ll 1$ as an engineering goal. Batch SGD achieves $\rho = 1/B$. Any strategy must minimize $\rho$ without incurring prohibitive memory overhead.
  • Low Auxiliary Memory (LAM): On-chip NVM offers high density, whereas digital memory (SRAM) is area-constrained. Traditional mini-batch SGD requires $O(B n_o n_i b)$ bits of SRAM for a layer with weights $W \in \mathbb{R}^{n_o \times n_i}$, batch size $B$, and $b$-bit values. The target is $O(r(n_o + n_i)b)$ auxiliary memory with $r \ll \min(B, n_i, n_o)$; a snippet following this list plugs example sizes into these expressions.
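To make these budgets concrete, the snippet below plugs arbitrary example sizes into the two expressions above; the layer shape, batch size, rank, and bit width are all hypothetical.

```python
# Illustrative only: evaluate the memory expressions above for made-up sizes.
n_o, n_i = 256, 512   # layer output / input width (hypothetical)
B, r, b = 32, 4, 16   # batch size, LRT rank, bits per stored value (hypothetical)

minibatch_bits = B * n_o * n_i * b     # conventional accumulation, O(B n_o n_i b)
lrt_bits = r * (n_o + n_i) * b         # LRT auxiliary memory, O(r (n_o + n_i) b)

print(f"mini-batch SGD auxiliary memory: {minibatch_bits / 8 / 2**20:.1f} MiB")
print(f"LRT (r={r}) auxiliary memory:    {lrt_bits / 8 / 2**10:.1f} KiB")
print(f"batch-SGD write density rho = 1/B = {1 / B:.3f}")
```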

b. GPU-centric MONeT for Deep Network Training.

For large-scale models on digital accelerators, the bottleneck arises from the disparity between computation throughput and memory bandwidth/capacity. The objective is to minimize overall memory consumption (under a hard budget $M$) during the combined forward and backward passes, while keeping the increase in total computation within an acceptable overhead.

2. Methodologies

a. Low-Rank Training (LRT) for Emerging Memories

The LRT principle reformulates mini-batch gradient accumulation as a sequence of low-rank updates. For each sample $i$ and layer weight $W$, the rank-1 gradient is $\nabla_W \mathcal{L}^{(i)} = \delta^{(i)} (x^{(i)})^T$. Instead of storing the full sum $\sum_{i=1}^B \delta^{(i)} (x^{(i)})^T$, LRT maintains thin matrices $\tilde{V} \in \mathbb{R}^{n_o \times r}$ and $\tilde{U} \in \mathbb{R}^{n_i \times r}$ such that $\tilde{V}\tilde{U}^T \approx \sum_{i=1}^B \delta^{(i)} (x^{(i)})^T$. Upon each new sample, a rank-$(r+1)$ update is formed and reduced back to rank $r$ via the "Optimal Kronecker Sum" (OK) algorithm, using efficient QR and SVD factorizations. Accumulation continues until a threshold is met or the effective batch size $B$ is reached, at which point the weight update $W \leftarrow W - \eta\, \tilde{V}\tilde{U}^T$ is applied in NVM (Gural et al., 2020).

b. MILP-based Global Memory Optimization

In deep learning frameworks, MONeT formulates the training pipeline as a directed acyclic graph (DAG) $G = (V, E)$ whose nodes are operators, each producing activations and consuming parameters. Training under a peak memory limit $M$ at minimal compute time $T$ is cast as a 0–1 MILP over:

  • Checkpointing decisions ($s_i^k \in \{0,1\}$): whether to store each activation $x_i$ at each phase.
  • Recomputation plans ($r_i^k \in \{0,1\}$): whether to recompute $x_i$ for use in backward phase $k$.
  • Operator selection ($\delta_{i,l}, \hat{\delta}_{k,l}$): which operator implementation to use for each computation, trading workspace memory for speed.

The MILP jointly constrains the peak memory of all phases to $\leq M$, encodes DAG dependencies, and minimizes total forward, backward, and recomputation time. Solvers such as Gurobi compute optimal checkpointing/recomputation schedules and operator assignments (Shah et al., 2020).
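To convey the flavor of this formulation, the following is a deliberately simplified checkpointing MILP written with PuLP; it collapses MONeT's phase-indexed decisions and operator choices into a single store-or-recompute choice per activation, and the activation names, sizes, costs, and budget are invented for the example.

```python
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum, PULP_CBC_CMD

# Hypothetical per-activation statistics (MB to keep, ms to recompute).
acts = ["conv1", "conv2", "conv3", "fc"]
mem = {"conv1": 400, "conv2": 300, "conv3": 200, "fc": 50}
recompute_cost = {"conv1": 5.0, "conv2": 4.0, "conv3": 3.0, "fc": 0.5}
M = 600  # peak-memory budget in MB

prob = LpProblem("toy_checkpointing", LpMinimize)
s = LpVariable.dicts("store", acts, cat=LpBinary)      # keep activation in memory
r = LpVariable.dicts("recompute", acts, cat=LpBinary)  # recompute it in the backward pass

# Objective: minimize total recomputation time.
prob += lpSum(recompute_cost[a] * r[a] for a in acts)

# Every activation needed by the backward pass is either stored or recomputed.
for a in acts:
    prob += s[a] + r[a] >= 1

# Stored activations must fit the memory budget.
prob += lpSum(mem[a] * s[a] for a in acts) <= M

prob.solve(PULP_CBC_CMD(msg=False))
for a in acts:
    print(a, "store" if s[a].value() > 0.5 else "recompute")
```

The real MONeT MILP additionally tracks which tensors are live in each backward phase and which operator implementation executes each node, but the store/recompute trade-off under a memory budget is the same core idea.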

3. Algorithmic Implementations and Workflow Steps

a. LRT: SGD Augmentation and In-place NVM Writes

Key LRT augmentations to SGD include:

  • Maintenance of orthogonal bases $Q_L \in \mathbb{R}^{n_o \times (r+1)}$ and $Q_R \in \mathbb{R}^{n_i \times (r+1)}$ together with an updated weighting vector $x \in \mathbb{R}^{r+1}$.
  • Modified Gram–Schmidt expansion of these bases upon each incoming pair $(x^{(i)}, \delta^{(i)})$.
  • Formation of a small $(r+1) \times (r+1)$ matrix $M$ and its SVD, yielding the optimal unbiased (or biased) rank-$r$ approximation.
  • The net update $Q_L \operatorname{diag}(\sqrt{x})_{[1:r]} \, (Q_R \operatorname{diag}(\sqrt{x})_{[1:r]})^T$ is written to NVM at intervals defined by the effective batch size or threshold, directly controlling write density and SRAM usage; the expand-and-truncate step is sketched below.
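A minimal NumPy sketch of this expand-and-truncate step is given below. It folds each rank-1 gradient into thin factors via thin QR and a small SVD, and applies one low-rank weight write per effective batch; note it uses plain (biased) best rank-$r$ truncation rather than the unbiased Optimal Kronecker-sum mixing, and all sizes are illustrative.

```python
import numpy as np

def lrt_rank1_update(V, U, delta, x, r):
    """Fold one rank-1 gradient delta @ x.T into the rank-r pair (V, U).

    Expand to rank r+1, then keep the best rank-r approximation via a
    small SVD (a biased stand-in for the OK truncation).
    """
    Vp = np.column_stack([V, delta])      # (n_o, r+1) factors of V U^T + delta x^T
    Up = np.column_stack([U, x])          # (n_i, r+1)

    QL, RL = np.linalg.qr(Vp)             # thin QR: the Gram-Schmidt expansion step
    QR_, RR = np.linalg.qr(Up)

    core = RL @ RR.T                      # small (r+1) x (r+1) matrix
    P, sv, Wt = np.linalg.svd(core)

    scale = np.sqrt(sv[:r])               # split sqrt(sigma) between the two factors
    V_new = QL @ P[:, :r] * scale
    U_new = QR_ @ Wt[:r, :].T * scale
    return V_new, U_new

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
n_o, n_i, r, B = 32, 64, 4, 10
V = np.zeros((n_o, r))
U = np.zeros((n_i, r))
W = rng.standard_normal((n_o, n_i))
for _ in range(B):                        # accumulate one effective batch
    x = rng.standard_normal((n_i, 1))
    delta = rng.standard_normal((n_o, 1))
    V, U = lrt_rank1_update(V, U, delta, x, r)
W -= 0.01 * (V @ U.T)                     # single low-rank write to NVM per batch
```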

b. MONeT MILP Pipeline

  • JIT-trace the PyTorch model to assemble the DAG, extract operator outputs, and profile operation-specific time ($\tau_i^l$) and workspace ($c_i^l$); a tracing sketch follows this list.
  • Construct and solve MILP, which produces checkpoint and recomputation decisions, and operator implementations for both forward and backward passes.
  • Emit an “execution plan” that schedules forward-pass storage, aggressive deallocation, and the backward pass (with possible recomputation), achieving the target memory $M$ with minimal compute overhead (Shah et al., 2020).
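As a rough illustration of the first two steps, the sketch below uses standard PyTorch JIT tracing on a torchvision ResNet-50 (an arbitrary example model) to expose the operator nodes that become the MILP's DAG vertices; it times only the whole forward pass, whereas MONeT profiles each operator implementation and its workspace individually.

```python
import time

import torch
import torchvision

# Hypothetical stand-in for MONeT's tracing/profiling step; the real
# per-operator profiler lives in the authors' repository.
model = torchvision.models.resnet50().eval()
example = torch.randn(1, 3, 224, 224)

traced = torch.jit.trace(model, example)   # JIT trace -> static operator graph
graph = traced.graph

# Enumerate operator nodes of the traced graph (the vertices of the DAG).
op_kinds = [node.kind() for node in graph.nodes()]
print(f"{len(op_kinds)} graph nodes, e.g. {op_kinds[:5]}")

# Coarse whole-network timing; MONeT instead profiles each operator
# implementation for its runtime and workspace memory.
with torch.no_grad():
    start = time.perf_counter()
    traced(example)
    print(f"forward pass: {(time.perf_counter() - start) * 1e3:.1f} ms")
```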

4. Quantitative Results and Empirical Evaluation

a. NVM-Targeted LRT

On a 4-layer CNN for MNIST adaptation:

  • Online SGD ($\rho = 1$) yields $\sim 10^4$ writes per cell over 10k samples, with only 1–2% recovery in accuracy.
  • LRT ($r = 4$, $B = 10$ or $100$) reduces per-cell writes by $\sim 10^3$, achieving a 10–15% accuracy boost versus pure inference.
  • Device lifetime and energy per decision improve by $\sim 10^3\times$ due to the reduced writes.

On ImageNet with ResNet-34 head adaptation (Table 1):

  • SGD: $+0.9 \pm 0.2\%$ accuracy at $\eta = 0.1$.
  • Biased LRT ($r = 4$): $+5.2 \pm 0.8\%$ at $\eta = 0.01$.
  • Unbiased LRT ($r = 4$): $+8.0 \pm 1.1\%$ at $\eta = 0.1$.
  • UORO ($r = 1$): $+0.4 \pm 0.3\%$.

LRT delivers 5–8× greater accuracy recovery than SGD at $\sim 1/B$ write density (Gural et al., 2020).

b. Memory Optimization in Deep Learning Frameworks

At a fixed 10% runtime overhead:

Model          PyTorch (GB)   Checkmate (GB)   MONeT (GB)
ResNet-50          15.1            8.2             5.7
GoogleNet          14.9           10.5             6.9
UNet               14.3            9.1             5.2
VGG-16             14.1            9.9             5.5
MobileNet-V2       14.5            5.8             4.8

MONeT achieves a 2–3× reduction in memory over the PyTorch baseline and 1.2–1.8× over Checkmate under matched compute overhead (Shah et al., 2020).

5. Trade-Offs and Hyperparameter Guidance

a. LRT Rank Selection and Bias–Variance

  • A higher LRT rank $r$ yields lower approximation error (governed by singular-value decay), improving convergence at the cost of increased SRAM. Accuracy improves rapidly up to $r \approx 4$–$8$ (see Figure 1 in (Gural et al., 2020)), with diminishing returns afterward.
  • Biased truncation is computationally simpler but introduces bias; it works well when max-norm gradient scaling is used. Unbiased truncation (Optimal Kronecker mixing) is recommended for fully-connected or final dense layers.
  • The LRT learning rate $\eta$ should scale as $\propto \sqrt{B}$ to compensate for the effective batch accumulation.
  • The write threshold $\rho_{\min}$ should be set empirically (e.g., $0.01$) to avoid sub-LSB gradient writes; a small helper combining these defaults is sketched below.
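A small helper collecting these rules of thumb might look as follows; the base learning rate, default rank, and threshold are placeholder values rather than settings taken from the paper.

```python
import math

def lrt_hyperparams(B, eta_base=0.01, rank=4, rho_min=0.01):
    """Placeholder defaults following the guidance above (not paper values)."""
    return {
        "rank": rank,                   # r ~ 4-8 captures most of the benefit
        "lr": eta_base * math.sqrt(B),  # scale learning rate with sqrt(B)
        "write_threshold": rho_min,     # skip sub-LSB gradient writes
    }

print(lrt_hyperparams(B=100))
```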

b. MILP Solver and Optimization Scope

  • MILP-based MONeT scales to $\sim 200$ computational nodes with solve times of a few hours; larger graphs require decomposition or heuristics. The resulting runtime plan is static, but it generalizes to practical deep networks with complex DAG structure.
  • All reported experiments use full-precision FP32; lower-precision implementations can be incorporated as additional operator options.

6. Practical Implementation and Limitations

a. Integration into Training Pipelines

  • MONeT for emerging NVMs can be integrated into edge training loops, reducing NVM writes and auxiliary memory to levels compatible with on-device learning, federated adaptation, and robust drift compensation (Gural et al., 2020).
  • In deep learning frameworks, application involves a model trace, operator profiling, MILP schedule generation, and runtime plan emission within PyTorch, with scripts available at the referenced repository (Shah et al., 2020).

b. Limitations and Recommendations

  • MONeT's MILP assumes a static feed-forward graph; dynamic architectures require partitioning or linearization.
  • Solver times are amortized over repeated usage—once a plan is found, it applies to all subsequent runs with that architecture and memory constraint.
  • For quantized or low-precision memory or for extremely deep networks, further engineering is required to fully exploit MONeT's memory efficiency.

MONeT methodologies unify algorithmic innovations in low-rank gradient accumulation, global checkpointing, operator selection, and memory footprint minimization by leveraging both mathematical optimization and hardware-conscious training schedules. These approaches enable on-device learning in emerging memory technologies and massively reduce the memory footprint for training with negligible impact on model accuracy and moderate computational overhead. MONeT empirically outperforms previous checkpointing and memory optimization frameworks across a range of deep network architectures (Gural et al., 2020, Shah et al., 2020).
