FlashOptim: Halving Training Memory Without Sacrificing Accuracy

This presentation explores FlashOptim, a suite of optimizer kernel transformations that cuts neural network training memory in half while preserving model quality. The talk covers ULP-based floating-point weight splitting, companded 8-bit optimizer state quantization, and empirical results demonstrating a 36% peak memory reduction for large language model finetuning, with no impact on convergence or accuracy across vision and language domains.
Script
Training a billion-parameter language model requires 175 gigabytes of memory. The authors of FlashOptim found a way to cut that to 113 gigabytes without losing a single point of accuracy.
The bottleneck is simple arithmetic. Every parameter needs high-precision master weights for stable updates, plus optimizer states to track momentum and variance. This 16-byte-per-parameter cost becomes the hard limit on model size.
FlashOptim attacks this problem with two complementary innovations.
On the left, ULP-based weight splitting replaces expensive FP32 master weights by storing only the rounding error—compressed into 8 bits—alongside the 16-bit weight. On the right, companding reshapes optimizer state distributions before quantization, turning what would be catastrophic precision loss into negligible error.
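The weight-splitting idea can be sketched concretely. The following is a minimal NumPy illustration under stated assumptions: positive, normal-range float32 weights, bfloat16 as the 16-bit format, and a uniform 8-bit grid over one bfloat16 ULP. Function names and the residual grid are mine, not the authors' kernel.

```python
import numpy as np

def to_bf16(x):
    # Truncate float32 to bfloat16 precision by zeroing the low 16 bits
    # of the float32 bit pattern (bf16 keeps the sign, 8 exponent bits,
    # and the top 7 mantissa bits).
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def bf16_ulp(x):
    # One bfloat16 unit-in-the-last-place at |x|: spacing 2**(e - 7)
    # for a value with binary exponent e (7 explicit mantissa bits).
    _, e = np.frexp(np.abs(x))
    return np.exp2(e - 1 - 7).astype(np.float32)

def split(w32):
    # Store the bf16 weight plus the truncation error, quantized to 8 bits
    # on a uniform grid spanning one bf16 ULP.
    w16 = to_bf16(w32)
    ulp = bf16_ulp(w32)
    residual = w32 - w16  # in [0, ulp) for positive weights
    q = np.clip(np.round(residual / ulp * 256.0), 0, 255).astype(np.uint8)
    return w16, q

def reconstruct(w16, q):
    # Recover a near-FP32 master weight: bf16 value + dequantized residual.
    return w16 + q.astype(np.float32) / 256.0 * bf16_ulp(w16)
```

Because the 8-bit residual subdivides a single ULP, reconstruction error is bounded by 1/256 of a bfloat16 ULP, versus up to a full ULP for storing bf16 alone, at a cost of 3 bytes instead of 4.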
Without companding, quantizing optimizer states to 8 bits is a recipe for disaster. The authors demonstrate that naive quantization causes models to diverge within the first few steps. Companding is not an optimization—it is the structural requirement that makes aggressive quantization possible.
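The effect is easy to demonstrate. Below is an illustrative NumPy sketch using a μ-law-style curve as a stand-in compander (the talk does not specify the paper's actual companding function) on a heavy-tailed, non-negative state such as Adam's variance, normalized to [0, 1]. Naive linear 8-bit quantization collapses a large fraction of small values to exactly zero; companding first spreads them across the quantization grid.

```python
import numpy as np

MU = 255.0  # mu-law-style companding strength (illustrative choice)

def compress(v):
    # Map [0, 1] -> [0, 1], expanding small magnitudes before quantization.
    return np.log1p(MU * v) / np.log1p(MU)

def expand(y):
    # Inverse of compress, applied after dequantization.
    return np.expm1(y * np.log1p(MU)) / MU

def quant_u8(x):
    # Plain linear 8-bit quantization on [0, 1].
    return np.clip(np.round(x * 255.0), 0, 255).astype(np.uint8)

def dequant_u8(q):
    return q.astype(np.float32) / 255.0

np.random.seed(0)
# Synthetic heavy-tailed optimizer state spanning four decades.
v = (10.0 ** np.random.uniform(-4, 0, 10000)).astype(np.float32)

naive = dequant_u8(quant_u8(v))                    # quantize directly
comp = expand(dequant_u8(quant_u8(compress(v))))   # compand, then quantize

rel_naive = np.abs(naive - v) / v
rel_comp = np.abs(comp - v) / v
```

On this synthetic state, naive quantization zeroes out every value below about 2e-3, giving 100% relative error there, while the companded path keeps the relative error roughly uniform across decades. That per-coordinate error profile is why a second-moment estimate survives 8-bit storage only after companding.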
Applied to Llama-3.1-8B finetuning, FlashOptim delivers a 36% peak memory cut and halves checkpoint size. Per-parameter memory drops from 16 bytes to just 7, or even 5 bytes when gradient release is enabled.
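One per-parameter byte accounting consistent with those figures looks like the sketch below. The component labels are my reconstruction of a standard mixed-precision Adam layout, not a breakdown given in the talk.

```python
# Hypothetical byte accounting reproducing the 16 -> 7 -> 5 bytes-per-parameter
# figures quoted above (component labels are assumptions, not the authors').
baseline = {
    "fp32_master_weight": 4,
    "fp32_momentum": 4,
    "fp32_variance": 4,
    "bf16_weight": 2,
    "bf16_gradient": 2,
}  # classic mixed-precision Adam: 16 bytes per parameter

flashoptim = {
    "bf16_weight": 2,
    "u8_ulp_residual": 1,  # replaces the FP32 master copy via weight splitting
    "u8_momentum": 1,      # companded 8-bit optimizer state
    "u8_variance": 1,      # companded 8-bit optimizer state
    "bf16_gradient": 2,
}  # 7 bytes per parameter

baseline_total = sum(baseline.values())
flashoptim_total = sum(flashoptim.values())
# Gradient release frees the persistent gradient buffer as well.
gradient_release_total = flashoptim_total - flashoptim["bf16_gradient"]
```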
The critical result: across vision and language tasks, FlashOptim training curves match FP32 baselines exactly. No accuracy is lost, no convergence is slowed, and fused kernel implementations ensure the memory savings come with no throughput penalty.
FlashOptim shines on parameter-dominated models. For workloads where activations dominate memory, the gains are modest unless paired with activation checkpointing. And while robust across tested domains, some edge cases may still require selective use of higher precision for specific layers.
This work does more than save memory—it democratizes access to large-scale training. FlashOptim proves that 16 bytes per parameter is not a fundamental limit, and its modular design stacks with other memory optimizations for compounding benefits.
FlashOptim halves training memory by rethinking how we store what matters. To explore this research further and create your own AI paper presentations, visit EmergentMind.com.