Effectiveness of SAGE beyond language modeling

Determine the effectiveness of the SAGE (Sign-Adaptive Gradient) optimizer on non-language modalities such as computer vision, and on fine-tuning tasks, by evaluating its training stability and final performance in those settings.

Background

SAGE is proposed as a memory-efficient optimizer that replaces AdamW in hybrid light-state training by using a Lion-style update direction combined with an O(d) adaptive scale designed to handle sparse, high-variance embedding gradients. The empirical evaluation focuses on pretraining LLaMA-style LLMs on The Pile dataset, demonstrating improved perplexity and reduced memory usage over baselines.
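The source describes the update rule only at a high level. As a minimal sketch of what a Lion-style sign direction combined with an O(d) adaptive scale could look like, the NumPy snippet below illustrates the idea; the function name `sage_like_step`, the coefficients, and the exact damping formula are all assumptions for illustration, not the authors' published algorithm.

```python
import numpy as np

def sage_like_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.99):
    """One step of a hypothetical SAGE-style update (formulas are assumed).

    Combines a Lion-style sign direction with a single extra O(d) buffer
    `s` that damps coordinates whose gradients are persistently large.
    """
    m, s = state["m"], state["s"]
    # Lion-style direction: sign of an interpolation of momentum and gradient.
    direction = np.sign(beta1 * m + (1.0 - beta1) * grad)
    # O(d) adaptive scale: running mean of per-coordinate gradient magnitude.
    s = beta2 * s + (1.0 - beta2) * np.abs(grad)
    # Damping (assumed form): coordinates with a large running magnitude take
    # smaller steps, a stand-in for handling high-variance embedding gradients.
    new_param = param - lr * direction / (1.0 + s)
    # Lion-style slow momentum refresh.
    state["m"] = beta2 * m + (1.0 - beta2) * grad
    state["s"] = s
    return new_param
```

In this sketch the optimizer state is two length-d vectors (`m` and `s`), and each coordinate's step magnitude is bounded by `lr`, which is one reason sign-based methods are often considered stable. Whether SAGE's actual damping behaves well outside language pretraining is precisely the open question above.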

Because the experiments are limited to language modeling on The Pile, it is not established whether SAGE’s adaptive damping mechanism and hybrid design generalize effectively to other domains (e.g., vision) or to fine-tuning scenarios. Establishing cross-modality and fine-tuning effectiveness is necessary to assess SAGE’s broader applicability.

References

Finally, our analysis was confined to language modeling on The Pile dataset. The effectiveness of SAGE on other modalities (e.g., vision) or fine-tuning tasks remains an open question.

SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization (2604.07663 - Lee et al., 9 Apr 2026), in Limitations