- The paper introduces MARS, which combines stochastic recursive momentum with preconditioned gradient methods to reduce gradient variance during neural network training.
- It demonstrates significant improvements over AdamW in training GPT-2 models, achieving lower validation losses with fewer training tokens.
- The approach paves the way for more efficient AI training by enhancing convergence speed and stability in large-scale model optimization.
An Expert Overview of "MARS: Unleashing the Power of Variance Reduction for Training Large Models"
The paper "MARS: Unleashing the Power of Variance Reduction for Training Large Models" presents a novel optimization framework designed to make training large neural networks more efficient, targeting the points where traditional adaptive gradient methods such as Adam and AdamW fall short. The authors propose MARS (Make Variance Reduction Shine), which integrates variance reduction techniques with preconditioned gradient methods, and demonstrate the approach on large language model (LLM) training.
Introduction and Need
Traditional adaptive gradient methods like Adam and AdamW are widely used because they dynamically adjust per-parameter learning rates, which facilitates efficient convergence. However, these methods still suffer from high stochastic gradient variance, a problem that is particularly evident in large-scale LLM training, where each update is computed from a small mini-batch of a massive, heterogeneous corpus. Integrating variance reduction techniques is therefore a natural opportunity to dampen this gradient noise and improve convergence.
The MARS Framework
MARS is presented as a unified framework that incorporates stochastic recursive momentum (in the spirit of STORM) into adaptive gradient methods. The framework rests on two elements: a scaled stochastic recursive momentum (SSR momentum) that yields a variance-reduced gradient estimator, and a preconditioned gradient update that approximates the curvature information exploited by second-order methods such as Newton's method. Together, these components are intended to improve both convergence speed and stability during training.
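To make the first element concrete, the following is a minimal NumPy sketch of an SSR-style momentum estimator: the current stochastic gradient is corrected by a scaled gradient difference computed on the same mini-batch, clipped, and folded into an exponential moving average. The function name, the unit-norm clipping threshold, and the hyperparameter values are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def ssr_momentum(grad_t, grad_prev, m_prev, beta1=0.95, gamma=0.025):
    # Variance-reduced gradient estimator: the current stochastic gradient
    # plus a scaled correction built from the gradient difference, where both
    # gradients are evaluated on the *same* mini-batch at consecutive iterates.
    c_t = grad_t + gamma * (beta1 / (1.0 - beta1)) * (grad_t - grad_prev)

    # Clip the estimator to unit norm so the correction term cannot explode
    # (an assumed practical safeguard in this sketch).
    norm = np.linalg.norm(c_t)
    if norm > 1.0:
        c_t = c_t / norm

    # Exponential moving average of the variance-reduced estimator.
    m_t = beta1 * m_prev + (1.0 - beta1) * c_t
    return m_t, c_t
```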
Specific Instantiations
The paper explores three specific implementations under the MARS framework:
- MARS-AdamW: Combines AdamW's diagonal second-moment preconditioner and decoupled weight decay with SSR momentum, yielding a noticeable improvement over plain AdamW once the scaling parameter of the variance-reduction term is tuned (sketched after this list).
- MARS-Lion: Builds on the Lion optimizer, whose sign-based update serves as a simpler preconditioner than AdamW's; applying SSR momentum ahead of that sign update shows that the framework carries over to this family of optimizers.
- MARS-Shampoo: Extends the preconditioning to full-matrix approximations in the style of Shampoo, applying the variance-reduced gradient estimator within an eigenspace preconditioner.
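For illustration, the sketch below folds the variance-reduced estimator into an AdamW-style step, roughly following the MARS-AdamW recipe described above; the function name, the hyperparameter defaults, and the exact placement of bias correction and decoupled weight decay are assumptions in the spirit of standard AdamW rather than the paper's exact pseudocode.

```python
import numpy as np

def mars_adamw_step(x, grad_t, grad_prev, m, v, t,
                    lr=3e-3, beta1=0.95, beta2=0.99,
                    gamma=0.025, eps=1e-8, weight_decay=0.1):
    # Variance-reduced estimator on a shared mini-batch (same form as above).
    c = grad_t + gamma * (beta1 / (1.0 - beta1)) * (grad_t - grad_prev)
    norm = np.linalg.norm(c)
    if norm > 1.0:
        c = c / norm

    # AdamW-style first and second moments, driven by the variance-reduced c.
    m = beta1 * m + (1.0 - beta1) * c
    v = beta2 * v + (1.0 - beta2) * c * c

    # Standard bias correction (t is the 1-indexed step count).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Diagonal preconditioning plus decoupled weight decay, as in AdamW.
    x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x)
    return x, m, v
```

Loosely speaking, a MARS-Lion variant would replace the diagonal preconditioning with a sign-based update, and MARS-Shampoo would apply a full-matrix (eigenspace) preconditioner to the same variance-reduced momentum.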
Experimental Results
The experiments train GPT-2 models from scratch on the OpenWebText dataset. MARS consistently outperformed AdamW, reaching lower validation losses with fewer training tokens. For instance, MARS achieved a validation loss of 2.53 for GPT-2 large, compared to 2.56 for AdamW, within a budget of 27 billion training tokens, a meaningful gain in training efficiency at this scale. MARS also achieved higher accuracy than AdamW on the HellaSwag downstream task.
Implications and Future Outlook
MARS marks a shift in how stochastic gradient variance is handled in large-model training. By fusing preconditioned gradient methods with variance reduction, the framework improves the computational efficiency and practical utility of adaptive optimizers, especially in settings that involve long training runs over heterogeneous data distributions.
Looking forward, this approach opens research directions in tuning the variance-control parameters, scaling to architectures larger than those tested, and integrating with future state-of-the-art models. As AI continues to scale, the principles introduced in MARS point toward continued innovation in optimization strategies.