Learn how to use gradient accumulation to train models with large batch sizes when GPU memory is a concern.
This technique is useful when working with limited hardware resources, such as a single-GPU setup.
Key terms:
Gradient accumulation: A way to virtually increase the effective batch size during training by computing gradients for smaller micro-batches and accumulating them over multiple iterations before updating the model weights (see the sketch after this list).
Batch size: The number of training examples used in a single iteration of model training.
GPU memory: The amount of memory available on a graphics processing unit (GPU) for storing and processing data during model training.
Tensor sharding: Distributing model weights and computations across different devices to work around GPU memory limitations.
BLOOM: An open-source large language model, an alternative to GPT-3, used for various tasks such as text classification.
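
Below is a minimal sketch of the gradient accumulation loop, assuming PyTorch as the training framework; the model, optimizer, dummy data, and `accumulation_steps` value are placeholders for illustration, not part of the original text. The key points it shows: the per-micro-batch loss is divided by the number of accumulation steps, gradients build up in the parameters' `.grad` buffers across successive `backward()` calls, and the optimizer only steps (and zeroes gradients) once per virtual batch.

```python
import torch
from torch import nn

# Placeholder model, optimizer, and loss -- stand-ins for a real training setup.
model = nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 8  # virtual batch size = micro-batch size * 8

# Dummy micro-batches of size 4; in practice these come from a DataLoader.
micro_batches = [
    (torch.randn(4, 128), torch.randint(0, 2, (4,)))
    for _ in range(32)
]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches, start=1):
    outputs = model(inputs)
    # Scale the loss so the accumulated gradient equals the average over the
    # full virtual batch instead of the sum of micro-batch gradients.
    loss = loss_fn(outputs, targets) / accumulation_steps
    loss.backward()  # gradients accumulate in each parameter's .grad buffer

    if step % accumulation_steps == 0:
        optimizer.step()       # one weight update per virtual batch
        optimizer.zero_grad()  # reset gradients for the next virtual batch
```

Dividing the loss by `accumulation_steps` keeps the gradient magnitude comparable to a single large batch, so learning-rate settings tuned for the large batch size carry over; only the peak memory use changes, since each backward pass only ever holds activations for one micro-batch.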