• Learn how to use gradient accumulation to train models with large batch sizes when GPU memory is a concern.
  • This technique is useful when working with limited hardware resources, such as single GPU setups.

Key terms:

  • Gradient accumulation: A way to virtually increase the batch size during training by computing gradients for smaller batches and accumulating them over multiple iterations before updating the model weights.
  • Batch size: The number of training examples used in a single iteration of model training.
  • GPU memory: The amount of memory available on a graphics processing unit (GPU) for storing and processing data during model training.
  • Tensor sharding: Distributing model weights and computations across different devices to work around GPU memory limitations.
  • BLOOM: An open-source alternative to GPT-3, a large language model for various tasks such as text classification.


GPT-3 PyTorch Training single GPU multi-GPU Hardware Limitations Tensor Sharding Mixed Precision Training multi-GPU training Training Strategy