- The paper introduces a novel distributed contrastive loss that reduces memory complexity from O(B²) to O(B²/N), enabling large-batch training of CLIP models.
- It achieves efficiency by computing intra-GPU gradients locally and aggregating inter-GPU gradients via all_reduce operations, while remaining mathematically equivalent to the non-distributed contrastive loss.
- Experiments using 64 A100 GPUs demonstrate improved scalability and zero-shot ImageNet accuracy, highlighting its potential for advanced vision-language pre-training.
DisCo-CLIP: A Distributed, Memory-Efficient Approach to CLIP Training
The paper "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training" introduces DisCo-CLIP, a distributed, memory-efficient approach to training CLIP-like models. The method substantially reduces the memory consumed by the contrastive loss computation, making large-batch CLIP training feasible on memory-constrained hardware.
Methodology
DisCo-CLIP decomposes the contrastive loss into intra-GPU and inter-GPU components: intra-GPU gradients are computed locally on each device, while inter-GPU gradients are aggregated through all_reduce operations. This reduces the memory required for the contrastive loss computation from O(B²) to O(B²/N), where B is the global batch size and N is the number of GPUs, because each GPU only materializes the (B/N) × B block of the similarity matrix corresponding to its local samples. The decomposed computation is mathematically equivalent to the original non-distributed contrastive loss, so accuracy is preserved while training scales across multiple GPUs.
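To make the decomposition concrete, below is a minimal PyTorch sketch of the general pattern, not the authors' implementation: features are all-gathered without ever building the full B × B similarity matrix, each GPU computes the loss only for its own rows, and the backward pass all_reduces the gradients of the gathered features so the result matches a non-distributed computation. The names `GatherWithGrad` and `disco_style_clip_loss` are illustrative, and the sketch assumes `torch.distributed` is initialized and that parameter gradients are subsequently averaged across GPUs (e.g., by DDP).

```python
# Minimal, hedged sketch of the intra/inter-GPU split described above.
# Not the authors' code; names and gradient bookkeeping are simplified.
import torch
import torch.distributed as dist
import torch.nn.functional as F


class GatherWithGrad(torch.autograd.Function):
    """All-gather features in forward; all-reduce their gradients in backward.

    The backward all_reduce sums the partial (inter-GPU) gradients that every
    GPU computed for every feature, then returns the slice belonging to the
    local GPU, so each feature receives the same gradient it would get in a
    single-device computation over the full batch.
    """

    @staticmethod
    def forward(ctx, local_feats):
        world_size = dist.get_world_size()
        gathered = [torch.zeros_like(local_feats) for _ in range(world_size)]
        dist.all_gather(gathered, local_feats)
        ctx.rank = dist.get_rank()
        ctx.local_bs = local_feats.shape[0]
        return torch.cat(gathered, dim=0)            # shape (B, d) with B = N * b

    @staticmethod
    def backward(ctx, grad_all):
        grad_all = grad_all.contiguous()
        dist.all_reduce(grad_all, op=dist.ReduceOp.SUM)   # aggregate inter-GPU gradients
        start = ctx.rank * ctx.local_bs
        return grad_all[start:start + ctx.local_bs]       # gradient for the local slice


def disco_style_clip_loss(img_local, txt_local, temperature=0.07):
    """Contrastive loss that only materializes two (b x B) logit blocks per GPU."""
    img_local = F.normalize(img_local, dim=-1)
    txt_local = F.normalize(txt_local, dim=-1)
    img_all = GatherWithGrad.apply(img_local)        # (B, d); no (B x B) logits anywhere
    txt_all = GatherWithGrad.apply(txt_local)

    logits_i2t = img_local @ txt_all.t() / temperature    # (b, B): local images vs all texts
    logits_t2i = txt_local @ img_all.t() / temperature    # (b, B): local texts vs all images

    rank, b = dist.get_rank(), img_local.shape[0]
    targets = torch.arange(b, device=img_local.device) + rank * b
    return 0.5 * (F.cross_entropy(logits_i2t, targets) +
                  F.cross_entropy(logits_t2i, targets))
```

In this pattern only two b × B logit blocks live on each GPU at any time, which is where the O(B²/N) memory figure comes from.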
Results
The paper provides empirical evidence for the efficacy of DisCo-CLIP. With 64 A100 40GB GPUs, DisCo-CLIP can train a ViT-B/32 model with a batch size of 196,608, which traditional CLIP implementations could not reach without significantly more GPUs. Additionally, a model trained on LAION-400M with a batch size of 65,536 achieves 64.3% zero-shot classification accuracy on ImageNet, improving upon results obtained with smaller batch sizes.
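A rough back-of-the-envelope calculation (ours, assuming fp16 logits and counting only one of the two similarity matrices) illustrates why such a batch size is out of reach for a vanilla implementation on 40GB cards:

```python
# Memory for the contrastive logit matrix at B = 196,608 with N = 64 GPUs,
# assuming fp16 (2 bytes per element); gradients and softmax buffers would
# add further to the baseline figure.
B, N, bytes_per_el = 196_608, 64, 2
full_gib = B * B * bytes_per_el / 2**30            # ~72 GiB per GPU for a full B x B matrix
block_gib = (B // N) * B * bytes_per_el / 2**30    # ~1.1 GiB per GPU for a (B/N) x B block
print(f"full B x B: {full_gib:.1f} GiB, per-GPU block: {block_gib:.2f} GiB")
```

Under these assumptions, a single full B × B fp16 matrix alone would exceed an A100 40GB card, whereas the per-GPU block is comfortably small, which is consistent with the reported scalability.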
Implications and Future Directions
This research has substantial implications for the field of vision-language pre-training. By enabling large-batch training without excessive computational resources, DisCo-CLIP represents a valuable tool for both industry practitioners and academic researchers who may not have access to extensive hardware capabilities. The approach outlined has potential applications in various fields reliant on large-scale contrastive learning, such as image classification, text-image retrieval, and generation models.
Future research could explore integrating DisCo-CLIP with other memory-saving techniques such as gradient accumulation (GradAccum) and gradient checkpointing, or applying it to contrastive learning frameworks beyond CLIP. Moreover, extending DisCo-CLIP to accommodate model parallelism or advances such as reversible networks could enable even larger models to be trained on constrained hardware, pushing the boundaries of what is achievable in distributed training.
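As one example of such a combination, gradient checkpointing can be layered under a memory-efficient contrastive loss to cut activation memory as well. The sketch below uses PyTorch's `torch.utils.checkpoint`; `VisionTower` is a hypothetical module for illustration, not part of DisCo-CLIP or CLIP's released code.

```python
# Hedged illustration of activation (gradient) checkpointing on an encoder tower.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class VisionTower(nn.Module):
    def __init__(self, dim=512, depth=12):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        for blk in self.blocks:
            # Recompute each block's activations during backward instead of
            # storing them, trading extra compute for memory headroom.
            x = checkpoint(blk, x, use_reentrant=False)
        return x
```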
Overall, DisCo-CLIP sets a precedent for efficient, scalable training of large-scale models and is a promising advance toward removing the memory bottlenecks of contrastive learning.