Large-Batch Image-Language Pre-Training

Updated 17 June 2026

Loss functions such as SigLIP, nCLIP/xCLIP, and AmorLIP are designed to overcome batch-size bottlenecks in image-language pre-training.
Large-batch image-language pre-training uses tens of thousands to millions of image-text pairs per update to strengthen contrastive learning and enable robust zero-shot performance.
Efficient techniques such as random patch masking (FLIP), language-supervised token pruning (ELIP), and grouped batch aggregation reduce memory and communication challenges in distributed systems.

Large-batch image-language pre-training refers to the paradigm of training vision-language models with extremely large per-step batch sizes, often involving tens of thousands to over one million image-text pairs per update. This approach is motivated by both theoretical properties of contrastive learning and the practical desire for robust zero-shot transfer, fast convergence, and effective utilization of large-scale web data. Over the last several years, a suite of algorithmic innovations, architectural refinements, and systems-level techniques has been introduced to address the distinctive challenges posed by large-batch regimes: specifically, the computational, memory, and optimization bottlenecks of pairing massive data volume with state-of-the-art transformer architectures.

1. Motivations and Challenges of Large-Batch Regimes

Contrastive objectives, such as CLIP’s InfoNCE loss, require comparing each positive (paired image-text) sample to a large set of negatives (mismatched pairs) per batch. Accurate estimation of the softmax denominators and stable gradient signals are empirically and theoretically contingent on large negative pools; this creates a foundational need for large batch sizes (up to 64,000 for ViT-H/14 in OpenCLIP) in high-quality pre-training (Zhou et al., 2022, Li et al., 2022, Zhai et al., 2023, Sun et al., 2025). With standard distributed setups, scaling batch size is bottlenecked by:

Memory: Storing all embeddings and gradients across accelerator nodes can easily exceed device limits.
Communication: Full all-gather of representations every step introduces quadratic scaling in message volume.
Optimization: Large updates may cause instability, requiring tuning of $\beta_2$ in AdamW and careful management of momentum and learning rate schedules (Zhai et al., 2023).

This regime is further complicated by web data’s inherent label noise, multimodal ambiguity, and the potential misalignment between easy negative mining and representational robustness.
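
To make the dependence on in-batch negatives concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE loss. The function names, the temperature value of 0.07, and the single-process setting are illustrative assumptions, not details taken from the cited papers.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N paired embeddings.

    img_emb, txt_emb: arrays of shape (N, D); row i of each is a matched pair.
    temperature: assumed illustrative value, not prescribed by the text.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N): every image against every text
    # Each softmax denominator sums over all N candidates in the batch,
    # which is why the quality of the estimate is tied to the batch size.
    image_to_text = -np.diag(log_softmax(logits, axis=1)).mean()
    text_to_image = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (image_to_text + text_to_image)
```

In a distributed run the `(N, N)` logit matrix spans embeddings gathered from all devices, which is where the memory and communication costs listed above arise.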

2. Loss Function Innovations for Batch Efficiency

Early work in large-batch image-language pre-training relied on the symmetric InfoNCE (softmax) objective, but more recent research has introduced loss variants that relax the coupling to global batch size.

Sigmoid Loss (SigLIP): The pairwise sigmoid objective replaces InfoNCE’s cross-batch normalization with independent binary classification per (image, text) pair:

$L_{\rm sigmoid} = -\frac{1}{N^2} \sum_{i=1}^N\sum_{j=1}^N \log(\sigma(\tilde y_{ij} s_{ij})).$

Here $s_{ij}$ denotes the similarity logit between image $i$ and text $j$, and $\tilde y_{ij}$ is $+1$ for matched pairs ($i=j$) and $-1$ otherwise. Because each pair is scored independently, the loss needs no batch-wide normalization, which greatly simplifies distributed computation. Empirically, SigLIP matches the standard softmax loss at $N\approx 32$K and outperforms it at smaller $N$ (Zhai et al., 2023).
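
The pairwise sigmoid objective can be sketched in a few lines of NumPy. This follows the formula as written in this section (averaging over all $N^2$ pairs); the function name and the default `scale` and `bias` values are assumptions chosen for illustration.

```python
import numpy as np

def pairwise_sigmoid_loss(img_emb, txt_emb, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss averaged over all N*N (image, text) pairs.

    scale and bias are assumed illustrative values for a learnable
    logit scale and offset; they are not fixed by the text above.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    s = scale * (img @ txt.T) + bias   # similarity logits s_ij, shape (N, N)
    n = s.shape[0]
    y = 2.0 * np.eye(n) - 1.0          # +1 on the diagonal, -1 elsewhere
    # log(sigmoid(z)) = -log(1 + exp(-z)) = -logaddexp(0, -z)
    log_sigmoid = -np.logaddexp(0.0, -y * s)
    return -log_sigmoid.sum() / n**2
```

Each entry of the logit matrix contributes an independent binary term, so no row or column needs to see the full batch before its loss can be computed.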

Non-Contrastive and Hybrid Losses (nCLIP, xCLIP): nCLIP uses cross-modal cluster assignment and entropy regularization, removing explicit negatives and the need for large batches. When fused with contrastive loss in a multi-task xCLIP scheme, batch-size sensitivity is drastically reduced—e.g., xCLIP at $N=1024$ outperforms CLIP at $N=4096$ with $38.5\%$ vs $36.8\%$ zero-shot ImageNet (Zhou et al., 2022).

Amortized Partition Estimation (AmorLIP): Instead of summing over all negatives per batch, AmorLIP amortizes the partition function estimation via a small neural network, learning to predict the log-normalizer for each modality. This removes the batch-size bottleneck inherent to InfoNCE-type losses and permits accurate learning with moderate batches, providing $7\text{–}12\%$ relative accuracy gains and faster convergence, using only tens of GPUs (Sun et al., 2025).

3. Vision-Token and Model Efficiency for Larger Batches

Efficient token processing in transformer vision backbones directly enables larger batch sizes under memory constraints. Two principal approaches have been established:

Random Patch Masking (FLIP): FLIP randomly masks out a large fraction of ViT image patches per sample, letting the model process more distinct examples or larger batches within the same memory/compute envelope. Masking half of the patches doubles the batch size, halves step time, and yields a zero-shot ImageNet accuracy gain over the unmasked baseline. The procedure requires no architecture changes and preserves downstream task versatility (Li et al., 2022).

Language-Supervised Token Pruning (ELIP): ELIP progresses through the ViT layers, pruning and merging attention-selected tokens guided jointly by vision and language [CLS] features. With roughly 30\% of vision tokens removed, ELIP lowers epoch wall time at the cost of a small average accuracy drop and, crucially, frees enough memory to expand the per-GPU batch size, recovering (or even improving upon) baseline accuracy in pre-training (Guo et al., 2023).
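
The random patch masking step described for FLIP can be sketched as follows. This is a minimal NumPy illustration under assumed names and shapes (`patches` of shape batch × patches × dim, a `keep_ratio` parameter), not the reference implementation.

```python
import numpy as np

def random_patch_mask(patches, keep_ratio, rng):
    """Keep a random subset of patch tokens for every sample.

    patches: array of shape (B, P, D) holding P patch embeddings per image.
    keep_ratio: fraction of patches to keep (hypothetical parameter name).
    rng: a numpy.random.Generator controlling the random choice.
    Returns (kept, keep_idx) with shapes (B, K, D) and (B, K).
    """
    batch, num_patches, _ = patches.shape
    num_keep = max(1, int(round(num_patches * keep_ratio)))
    # Draw independent noise per sample; the smallest values are kept.
    noise = rng.random((batch, num_patches))
    keep_idx = np.argsort(noise, axis=1)[:, :num_keep]
    keep_idx.sort(axis=1)  # preserve the original patch order
    kept = np.take_along_axis(patches, keep_idx[:, :, None], axis=1)
    return kept, keep_idx
```

With `keep_ratio=0.5` the encoder sees half as many tokens per image, so roughly twice as many images fit in the same activation memory; only the kept tokens (plus their position information) are passed on to the transformer layers.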

4. Distributed Batch Aggregation and Communication Reduction

Approaches for decoupling effective batch size from device count and communication bottlenecks are central to scalable large-batch training:

Grouped Batch Aggregation (GBA-ITC, M$^2$-Encoder): The GBA-ITC strategy partitions the accelerator cluster into groups that exchange embeddings locally, accumulates gradients over multiple steps, and computes the contrastive loss over these local negative pools. The effective batch size is then set by the per-GPU batch size, the group size, and the number of gradient accumulation steps rather than by the total device count. This reduces communication overhead and device memory requirements, yielding an overall speedup and halving peak memory on A100 GPUs. M$^2$-Encoder leverages this to pre-train with effective batches of roughly 130K, supporting multilingual and billion-scale corpus training (Guo et al., 2024).

Chunked or Ring-Shift Negative Sharing (SigLIP): By distributing the computation of negatives in small chunks or through sequential ring-shifts of embeddings, large batch sizes are achieved without ever materializing the full batch similarity matrix in memory (Zhai et al., 2023).
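
The chunking idea can be illustrated on a single process: because the pairwise sigmoid loss is a plain sum over pairs, it can be accumulated one block of text embeddings at a time. The sketch below is a single-device analogy under assumed names (`chunk_size`, `scale`, `bias`) and assumes the inputs are already L2-normalized; in an actual multi-device setup each chunk would arrive from a neighbouring device rather than from a slice.

```python
import numpy as np

def chunked_sigmoid_loss(img, txt, chunk_size, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss accumulated over chunks of text embeddings.

    img, txt: L2-normalized arrays of shape (N, D); row i of each is a pair.
    Only an (N, chunk_size) block of logits exists at any one time.
    """
    n = img.shape[0]
    total = 0.0
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        s = scale * (img @ txt[start:stop].T) + bias  # (N, stop - start)
        y = -np.ones_like(s)                          # negatives by default
        rows = np.arange(start, stop)
        y[rows, rows - start] = 1.0                   # matched pairs in this chunk
        total += np.logaddexp(0.0, -y * s).sum()      # sum of -log(sigmoid(y * s))
    return total / n**2
```

The result equals the loss computed from the full `(N, N)` matrix up to floating-point summation order, while peak memory for logits scales with `N * chunk_size` instead of `N * N`.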

5. Empirical Scaling Laws and Best Practices

A convergence of empirical findings indicates that zero-shot transfer and retrieval performance saturate at batch sizes of around $N\approx 32$K, with negligible gains beyond this scale even up to $N\approx 1$M (Zhai et al., 2023). Effective large-batch pre-training also depends on:

Tuning the optimizer's $\beta_2$ for stability at large $N$.
Using weight decay only on newly initialized (not frozen) components.
Limiting per-device batch chunk size to maintain memory headroom.
Preferring multi-task and hybrid objectives to reduce batch sensitivity.
Exploiting memory savings from pruning/masking to further increase batch size when hardware-constrained.

A summary of downstream metrics across methods and scaling regimes:

Model/Approach	Batch Size	Zero-Shot ImageNet (%)	Retrieval R@1 (Flickr30K/COCO)	Memory Change	Speedup	Reference
CLIP (baseline)	4096	45.7	73.8 / 59.4	Baseline	1.0×	(Zhou et al., 2022)
FLIP	32K	69.6	89.1 / 75.4	Neutral	2–4×	(Li et al., 2022)
xCLIP	1024	38.5	77.5 / 63.3	+27%	0.77×	(Zhou et al., 2022)
ELIP (+large batch)	232 (ex)	-	+0.1/–0.3 vs baseline	–20 GB	+10–15%	(Guo et al., 2023)
AmorLIP	2K–32K	+7–12% rel. on 38 tasks	-	+0.5%	+13–30%	(Sun et al., 2025)
M²-Encoder-10B	131K+	88.5	96.9 / 96.9	–50% peak	+60%	(Guo et al., 2024)
SigLIP	32K	73.4	-	Lower	-	(Zhai et al., 2023)

6. Multimodality and Data Scale: Beyond English, Beyond Millions

The frontier of large-batch image-language pre-training is shaped by expansions in dataset size and diversity (e.g., BM-6B with 6B bilingual pairs (Guo et al., 2024)), and in proxy-task richness. Multi-task frameworks incorporating cross-modal masked language and image modeling (CMIM, CMLM) provide fine-grained alignment and robustness. M$^2$-Encoder demonstrates that such large-scale, bilingual pre-training (effective batch $\approx$130K) consistently advances zero-shot top-1 accuracy (88.5% on ImageNet, 80.7% on ImageNet-CN) and retrieval metrics (96.9% MR), setting new benchmarks (Guo et al., 2024).

A plausible implication is that combinations of large-scale, hybrid-criterion pre-training and efficient batch aggregation will further drive cross-lingual and multi-domain generalization, provided communication and optimizer bottlenecks are addressed.

7. Ongoing Limitations and Research Directions

While large-batch regimes have pushed transfer performance and scalability, diminishing returns beyond $N\approx 32$K and significant hardware/energy costs remain (Zhai et al., 2023). Open challenges include:

Automating batch size selection based on task phase or compute constraints.
Improving amortization objectives, e.g., via more expressive models or hard-negative mining strategies.
Extending parameter-efficient pre-training to under-resourced languages and domains.
Integrating structured negative pools, dynamic curriculum schedules, or data filtering to further improve sample efficiency and robustness.

Continued research is focused on loss functions that decouple or amortize the negative sampling burden, token pruning mechanisms, and distributed-system techniques that harmonize bandwidth, memory, and compute utilization at cluster scale. Recent works such as AmorLIP suggest that the future of large-batch image-language pre-training will be characterized by learned, modular approximations to global objectives, scalable to ever-larger datasets and modalities (Sun et al., 2025).