Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining (2505.11293v1)

Published 16 May 2025 in cs.CV

Abstract: Contrastive learning (CL) is a prevalent technique for training embedding models, which pulls semantically similar examples (positives) closer in the representation space while pushing dissimilar ones (negatives) further apart. A key source of negatives are 'in-batch' examples, i.e., positives from other examples in the batch. Effectiveness of such models is hence strongly influenced by the size and quality of training batches. In this work, we propose 'Breaking the Batch Barrier' (B3), a novel batch construction strategy designed to curate high-quality batches for CL. Our approach begins by using a pretrained teacher embedding model to rank all examples in the dataset, from which a sparse similarity graph is constructed. A community detection algorithm is then applied to this graph to identify clusters of examples that serve as strong negatives for one another. The clusters are then used to construct batches that are rich in in-batch negatives. Empirical results on the MMEB multimodal embedding benchmark (36 tasks) demonstrate that our method sets a new state of the art, outperforming previous best methods by +1.3 and +2.9 points at the 7B and 2B model scales, respectively. Notably, models trained with B3 surpass existing state-of-the-art results even with a batch size as small as 64, which is 4-16x smaller than that required by other methods.

Summary

  • The paper introduces B3, a smart batch mining strategy that constructs batches with inherently strong negatives using teacher ranking and METIS clustering.
  • The method achieves state-of-the-art results on the MMEB benchmark by enabling efficient training with smaller batch sizes and reduced compute.
  • Experimental evaluations demonstrate B3’s effectiveness across multimodal tasks such as image retrieval, classification, and VQA.

Here is a detailed summary of the paper "Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining" (2505.11293).

Contrastive learning (CL) is a widely used technique for training embedding models by minimizing the distance between similar examples (positives) and maximizing the distance between dissimilar examples (negatives) in the representation space. A crucial source of negatives comes from other examples within the same training batch, known as in-batch negatives. The effectiveness of CL heavily depends on the size and quality of these training batches. While mining explicit "hard negatives" from the dataset can improve performance, this process is computationally expensive, especially in multimodal settings involving high-resolution images. Existing multimodal CL methods often avoid extensive hard negative mining across the entire dataset, relying instead on in-batch negatives, synthetic data, or resampling within the batch. This limits their ability to leverage strong negative signals outside the current batch.
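
To make the in-batch negative mechanism concrete, the sketch below shows a standard InfoNCE loss in which every other example's positive in the batch serves as a negative for a given query. This is generic contrastive-learning code, not the paper's implementation; the temperature value is a placeholder.

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb, pos_emb, temperature=0.05):
    """Generic InfoNCE with in-batch negatives (illustrative, not the paper's code).

    query_emb, pos_emb: (B, D) embeddings; row i of pos_emb is the positive for row i
    of query_emb, and every other row acts as an in-batch negative.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature  # (B, B): diagonal = positives, off-diagonal = in-batch negatives
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```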

The paper proposes Breaking the Batch Barrier (B3), a novel batch construction strategy designed to create high-quality batches that are inherently rich in strong in-batch negatives, thereby reducing or eliminating the need for separate, expensive hard negative mining.

The core methodology of B3 involves an offline preprocessing step on the training dataset:

  1. Teacher Ranking: A pre-trained teacher embedding model computes similarity scores and ranks all $N$ examples in the dataset against every other example (each acting as a query). This produces a rank matrix $R \in \mathbb{N}^{N \times N}$.
  2. Rank Filtering: To avoid false negatives (semantically similar items incorrectly ranked highly), the top $p$ ranks are excluded for each query. A subset of the next $m$ ranks is selected, focusing on examples the teacher model considers moderately similar, which are likely strong negatives.
  3. Sparse Graph Construction: The selected ranks form a sparse preference graph $S$, whose edges indicate potential strong-negative relationships.
  4. Community Detection: A community detection algorithm, specifically METIS, is applied to this graph to identify clusters (communities) of examples that are mutually strong negatives. METIS optimizes for minimum cuts, maximizing the number of edges within communities, so it groups examples that the teacher model considers related but that are not true positives. The cluster size is denoted by $K$.
  5. Batch Formation: Training batches of size $|B|$ are constructed by sampling examples from $|B|/K$ distinct communities identified in the previous step, ensuring that examples within a batch are likely to be strong negatives for one another (a hedged code sketch of this pipeline follows the list).
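
To make the pipeline concrete, the following sketch strings the five steps together. It is a minimal illustration of the procedure as described above, not the authors' code: it assumes the `pymetis` bindings for METIS, uses cosine similarity for the teacher scores, materializes the full similarity matrix for clarity, and uses placeholder values for $p$, $m$, and $K$.

```python
import numpy as np
import pymetis  # assumed METIS binding; the paper only states that METIS is used

def mine_b3_batches(teacher_emb, batch_size, cluster_size, skip_top_p=5, keep_next_m=20, seed=0):
    """Hedged sketch of the B3 offline batch-mining pipeline (illustrative values for p, m, K)."""
    rng = np.random.default_rng(seed)
    N = teacher_emb.shape[0]
    emb = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)

    # Steps 1-2: teacher ranking + rank filtering. For each query, keep the m neighbours
    # that come after the top-p most similar items (likely strong, not false, negatives).
    sims = emb @ emb.T                      # full matrix for clarity; chunked/parallelized in practice
    np.fill_diagonal(sims, -np.inf)
    order = np.argsort(-sims, axis=1)       # descending-similarity ranking per query
    kept = order[:, skip_top_p:skip_top_p + keep_next_m]

    # Step 3: sparse preference graph S as a symmetrised adjacency list.
    adjacency = [set() for _ in range(N)]
    for i in range(N):
        for j in kept[i]:
            adjacency[i].add(int(j))
            adjacency[int(j)].add(i)
    adjacency = [sorted(nbrs) for nbrs in adjacency]

    # Step 4: METIS community detection into roughly size-K communities (minimum-cut objective).
    n_parts = max(2, N // cluster_size)
    _, membership = pymetis.part_graph(n_parts, adjacency=adjacency)
    communities = {}
    for idx, part in enumerate(membership):
        communities.setdefault(part, []).append(idx)

    # Step 5: batch formation, filling each batch of size |B| from whole communities
    # so that roughly |B|/K communities contribute to every batch.
    parts = list(communities.values())
    rng.shuffle(parts)
    batches, current = [], []
    for community in parts:
        current.extend(community)
        while len(current) >= batch_size:
            batches.append(current[:batch_size])
            current = current[batch_size:]
    return batches
```

Filling batches from whole shuffled communities is a simplification: the paper describes sampling from $|B|/K$ distinct communities per batch, which this chunking approximates when community sizes are close to $K$.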

The paper provides theoretical justification based on minimizing the difference between the global InfoNCE loss (computed over the entire dataset) and the batch InfoNCE loss. The analysis suggests that maximizing the sum of exponentiated similarity scores over the top $K$ in-batch negatives, denoted $H^K_{B_i,i}$, tightens the bound on this difference. The METIS-based clustering pursues exactly this by co-locating mutually strong negatives (those with high similarity scores according to the teacher, and hence large contributions to $H^K_{B_i,i}$) within the same batch. An empirically tuned choice of $K$ is recommended, since prior theoretical work indicates that very large values can increase intra-class covariance and hurt downstream performance.
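
For reference, the quantities involved can be written out as follows; the notation is reconstructed from the description above (with $s(\cdot,\cdot)$ a similarity score and $\tau$ a temperature) rather than copied verbatim from the paper:

$$
\mathcal{L}_{\text{InfoNCE}}(B) = -\frac{1}{|B|}\sum_{i \in B} \log \frac{\exp\!\big(s(q_i, p_i)/\tau\big)}{\exp\!\big(s(q_i, p_i)/\tau\big) + \sum_{j \in B,\, j \neq i} \exp\!\big(s(q_i, p_j)/\tau\big)}, \qquad H^{K}_{B_i,i} = \sum_{j \in \operatorname{top-}K(B_i)} \exp\!\big(s(q_i, p_j)/\tau\big),
$$

where $\operatorname{top-}K(B_i)$ denotes the $K$ in-batch negatives most similar to query $i$. Making $H^K_{B_i,i}$ large is precisely what the community-based batching aims for.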

The paper introduces two variants:

  • B3: The core batch mining strategy described above, combined with improved prompting for positive examples (see below).
  • B3++: Extends B3 by optionally incorporating additional hard negatives sampled from the filtered rank structure $S$, drawn from a unified probability distribution based on aggregated batch preferences rather than sampled independently for each query (a hedged sketch of this step follows the list).
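
The exact sampling distribution is not spelled out above, so the sketch below illustrates one plausible reading of the B3++ step: preference scores from the filtered rank structure $S$ are summed across all queries in the batch, and extra hard negatives are drawn from that single aggregated distribution. The representation of $S$ and all names here are placeholders, not the paper's implementation.

```python
import numpy as np

def sample_unified_hard_negatives(batch_indices, pref_scores, num_extra, seed=0):
    """Illustrative B3++-style step: draw extra hard negatives from one distribution
    built by aggregating the batch's preference scores, instead of sampling per query.

    pref_scores: dict mapping query index -> {candidate index: score}, a placeholder
    representation of the filtered rank structure S.
    """
    rng = np.random.default_rng(seed)
    in_batch = set(batch_indices)

    # Aggregate the preferences of all queries in the batch over out-of-batch candidates.
    aggregated = {}
    for q in batch_indices:
        for cand, score in pref_scores.get(q, {}).items():
            if cand not in in_batch:
                aggregated[cand] = aggregated.get(cand, 0.0) + score

    if not aggregated:
        return []
    candidates = np.array(list(aggregated.keys()))
    weights = np.array(list(aggregated.values()), dtype=float)
    probs = weights / weights.sum()
    k = min(num_extra, len(candidates))
    return list(rng.choice(candidates, size=k, replace=False, p=probs))
```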

An additional contribution is the use of Improved Representation Prompts for positive examples, not just queries, particularly for classification and VQA tasks. This aims to decouple diverse tasks during training and ensure consistent representation encoding.

The methods are evaluated on the Massive Multimodal Embedding Benchmark (MMEB), comprising 36 tasks across retrieval, classification, VQA, and grounding, using Qwen2-VL and InternVL3 backbones (2B and 7B).

Key experimental results demonstrate the effectiveness of B3:

  • State-of-the-Art Performance: B3++ achieves new state-of-the-art results on the MMEB benchmark, outperforming previous best methods by +2.9 points at the 2B scale and +1.3 points at the 7B scale.
  • Efficiency at Small Batch Sizes: B3 allows training with significantly smaller batch sizes while maintaining high performance. Models trained with B3 at a batch size of 64, which is 4-16x smaller than other methods require, surpass the previous 2B state of the art (LLaVE) trained with its default settings. This is a crucial advantage for training on limited hardware.
  • Compute Efficiency: B3 (without explicit hard negatives) performs comparably to or better than a random batch baseline augmented with 5 hard negatives per query, while requiring roughly half the training compute.
  • Retrieval Performance: B3++ also shows strong performance on short and long image caption retrieval datasets (Flickr, COCO, Urban1k), outperforming other baselines.
  • Ablations: Studies show the importance of the batch mining itself, the empirical tuning of the cluster size $K$, and the benefit of higher image resolution. Interestingly, using a significantly stronger teacher model did not yield notable performance gains in this framework.

The B3 methodology is implemented as an offline preprocessing step. Teacher ranking is parallelizable ($\mathcal{O}(N \log N)$ time for sorting ranks), and METIS clustering is approximately linear ($\mathcal{O}(N)$), making the batch mining process efficient and scalable and allowing mined batches to be stored ahead of training.
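
As a practical note, the full $N \times N$ similarity matrix never needs to be materialized: similarities can be computed in chunks and only the top $p + m$ neighbours per query kept, which keeps the mined graph sparse and cheap to store. The sketch below illustrates this pattern; the chunk size and function name are placeholders, not details from the paper.

```python
import numpy as np

def top_ranked_neighbours(emb, skip_top_p, keep_next_m, chunk=1024):
    """Chunked teacher ranking: for every example, return the m neighbours that follow
    the top-p most similar ones, without storing the full N x N similarity matrix.
    Assumes skip_top_p + keep_next_m < N.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    N = emb.shape[0]
    k = skip_top_p + keep_next_m
    kept = np.empty((N, keep_next_m), dtype=np.int64)
    for start in range(0, N, chunk):
        block = emb[start:start + chunk] @ emb.T      # (chunk, N) similarities
        for r, row in enumerate(block):
            row[start + r] = -np.inf                  # drop self-similarity
            top = np.argpartition(-row, k)[:k]        # unordered top-k candidates
            top = top[np.argsort(-row[top])]          # order the k candidates by similarity
            kept[start + r] = top[skip_top_p:k]       # filtered ranks (step 2)
    return kept
```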

Limitations include the initial computational cost for the dataset-wide ranking and clustering, though this is a one-time preprocessing cost.

In conclusion, B3 effectively leverages the entire training dataset to curate batches with strong in-batch negatives through graph-based community detection, improving contrastive learning for multimodal embeddings. Its ability to achieve high performance with smaller batch sizes represents a significant practical benefit.
