Concept-Aware Batch Sampling (CABS)

Updated 27 November 2025

Concept-Aware Batch Sampling (CABS) is a dynamic batch curation paradigm that replaces uniform IID sampling with online, concept-driven filtering to optimize vision–language pretraining.
It leverages fine-grained concept annotations from the DataConcept dataset and employs a two-stage selection process with variants like CABS-DM for diversity and CABS-FM for frequency.
Empirical results demonstrate that CABS significantly improves classification and retrieval benchmarks by enhancing long-tail concept exposure and compositional scene diversity.

Concept-Aware Batch Sampling (CABS) is a batch curation paradigm developed for vision–LLM pretraining that replaces uniform IID sampling with online, concept-driven filtering. This approach leverages fine-grained concept annotations to construct batches that optimize for downstream objectives, such as broad concept coverage or scene compositionality. CABS builds upon a web-scale annotated dataset, DataConcept, to facilitate flexible, task-adaptive pretraining and serves as an open-source alternative to proprietary online curation algorithms (Ghosh et al., 25 Nov 2025).

1. DataConcept: Foundation for Concept-Aware Sampling

DataConcept is a web-scale pool comprising 128 million image–text pairs, each annotated with fine-grained object categories and their spatial localizations. The construction pipeline involves:

Concept Bank Assembly: Merging lists from RAM++ (4,585 tags), V3Det, OpenImages, and other sources yields an open-vocabulary bank $V$ with approximately 19,261 candidate concepts after synonym clustering (cosine threshold 0.95), WordNet-based merging, and exclusion of unsafe terms.
Automated Tagging: Using RAM++, each sample $(I_i, T_i)$ is tagged with a preliminary concept set $C^0_i \subset V$ (confidence $>0.75$ ).
Multi-Resolution Grounding: GroundingDINO is executed at four resolutions $\{384, 512, 800, 1000\}$ , seeding only $C^0_i$ tags, with bounding boxes retained if box-score and text-alignment both exceed $0.27$. Weighted Box Fusion (WBF) and optional post-NMS pruning (IoU $>0.50$ ) are applied.
Concept-Aware Recaptioning: Qwen2-VL-7B generates recaptions conditioned on both the alt-text $T_i$ and detected concept set $C_i$ . These recaptions ( $R_i$ ) achieve higher coverage (51% exact concept-match vs. 3.9% in raw alt-text) and significantly greater length (median 34 words vs. 6).
Statistics: DataConcept yields a final concept vocabulary of 12,253 unique concepts, approximately 486 million bounding boxes (mean $3.8$/image), and an extremely long-tailed concept count distribution (median frequency $\approx489$ , tail up to 21 million).

All annotations—including tags, bounding boxes, confidences, and recaptions—are publicly released.

2. Formal Definition and Core Variants of CABS

CABS replaces IID mini-batches with dynamically filtered, concept-aware subbatches during pretraining. The process is formalized as follows:

Two-Stage Batch Selection:

Draw a "superbatch" $S = \{i_1, ..., i_B\}$ uniformly at random from the pool.
Select a subbatch $S_{sub} \subset S$ of size $b = \lfloor (1-f)\cdot B \rfloor$ , filtering $f$ fraction of samples using a sample-scoring heuristic $h$ based on concept metadata $C_i \subset V$ for each sample.

Diversity Maximization (CABS-DM): Suitable for classification objectives, CABS-DM constructs batches with broad concept coverage. For each concept $c$ , running counts $n_c$ and per-concept caps $t_c$ are maintained. The marginal gain for sample $i$ is:

$h_{DM}(i) = \frac{1}{|C_i|} \sum_{c \in C_i} \left[ \frac{t_c - n_c}{t_c} + \frac{1}{F_c} \right] \cdot 1_{n_c < t_c}$

where $F_c$ is the pool frequency of $c$ , $\frac{t_c-n_c}{t_c}$ prioritizes under-represented concepts, and $\frac{1}{F_c}$ upweights rare concepts.

Frequency Maximization (CABS-FM): For retrieval tasks, batches are curated purely by scene object multiplicity. The scoring heuristic is $h_{FM}(i) = |C_i|$ ; the top- $b$ samples with highest concept count per image are selected.
Probabilistic Formulation: Both variants can be interpreted as assigning each sample a weight $w_i = \exp(\beta h(i))$ and sampling batches $S_{sub}$ proportionally to $\prod_{i \in S_{sub}} w_i$ .

3. Algorithmic Procedures and Data Structures

Efficient implementation of CABS is enabled by tailored data structures and stepwise procedures:

def BUILD_CABS_BATCH(superbatch_S, filter_ratio_f, heuristic_h, parameters_θ):
    B = len(S)
    b = int((1 - f) * B)
    scores = [h(C_i, θ) for i in S]
    I = indices_of_top_b_scores(scores)
    return [S[i] for i in I]

CABS-DM: Uses an incrementally updated max-heap over $B$ candidates, updating $n_c$ in $O(|C_i|)$ per sample and repeating $b$ iterations.
CABS-FM: Executes a single $O(B \log B)$ sort by concept multiplicity.

Typical hyperparameters: superbatch size $B=20,480$ , subbatch size $b=4,096$ , filter ratio $f=0.8$ (retains 20%), concept caps chosen so $\sum_c t_c \approx b$ , smoothness temperature $\beta=1$ , and pre-stored concept lists per sample ($3-5$ concepts/image).

Key data structures include $n_c \in \mathbb{N}^{|V|}$ , a max-heap for gains, and an inverted concept-to-sample index.

4. Theoretical Properties and Computational Complexity

CABS-DM optimizes a submodular coverage function blended with a rarity term. The greedy Top-K selection guarantees a $(1-1/e)$ -approximation to the optimal submodular coverage objective. Per-iteration computational complexity is:

CABS-DM: $O(b (|C|_{avg} + \log B)) \approx O(b |C|_{avg} \log B)$ (worst case).
CABS-FM: $O(B \log B + b)$ .

Space complexity is $O(B + |V| +$ total number of concept annotations in $S)$ .

A plausible implication is that CABS-DM benefits model generalization due to improved long-tail concept exposure, while CABS-FM enhances retrieval by focusing on compositional scene variety.

5. Empirical Results and Comparative Evaluation

CABS was evaluated by pretraining CLIP and SigLIP models on DataConcept (128M samples) and testing on 28 zero-shot classification and 2 image–text retrieval benchmarks.

Classification Performance (CABS-DM):

Model	Sampling	ImageNet Top-1	Avg Top-1 (25 DS)	Let-It-Wag!
CLIP ViT-B/32 (alt-text)	IID	17.3%	28.2%	5.1%
CLIP ViT-B/32	CABS-DM	21.9% (+4.6%)	30.7% (+2.5%)	7.5% (+2.4%)
SigLIP ViT-B/16	IID	17.2%	26.4%	—
SigLIP ViT-B/16	CABS-DM	24.1% (+6.9%)	30.9% (+4.5%)	—

Image–Text Retrieval (CABS-FM):

Model	Sampling	Recall@1 (COCO+Flickr)
CLIP ViT-B/32	IID	12.9%
CLIP ViT-B/32	CABS-FM	16.4% (+3.5%)
SigLIP ViT-B/16	IID	15.0%
SigLIP ViT-B/16	CABS-FM	18.1% (+3.1%)

Training on synthetic recaptions, rather than raw alt-text, yields further increases (+9.0% and +4.6%).

CABS-DM distinctly outperforms MetaCLIP (offline concept balancing) by +3.8% (ImageNet) and +1.8% (Let-It-Wag), while GRIT-VLP / MAFA (hard-negative sampling) rarely surpass IID sampling.

Filter ratio ablations demonstrate optimal results at $f=0.8$ , and switching from IID to CABS mid-training gives an additional +3–4% gain.

6. Practical Recommendations and Limitations

CABS-DM is recommended for single-object tasks to ensure uniform concept representation and extend generalization, particularly in long-tail distributions.
CABS-FM is preferable for compositional or multi-object retrieval tasks, biasing batches toward scenes with high object multiplicity.
Recaptioned data ( $R_i$ ) should be used for both classification and retrieval, offering an empirical boost of 4–12%.
Limitations:
- Up-front concept annotation (RAM++, GroundingDINO, WBF) introduces additional cost, though this can be amortized across experiments.
- Runtime overhead increases with superbatch size $B$ and filter ratio $f$ ; $f \leq 0.8$ is recommended.
- No current evaluation of scalability to larger architectures (e.g., ViT-L) or extremely long pretraining regimes ( $10^{11}$ tokens).

Potential extensions include hybrid multi-objective batch sampling, curriculum scheduling (interpolating between DM and FM), adaptive concept caps, and integration of concept difficulty measures (such as detection confidence) into scoring heuristics.

7. Contextual Significance and Future Directions

CABS and DataConcept together advance transparent, reproducible, and effective online data curation for vision–LLM pretraining. By defining customizable concept distributions at batch level, practitioners can flexibly steer pretraining objectives toward either broad coverage or high compositionality, depending on task requirements. The methodology provides clear gains over both static, concept-agnostic curation and established concept-balancing or hard-negative sampling baselines. Further research may explore algorithmic enhancements, integration with ultra-large architectures, or adaptive curricula over longer pretraining budgets (Ghosh et al., 25 Nov 2025).

PDF Markdown Chat (Pro)

References (1)

Concept-Aware Batch Sampling Improves Language-Image Pretraining (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to Concept-Aware Batch Sampling (CABS).