Concept-Aware Batch Sampling (CABS)
- Concept-Aware Batch Sampling (CABS) is a dynamic batch curation paradigm that replaces uniform IID sampling with online, concept-driven filtering to optimize vision–language pretraining.
- It leverages fine-grained concept annotations from the DataConcept dataset and employs a two-stage selection process with variants like CABS-DM for diversity and CABS-FM for frequency.
- Empirical results demonstrate that CABS significantly improves performance on classification and retrieval benchmarks by enhancing long-tail concept exposure and compositional scene diversity.
Concept-Aware Batch Sampling (CABS) is a batch curation paradigm developed for vision–language model pretraining that replaces uniform IID sampling with online, concept-driven filtering. This approach leverages fine-grained concept annotations to construct batches that optimize for downstream objectives, such as broad concept coverage or scene compositionality. CABS builds upon a web-scale annotated dataset, DataConcept, to facilitate flexible, task-adaptive pretraining and serves as an open-source alternative to proprietary online curation algorithms (Ghosh et al., 25 Nov 2025).
1. DataConcept: Foundation for Concept-Aware Sampling
DataConcept is a web-scale pool comprising 128 million image–text pairs, each annotated with fine-grained object categories and their spatial localizations. The construction pipeline involves:
- Concept Bank Assembly: Merging lists from RAM++ (4,585 tags), V3Det, OpenImages, and other sources yields an open-vocabulary bank with approximately 19,261 candidate concepts after synonym clustering (cosine threshold 0.95), WordNet-based merging, and exclusion of unsafe terms.
- Automated Tagging: Using RAM++, each sample is tagged with a preliminary concept set, retaining only tags above a fixed confidence threshold.
- Multi-Resolution Grounding: GroundingDINO is executed at four input resolutions, seeded only with the retained RAM++ tags; bounding boxes are kept if box score and text alignment both exceed $0.27$. Weighted Box Fusion (WBF) and optional post-NMS pruning above a fixed IoU threshold are then applied.
- Concept-Aware Recaptioning: Qwen2-VL-7B generates recaptions conditioned on both the raw alt-text and the detected concept set. These recaptions achieve higher concept coverage (51% exact concept-match vs. 3.9% in raw alt-text) and are significantly longer (median 34 words vs. 6).
- Statistics: DataConcept yields a final vocabulary of 12,253 unique concepts and approximately 486 million bounding boxes (mean $3.8$/image), with an extremely long-tailed concept frequency distribution (the most frequent concepts appear up to 21 million times).
All annotations—including tags, bounding boxes, confidences, and recaptions—are publicly released.
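To make the released annotation format concrete, the sketch below models one DataConcept record in Python. The class and field names are hypothetical stand-ins chosen to mirror the annotation types described above; they are not an official schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GroundedBox:
    concept: str                              # tag grounded by GroundingDINO
    xyxy: Tuple[float, float, float, float]   # box coordinates after WBF
    box_score: float                          # detector confidence
    text_alignment: float                     # tag-to-region alignment score

@dataclass
class DataConceptRecord:
    image_url: str
    alt_text: str                             # raw web alt-text
    tags: List[str]                           # RAM++ concept tags
    boxes: List[GroundedBox] = field(default_factory=list)
    recaption: str = ""                       # Qwen2-VL-7B concept-aware recaption
```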
2. Formal Definition and Core Variants of CABS
CABS replaces IID mini-batches with dynamically filtered, concept-aware subbatches during pretraining. The process is formalized as follows:
- Two-Stage Batch Selection:
- Draw a "superbatch" $S$ of size $B$ uniformly at random from the pool.
- Select a subbatch of size $b = (1 - f)B$, filtering out a fraction $f$ of samples using a sample-scoring heuristic $h(C_i; \theta)$ computed from the concept metadata $C_i$ of each sample.
- Diversity Maximization (CABS-DM): Suitable for classification objectives, CABS-DM constructs batches with broad concept coverage. For each concept $c$, a running count $n_c$ and a per-concept cap $\kappa_c$ are maintained. The marginal gain for sample $i$ with concept set $C_i$ is
$g(i) = \sum_{c \in C_i} \mathbf{1}[n_c < \kappa_c] \, f_c^{-\tau}$,
where $f_c$ is the pool frequency of $c$, the indicator $\mathbf{1}[n_c < \kappa_c]$ prioritizes under-represented concepts, and the inverse-frequency weight $f_c^{-\tau}$ (with temperature $\tau$) upweights rare concepts.
- Frequency Maximization (CABS-FM): For retrieval tasks, batches are curated purely by scene object multiplicity. The scoring heuristic is $g(i) = |C_i|$; the top-$b$ samples with the highest concept count per image are selected.
- Probabilistic Formulation: Both variants can be interpreted as assigning each sample $i$ a nonnegative weight $w_i = h(C_i; \theta)$ and drawing batch members with probability proportional to $w_i$. Both heuristics are sketched in code after this list.
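The following is a minimal Python sketch of the two heuristics, assuming the reconstructed CABS-DM gain above; the container names (`counts`, `caps`, `pool_freq`) and the default $\tau$ are illustrative assumptions, not the paper's reference implementation.

```python
from typing import Dict, List

def gain_dm(concepts: List[str],
            counts: Dict[str, int],       # running per-concept counts n_c
            caps: Dict[str, int],         # per-concept caps kappa_c
            pool_freq: Dict[str, float],  # pool frequency f_c of each concept
            tau: float = 1.0) -> float:
    """CABS-DM marginal gain: count only concepts still under their cap,
    upweighting rare concepts via the inverse-frequency term f_c^{-tau}."""
    return sum(pool_freq[c] ** (-tau)
               for c in concepts
               if counts.get(c, 0) < caps.get(c, 1))

def gain_fm(concepts: List[str]) -> float:
    """CABS-FM score: plain object multiplicity of the scene."""
    return float(len(concepts))
```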
3. Algorithmic Procedures and Data Structures
Efficient implementation of CABS is enabled by tailored data structures and stepwise procedures:
```python
def build_cabs_batch(S, f, h, theta):
    """Two-stage CABS selection: score each sample in superbatch S with
    heuristic h, then keep the top b = (1 - f) * |S| scorers."""
    B = len(S)
    b = round((1 - f) * B)                                 # subbatch size (round guards float error)
    scores = [h(sample.concepts, theta) for sample in S]   # h(C_i, θ) per sample
    top_b = sorted(range(B), key=scores.__getitem__, reverse=True)[:b]
    return [S[i] for i in top_b]
```
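A brief usage sketch of the builder above with the CABS-FM heuristic and the reported 20%-retention filter ratio; the `SimpleNamespace` samples stand in for whatever sample objects carry the pre-stored concept lists.

```python
from types import SimpleNamespace

superbatch = [SimpleNamespace(concepts=["dog", "frisbee", "park"]),
              SimpleNamespace(concepts=["cat"]),
              SimpleNamespace(concepts=["car", "street", "person", "bike"]),
              SimpleNamespace(concepts=["tree", "sky"]),
              SimpleNamespace(concepts=["boat"])]

# CABS-FM scoring: concept multiplicity; f = 0.8 keeps the top 20%.
fm = lambda concepts, theta: len(concepts)
subbatch = build_cabs_batch(superbatch, f=0.8, h=fm, theta=None)
# round((1 - 0.8) * 5) = 1 sample kept: the four-concept street scene
```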
- CABS-DM: Uses an incrementally updated max-heap over candidate gains, with $O(\log B)$ per-sample updates, repeated for $b$ selection iterations.
- CABS-FM: Executes a single sort by concept multiplicity.
Typical hyperparameters: a superbatch of size $B$, a subbatch of size $b = (1 - f)B$, filter ratio $f = 0.8$ (retaining 20% of candidates), per-concept caps $\kappa_c$ sized so the subbatch can be filled, a smoothness temperature $\tau$, and pre-stored concept lists per sample ($3$–$5$ concepts/image).
Key data structures include the per-concept count table $n_c$, a max-heap over marginal gains, and an inverted concept-to-sample index.
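The sketch below puts the pieces of this section together for CABS-DM: a lazily updated max-heap (Python's `heapq` min-heap with negated keys) over marginal gains, running counts $n_c$, and the reconstructed gain from Section 2. It is an illustrative implementation under those assumptions, not the authors' released code.

```python
import heapq
from typing import Dict, List

def cabs_dm_select(concept_lists: List[List[str]],  # C_i per superbatch sample
                   b: int,                           # subbatch size
                   caps: Dict[str, int],             # per-concept caps kappa_c
                   pool_freq: Dict[str, float],      # pool frequencies f_c
                   tau: float = 1.0) -> List[int]:
    """Greedy CABS-DM selection with lazily updated heap entries."""
    counts: Dict[str, int] = {}                      # running counts n_c

    def gain(i: int) -> float:
        return sum(pool_freq[c] ** (-tau)
                   for c in concept_lists[i]
                   if counts.get(c, 0) < caps.get(c, 1))

    heap = [(-gain(i), i) for i in range(len(concept_lists))]
    heapq.heapify(heap)                              # max-heap via negated keys

    selected: List[int] = []
    while heap and len(selected) < b:
        neg_g, i = heapq.heappop(heap)
        g = gain(i)                                  # re-score lazily on pop
        if g < -neg_g and heap and g < -heap[0][0]:
            heapq.heappush(heap, (-g, i))            # stale entry: retry later
            continue
        selected.append(i)
        for c in concept_lists[i]:                   # update running counts n_c
            counts[c] = counts.get(c, 0) + 1
    return selected
```

Because gains only decrease as counts fill up, the lazy re-check on pop preserves the greedy order without rebuilding the heap each iteration.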
4. Theoretical Properties and Computational Complexity
CABS-DM optimizes a submodular coverage function blended with a rarity term. The greedy Top-$K$ selection guarantees a $(1 - 1/e)$-approximation to the optimal submodular coverage objective. Per-iteration computational complexity is:
- CABS-DM: $O(B \log B)$ (worst case, when heap maintenance touches every candidate's gain).
- CABS-FM: $O(B \log B)$ for a single sort.
Space complexity is $O(A)$, where $A$ is the total number of concept annotations in the superbatch $S$.
A plausible implication is that CABS-DM benefits model generalization due to improved long-tail concept exposure, while CABS-FM enhances retrieval by focusing on compositional scene variety.
5. Empirical Results and Comparative Evaluation
CABS was evaluated by pretraining CLIP and SigLIP models on DataConcept (128M samples) and testing on 28 zero-shot classification and 2 image–text retrieval benchmarks.
Classification Performance (CABS-DM):
| Model | Sampling | ImageNet Top-1 | Avg Top-1 (25 datasets) | Let-It-Wag! |
|---|---|---|---|---|
| CLIP ViT-B/32 (alt-text) | IID | 17.3% | 28.2% | 5.1% |
| CLIP ViT-B/32 | CABS-DM | 21.9% (+4.6%) | 30.7% (+2.5%) | 7.5% (+2.4%) |
| SigLIP ViT-B/16 | IID | 17.2% | 26.4% | — |
| SigLIP ViT-B/16 | CABS-DM | 24.1% (+6.9%) | 30.9% (+4.5%) | — |
Image–Text Retrieval (CABS-FM):
| Model | Sampling | Recall@1 (COCO+Flickr) |
|---|---|---|
| CLIP ViT-B/32 | IID | 12.9% |
| CLIP ViT-B/32 | CABS-FM | 16.4% (+3.5%) |
| SigLIP ViT-B/16 | IID | 15.0% |
| SigLIP ViT-B/16 | CABS-FM | 18.1% (+3.1%) |
Training on synthetic recaptions, rather than raw alt-text, yields further increases (+9.0% and +4.6%).
CABS-DM clearly outperforms MetaCLIP (offline concept balancing) by +3.8% (ImageNet) and +1.8% (Let-It-Wag!), while GRIT-VLP / MAFA (hard-negative sampling) rarely surpass IID sampling.
Filter ratio ablations demonstrate optimal results at $f = 0.8$, and switching from IID to CABS mid-training gives an additional +3–4% gain.
6. Practical Recommendations and Limitations
- CABS-DM is recommended for single-object tasks, ensuring uniform concept representation and improving generalization, particularly on long-tail distributions.
- CABS-FM is preferable for compositional or multi-object retrieval tasks, biasing batches toward scenes with high object multiplicity.
- Recaptioned data should be used for both classification and retrieval, offering an empirical boost of 4–12% (a task-dispatch sketch follows this list).
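As referenced above, a minimal dispatch sketch mapping these recommendations to scoring heuristics; the task labels are illustrative, and the DM scorer is simplified (caps omitted) to keep the example self-contained.

```python
from typing import Callable, Dict, List

# Simplified stand-ins for the Section 2 heuristics.
def gain_dm_simple(concepts: List[str], pool_freq: Dict[str, float]) -> float:
    return sum(pool_freq.get(c, 1.0) ** -1.0 for c in concepts)  # rarity-weighted coverage

def gain_fm(concepts: List[str]) -> float:
    return float(len(concepts))                                  # object multiplicity

def choose_scorer(task: str) -> Callable:
    """Map a downstream objective to the recommended CABS variant."""
    return {"classification": gain_dm_simple,   # broad long-tail coverage (CABS-DM)
            "retrieval": gain_fm}[task]         # compositional scenes (CABS-FM)
```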
- Limitations:
- Up-front concept annotation (RAM++, GroundingDINO, WBF) introduces additional cost, though this can be amortized across experiments.
- Runtime overhead increases with the superbatch size $B$ and the filter ratio $f$; $f \approx 0.8$ is recommended.
- No current evaluation of scalability to larger architectures (e.g., ViT-L) or extremely long pretraining regimes (token budgets beyond those reported).
Potential extensions include hybrid multi-objective batch sampling, curriculum scheduling (interpolating between DM and FM), adaptive concept caps, and integration of concept difficulty measures (such as detection confidence) into scoring heuristics.
7. Contextual Significance and Future Directions
CABS and DataConcept together advance transparent, reproducible, and effective online data curation for vision–language model pretraining. By defining customizable concept distributions at the batch level, practitioners can flexibly steer pretraining objectives toward either broad coverage or high compositionality, depending on task requirements. The methodology provides clear gains over both static, concept-agnostic curation and established concept-balancing or hard-negative sampling baselines. Further research may explore algorithmic enhancements, integration with ultra-large architectures, or adaptive curricula over longer pretraining budgets (Ghosh et al., 25 Nov 2025).