
Concept-Aware Batch Sampling (CABS)

Updated 27 November 2025
  • Concept-Aware Batch Sampling (CABS) is a dynamic batch curation paradigm that replaces uniform IID sampling with online, concept-driven filtering to optimize vision–language pretraining.
  • It leverages fine-grained concept annotations from the DataConcept dataset and employs a two-stage selection process with variants like CABS-DM for diversity and CABS-FM for frequency.
  • Empirical results demonstrate that CABS significantly improves classification and retrieval benchmarks by enhancing long-tail concept exposure and compositional scene diversity.

Concept-Aware Batch Sampling (CABS) is a batch curation paradigm developed for vision–language model pretraining that replaces uniform IID sampling with online, concept-driven filtering. This approach leverages fine-grained concept annotations to construct batches that optimize for downstream objectives, such as broad concept coverage or scene compositionality. CABS builds upon a web-scale annotated dataset, DataConcept, to facilitate flexible, task-adaptive pretraining and serves as an open-source alternative to proprietary online curation algorithms (Ghosh et al., 25 Nov 2025).

1. DataConcept: Foundation for Concept-Aware Sampling

DataConcept is a web-scale pool comprising 128 million image–text pairs, each annotated with fine-grained object categories and their spatial localizations. The construction pipeline involves:

  • Concept Bank Assembly: Merging lists from RAM++ (4,585 tags), V3Det, OpenImages, and other sources yields an open-vocabulary bank $V$ of approximately 19,261 candidate concepts after synonym clustering (cosine threshold 0.95), WordNet-based merging, and exclusion of unsafe terms (a minimal clustering sketch appears at the end of this section).
  • Automated Tagging: Using RAM++, each sample $(I_i, T_i)$ is tagged with a preliminary concept set $C^0_i \subset V$ (confidence $> 0.75$).
  • Multi-Resolution Grounding: GroundingDINO is run at four resolutions $\{384, 512, 800, 1000\}$, prompted only with the tags in $C^0_i$; bounding boxes are retained if both the box score and the text-alignment score exceed $0.27$. Weighted Box Fusion (WBF) and optional post-NMS pruning (IoU $> 0.50$) are applied.
  • Concept-Aware Recaptioning: Qwen2-VL-7B generates recaptions conditioned on both the alt-text $T_i$ and the detected concept set $C_i$. These recaptions $R_i$ achieve higher concept coverage (51% exact concept match vs. 3.9% in raw alt-text) and are substantially longer (median 34 words vs. 6).
  • Statistics: DataConcept yields a final vocabulary of 12,253 unique concepts, approximately 486 million bounding boxes (mean 3.8 per image), and an extremely long-tailed concept count distribution (median frequency $\approx 489$, tail up to 21 million).

All annotations—including tags, bounding boxes, confidences, and recaptions—are publicly released.
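
As an illustration of the Concept Bank Assembly step, the synonym clustering at cosine threshold 0.95 can be sketched as a greedy deduplication pass. This is a hypothetical reconstruction, not the authors' code; embed_concepts stands in for whatever text encoder is used, which the summary does not specify:

import numpy as np

def dedup_concepts(concepts, embed_concepts, cos_threshold=0.95):
    # Greedily keep one representative string per synonym cluster: a concept is
    # dropped if its embedding is within cos_threshold of an already-kept one.
    vecs = embed_concepts(concepts)                       # hypothetical encoder -> (N, d) array
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    kept, kept_vecs = [], []
    for concept, v in zip(concepts, vecs):
        if kept_vecs and float(np.max(np.stack(kept_vecs) @ v)) >= cos_threshold:
            continue                                      # synonym of an existing representative
        kept.append(concept)
        kept_vecs.append(v)
    return kept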

2. Formal Definition and Core Variants of CABS

CABS replaces IID mini-batches with dynamically filtered, concept-aware subbatches during pretraining. The process is formalized as follows:

  • Two-Stage Batch Selection:
  1. Draw a "superbatch" $S = \{i_1, \dots, i_B\}$ uniformly at random from the pool.
  2. Select a subbatch $S_{sub} \subset S$ of size $b = \lfloor (1-f) \cdot B \rfloor$, filtering out a fraction $f$ of the samples with a sample-scoring heuristic $h$ based on the concept metadata $C_i \subset V$ of each sample.
  • Diversity Maximization (CABS-DM): Suited to classification objectives, CABS-DM constructs batches with broad concept coverage. For each concept $c$, a running count $n_c$ and a per-concept cap $t_c$ are maintained. The marginal gain of sample $i$ is:

$$h_{DM}(i) = \frac{1}{|C_i|} \sum_{c \in C_i} \left[ \frac{t_c - n_c}{t_c} + \frac{1}{F_c} \right] \cdot \mathbf{1}_{n_c < t_c}$$

where $F_c$ is the pool frequency of $c$, the term $\frac{t_c - n_c}{t_c}$ prioritizes concepts still under-represented in the current batch, and $\frac{1}{F_c}$ upweights rare concepts.

  • Frequency Maximization (CABS-FM): For retrieval tasks, batches are curated purely by scene object multiplicity. The scoring heuristic is $h_{FM}(i) = |C_i|$; the top-$b$ samples with the highest per-image concept count are selected.
  • Probabilistic Formulation: Both variants can be interpreted as assigning each sample a weight $w_i = \exp(\beta\, h(i))$ and sampling batches $S_{sub}$ with probability proportional to $\prod_{i \in S_{sub}} w_i$. Minimal code sketches of both heuristics follow this list.
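
The two heuristics reduce to a few lines each. The following sketch is illustrative rather than the authors' implementation; concepts, counts, caps, and pool_freq are stand-in names for $C_i$, $n_c$, $t_c$, and $F_c$. Under the probabilistic view, each score would additionally be mapped to a weight $\exp(\beta\,h(i))$ before sampling:

def h_dm(concepts, counts, caps, pool_freq):
    # Diversity-maximization gain h_DM(i): average, over the sample's concepts,
    # of the under-representation term (t_c - n_c)/t_c plus the rarity term 1/F_c,
    # counting only concepts whose cap t_c has not yet been reached.
    if not concepts:
        return 0.0
    gain = 0.0
    for c in concepts:
        n_c = counts.get(c, 0)
        if n_c < caps[c]:
            gain += (caps[c] - n_c) / caps[c] + 1.0 / pool_freq[c]
    return gain / len(concepts)

def h_fm(concepts):
    # Frequency-maximization score h_FM(i) = |C_i|: the per-image concept count.
    return len(concepts)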

3. Algorithmic Procedures and Data Structures

Efficient implementation of CABS is enabled by tailored data structures and stepwise procedures:

def build_cabs_batch(superbatch, filter_ratio, heuristic, params):
    # superbatch: list of samples, each carrying its concept set C_i as `sample.concepts`
    # filter_ratio: fraction f of samples to drop; heuristic: h(C_i, params) -> score
    B = len(superbatch)
    b = int((1 - filter_ratio) * B)                       # subbatch size b = floor((1 - f) * B)
    scores = [heuristic(s.concepts, params) for s in superbatch]
    top_b = sorted(range(B), key=scores.__getitem__, reverse=True)[:b]
    return [superbatch[i] for i in top_b]

  • CABS-DM: Uses an incrementally updated max-heap over the $B$ candidates, updating $n_c$ in $O(|C_i|)$ per selected sample across $b$ iterations.
  • CABS-FM: Executes a single $O(B \log B)$ sort by concept multiplicity.

Typical hyperparameters: superbatch size $B = 20{,}480$, subbatch size $b = 4{,}096$, filter ratio $f = 0.8$ (retaining 20% of each superbatch), concept caps chosen so that $\sum_c t_c \approx b$, smoothness temperature $\beta = 1$, and pre-stored concept lists per sample (3–5 concepts per image).
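
For illustration only, a single curation step with these typical settings might look as follows; draw_uniform_superbatch, pool, and the inline CABS-FM heuristic are hypothetical stand-ins, not interfaces from the paper's release:

superbatch = draw_uniform_superbatch(pool, size=20_480)      # hypothetical IID sampler over the pool
batch = build_cabs_batch(
    superbatch,
    filter_ratio=0.8,                                        # f = 0.8, i.e. keep roughly 20% of the superbatch
    heuristic=lambda concepts, _params: len(concepts),       # CABS-FM scoring h_FM(i) = |C_i|
    params=None,
)                                                            # yields a training batch of about 4,096 samples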

Key data structures include a running count vector $n \in \mathbb{N}^{|V|}$, a max-heap over marginal gains, and an inverted concept-to-sample index.
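
A minimal sketch of the greedy CABS-DM selection loop built on these structures, assuming a lazily updated max-heap (Python's heapq with negated gains); cabs_dm_select, concepts_of, caps, and pool_freq are illustrative names, not taken from the released code:

import heapq

def cabs_dm_select(superbatch, b, concepts_of, caps, pool_freq):
    # superbatch: list of sample indices; concepts_of(i) returns the concept set C_i.
    counts = {}                                              # running counts n_c

    def gain(i):
        cs = concepts_of(i)
        if not cs:
            return 0.0
        g = sum((caps[c] - counts.get(c, 0)) / caps[c] + 1.0 / pool_freq[c]
                for c in cs if counts.get(c, 0) < caps[c])
        return g / len(cs)

    heap = [(-gain(i), i) for i in superbatch]               # max-heap via negated gains
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < b:
        _, i = heapq.heappop(heap)
        fresh = gain(i)                                      # gains decay as counts fill up
        if heap and fresh < -heap[0][0]:                     # stale entry: re-insert with fresh gain
            heapq.heappush(heap, (-fresh, i))
            continue
        selected.append(i)
        for c in concepts_of(i):                             # O(|C_i|) count update per selection
            counts[c] = counts.get(c, 0) + 1
    return selected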

4. Theoretical Properties and Computational Complexity

CABS-DM optimizes a submodular coverage function blended with a rarity term: as the running counts $n_c$ grow, each concept's remaining contribution shrinks, so marginal gains are diminishing and the objective is submodular. The greedy top-$k$ selection therefore guarantees a $(1 - 1/e)$-approximation to the optimal submodular coverage objective. Per-iteration computational complexity is:

  • CABS-DM: $O(b\,(|C|_{avg} + \log B))$, or $O(b\,|C|_{avg}\log B)$ in the worst case.
  • CABS-FM: $O(B \log B + b)$.

Space complexity is $O(B + |V| + \text{total number of concept annotations in } S)$.

A plausible implication is that CABS-DM benefits model generalization due to improved long-tail concept exposure, while CABS-FM enhances retrieval by focusing on compositional scene variety.

5. Empirical Results and Comparative Evaluation

CABS was evaluated by pretraining CLIP and SigLIP models on DataConcept (128M samples) and testing on 28 zero-shot classification and 2 image–text retrieval benchmarks.

Classification Performance (CABS-DM):

| Model | Sampling | ImageNet Top-1 | Avg Top-1 (25 datasets) | Let-It-Wag! |
|---|---|---|---|---|
| CLIP ViT-B/32 (alt-text) | IID | 17.3% | 28.2% | 5.1% |
| CLIP ViT-B/32 | CABS-DM | 21.9% (+4.6%) | 30.7% (+2.5%) | 7.5% (+2.4%) |
| SigLIP ViT-B/16 | IID | 17.2% | 26.4% | — |
| SigLIP ViT-B/16 | CABS-DM | 24.1% (+6.9%) | 30.9% (+4.5%) | — |

Image–Text Retrieval (CABS-FM):

| Model | Sampling | Recall@1 (COCO + Flickr) |
|---|---|---|
| CLIP ViT-B/32 | IID | 12.9% |
| CLIP ViT-B/32 | CABS-FM | 16.4% (+3.5%) |
| SigLIP ViT-B/16 | IID | 15.0% |
| SigLIP ViT-B/16 | CABS-FM | 18.1% (+3.1%) |

Training on synthetic recaptions, rather than raw alt-text, yields further increases (+9.0% and +4.6%).

CABS-DM distinctly outperforms MetaCLIP (offline concept balancing) by +3.8% (ImageNet) and +1.8% (Let-It-Wag!), while GRIT-VLP / MAFA (hard-negative sampling) rarely surpass IID sampling.

Filter ratio ablations show optimal results at $f = 0.8$, and switching from IID to CABS mid-training yields an additional +3–4% gain.
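
A minimal sketch of that mid-training switch, assuming a generic training loop; total_steps, switch_step, train_step, and draw_uniform_superbatch are illustrative names, not from the paper:

for step in range(total_steps):
    superbatch = draw_uniform_superbatch(pool, size=20_480)
    if step < switch_step:
        batch = superbatch[:4_096]                           # the draw is uniform, so any slice is an IID batch
    else:
        batch = build_cabs_batch(superbatch, filter_ratio=0.8,
                                 heuristic=lambda c, _p: len(c),  # CABS-FM; substitute h_DM for classification runs
                                 params=None)
    train_step(model, batch)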

6. Practical Recommendations and Limitations

  • CABS-DM is recommended for single-object tasks to ensure uniform concept representation and extend generalization, particularly in long-tail distributions.
  • CABS-FM is preferable for compositional or multi-object retrieval tasks, biasing batches toward scenes with high object multiplicity.
  • Recaptioned data ($R_i$) should be used for both classification and retrieval, offering an empirical boost of 4–12%.
  • Limitations:
    • Up-front concept annotation (RAM++, GroundingDINO, WBF) introduces additional cost, though this can be amortized across experiments.
    • Runtime overhead grows with the superbatch size $B$ and filter ratio $f$; $f \leq 0.8$ is recommended.
    • Scalability to larger architectures (e.g., ViT-L) and extremely long pretraining regimes ($10^{11}$ tokens) has not yet been evaluated.

Potential extensions include hybrid multi-objective batch sampling, curriculum scheduling (interpolating between DM and FM), adaptive concept caps, and integration of concept difficulty measures (such as detection confidence) into scoring heuristics.

7. Contextual Significance and Future Directions

CABS and DataConcept together advance transparent, reproducible, and effective online data curation for vision–language model pretraining. By defining customizable concept distributions at the batch level, practitioners can flexibly steer pretraining objectives toward either broad coverage or high compositionality, depending on task requirements. The methodology provides clear gains over both static, concept-agnostic curation and established concept-balancing or hard-negative sampling baselines. Further research may explore algorithmic enhancements, integration with ultra-large architectures, or adaptive curricula over longer pretraining budgets (Ghosh et al., 25 Nov 2025).
