Grounding-100M Dataset Overview
- Grounding-100M is a multi-modal, multi-granularity corpus constructed by assembling and refining public image, video, and audio datasets for fine-grained grounding tasks.
- It employs a three-stage design, integrating coarse-grained pretraining, fine-grained alignment, and multi-turn instruction tuning to enhance local and global contextual understanding.
- Its construction relies on automated templates and human-in-the-loop augmentation. No comprehensive summary statistics are published, so the emphasis falls on qualitative diversity.
The dataset constructed for the GroundingGPT model is a multi-modal, multi-granularity corpus generated via a staged pipeline. It is intended to enable fine-grained grounding across images, video, and audio, addressing a limitation of prior multi-modal LLMs, which primarily model global correlations. The data is assembled by adapting and transforming existing public resources and structuring them into training curricula that emphasize both global and local alignment. Although the dataset is referenced throughout the training pipeline, there is no formally released dataset called “Grounding-100M,” and no precise figures for example counts, annotation totals, or summary statistics are disclosed. The constructed corpus, along with the associated code and models, is to be distributed through the GroundingGPT project homepage.
1. Data Provenance and Availability
The authors reuse and reformat a wide range of public image, video, and audio datasets for model training. Every stage of the data pipeline draws on either readily available multi-modal corpora or established grounding datasets, including but not limited to:
- LLaVA-Pretrain-595k (≈595,000 image–text pairs)
- Valley-Pretrain-703k (≈703,000 video–text pairs)
- WavCaps audio–text collection (size not specified)
- RefCOCO, RefCOCO+, RefCOCOg, Visual Genome (region grounding)
- DiDeMo, HiREST, Charades-STA, VGGSS (temporal grounding/video and sound localization)
- Flickr30K Entities, VCR, ActivityNet Captions, Clotho, as well as existing instruction banks (e.g., LLaVA-v1.5-mix-665k, Valley-Instruct-73k, Videochat-Instruct-11k).
No consolidated download link or dataset name is provided aside from the public statement that “code, dataset, and model [will be made] publicly available” at https://lzw-lzw.github.io/GroundingGPT.github.io/. There is no mention of a unified “Grounding-100M” dataset, nor are any aggregate sample or annotation counts tabulated (Li et al., 2024).
2. Three-Stage Multi-Granularity Design
The dataset organization and annotation are governed by a three-stage granularity framework:
- Stage 1 (Coarse-Grained Multi-Modal Pretraining): Uses global captions and multi-modal descriptions sampled from the LLaVA, Valley, and WavCaps pretraining corpora. Encoders are frozen; only adapters are updated at this stage.
- Stage 2 (Fine-Grained Alignment): Converts grounding and referring datasets to fine-grained dialog pairs that focus on region or timestamp localization. Each dialog consists of a single turn and contains an explicit localization, such as a textual bounding box “[x1,y1,x2,y2]” or a temporal segment “{t₁,t₂}.”
- Stage 3 (Multi-Granularity Instruction Tuning): Integrates both coarse- and fine-grained data through a mix of single- and multi-turn dialogs. Generation is based on sampling in-context examples from grounding datasets and augmenting them with system-prompted GPT-3.5 completions. Standard instruction corpora are also incorporated to maintain balanced coverage of global and local comprehension.
No summary statistics, ratios, or percentages are provided for the distribution among granularities or across modalities.
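The textual localization formats used in Stage 2 can be illustrated with a minimal sketch. The helper names, the choice of normalized coordinates, and the number of decimals are assumptions made for illustration; the source only states that boxes appear as “[x1,y1,x2,y2]” and temporal segments as “{t₁,t₂}”.

```python
import re
from typing import Optional, Tuple

# Assumption: coordinates and timestamps are plain non-negative decimals.
NUM = r"(\d+(?:\.\d+)?)"
BOX_RE = re.compile(r"\[\s*" + r"\s*,\s*".join([NUM] * 4) + r"\s*\]")
SPAN_RE = re.compile(r"\{\s*" + r"\s*,\s*".join([NUM] * 2) + r"\s*\}")


def format_box(x1: float, y1: float, x2: float, y2: float) -> str:
    """Render a bounding box as the textual form "[x1,y1,x2,y2]"."""
    return f"[{x1:.3f},{y1:.3f},{x2:.3f},{y2:.3f}]"


def format_span(t1: float, t2: float) -> str:
    """Render a temporal segment as the textual form "{t1,t2}"."""
    return f"{{{t1:.1f},{t2:.1f}}}"


def parse_box(text: str) -> Optional[Tuple[float, float, float, float]]:
    """Return the first well-ordered box found in text, or None."""
    match = BOX_RE.search(text)
    if match is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    if x1 >= x2 or y1 >= y2:
        return None  # degenerate or inverted box
    return (x1, y1, x2, y2)


def parse_span(text: str) -> Optional[Tuple[float, float]]:
    """Return the first well-ordered temporal segment found in text, or None."""
    match = SPAN_RE.search(text)
    if match is None:
        return None
    t1, t2 = (float(g) for g in match.groups())
    return (t1, t2) if t1 < t2 else None
```

Keeping localization as plain text means the same tokenizer and decoder handle both the answer wording and the coordinates, at the cost of needing a parser like the one above to recover numeric values for evaluation.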
3. Annotation Protocols and Construction Pipeline
The construction pipeline for the dataset includes several automated and manual components:
- Template-driven Annotation (Stage 2): GPT-3.5 is prompted using a curated question pool specific to grounding tasks. For each example, one template is selected at random, and placeholders (such as “<region>,” “<exp>,” “<time>,” “<event>”) are filled with ground-truth information from the source dataset. Post-generation filtering removes malformed outputs.
- Human-in-the-Loop Augmentation (Stage 3): Annotators provide several in-context examples, which guide GPT-3.5 to synthesize new multi-turn dialogs emphasizing both local (grounded) and global (summarization) understanding.
- Sampling and Mixing: Across all stages, data is blended using a sampling scheme that weights the current- and previous-stage corpora through a mixing coefficient in the training loss. The coefficient is smallest in Stage 1 and increases in subsequent stages.
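The blending of current- and previous-stage data can be pictured with a small sampler. This is only a sketch: it does not reproduce the loss formula from the paper, and the function name `mix_stage_data` and the parameter `prev_weight` are hypothetical stand-ins that treat the weighting as a per-example sampling probability.

```python
import random
from typing import List, Sequence


def mix_stage_data(
    current: Sequence[str],
    previous: Sequence[str],
    prev_weight: float,
    n: int,
    seed: int = 0,
) -> List[str]:
    """Draw n examples; each comes from `previous` with probability prev_weight.

    prev_weight is a hypothetical knob, not the coefficient from the paper.
    """
    if not 0.0 <= prev_weight <= 1.0:
        raise ValueError("prev_weight must lie in [0, 1]")
    rng = random.Random(seed)
    batch: List[str] = []
    for _ in range(n):
        # Fall back to the current corpus when there is no previous stage.
        use_previous = bool(previous) and rng.random() < prev_weight
        pool = previous if use_previous else current
        batch.append(rng.choice(pool))
    return batch
```

Replaying a fraction of earlier-stage data is a common way to limit forgetting of coarse-grained skills while fine-grained data is introduced; whether the authors implement the weighting at the sampling level or inside the loss is not stated here.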
No new statistical annotation metrics or quality markers are defined. Filtering is limited to rejecting malformed or nonconforming outputs.
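The template-driven step and the malformed-output filter can be sketched as follows. The template strings, function names, and filter rule are illustrative assumptions; only the placeholder names come from the description above, and the GPT-3.5 call itself is left out.

```python
import random
import re
from typing import Dict, List

# Hypothetical question pool; the real pool is not reproduced in this article.
TEMPLATES: List[str] = [
    "Where is <exp> in the image?",
    "Which object is located at <region>?",
    "When does <event> happen in the video?",
    "What happens during <time>?",
]
PLACEHOLDER_RE = re.compile(r"<(region|exp|time|event)>")


def fill_template(template: str, fields: Dict[str, str]) -> str:
    """Replace each placeholder with its ground-truth value (KeyError if missing)."""
    return PLACEHOLDER_RE.sub(lambda m: fields[m.group(1)], template)


def build_question(fields: Dict[str, str], rng: random.Random) -> str:
    """Pick one fillable template at random and fill it."""
    usable = [
        t for t in TEMPLATES
        if all(name in fields for name in PLACEHOLDER_RE.findall(t))
    ]
    if not usable:
        raise ValueError("no template matches the available fields")
    return fill_template(rng.choice(usable), fields)


def is_well_formed(text: str) -> bool:
    """Reject empty outputs and outputs that still contain a placeholder."""
    return bool(text.strip()) and PLACEHOLDER_RE.search(text) is None
```

For example, `build_question({"exp": "the red mug"}, random.Random(0))` can only select the first template, because it is the only one whose placeholders are all covered by the supplied fields.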
4. Sources, Tasks, and Annotation Formats
The dataset sources are task- and stage-specific, with each contributing different types of annotation:
| Stage | Source Datasets | Annotation Type |
|---|---|---|
| 1 | LLaVA, Valley, WavCaps | Global captions/descriptions |
| 2 | RefCOCO, Visual Genome, DiDeMo, Charades-STA, HiREST, VGGSS | Region/timestamp labels (“[x1,y1,x2,y2]”, “{t₁,t₂}”) |
| 3 | Flickr30K, VCR, ActivityNet, Clotho, instruction banks | Mixed (single-/multi-turn, global/local) |
All annotation for localization in Stage 2 utilizes explicit textual representations of image bounding boxes or video/audio time intervals. Instruction tuning in Stage 3 interleaves summative and fine-grained queries to foster multi-granular interpretability.
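One way to picture a Stage 3 record that interleaves summative and fine-grained queries is the data structure below. The class names, fields, and example dialog are hypothetical; the released data format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Turn:
    role: str         # "user" or "assistant"
    text: str
    granularity: str  # "global" (summary-level) or "local" (grounded)


@dataclass
class Dialog:
    source: str       # identifier of the originating example (hypothetical)
    modality: str     # "image", "video", or "audio"
    turns: List[Turn] = field(default_factory=list)

    def granularities(self) -> Set[str]:
        """Return the set of granularity labels used in this dialog."""
        return {t.granularity for t in self.turns}

    def is_multi_granular(self) -> bool:
        """True when the dialog mixes global and local turns."""
        return self.granularities() == {"global", "local"}


# Invented example for illustration only.
example = Dialog(
    source="hypothetical-image-0001",
    modality="image",
    turns=[
        Turn("user", "Describe the image briefly.", "global"),
        Turn("assistant", "A dog lies on a sofa next to a window.", "global"),
        Turn("user", "Where is the dog?", "local"),
        Turn("assistant", "The dog is at [0.120,0.430,0.580,0.910].", "local"),
    ],
)
```

Tagging each turn with its granularity makes it straightforward to check how global and local queries are balanced within a dialog, a ratio the source does not report.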
5. Quantitative Summaries and Metrics
The construction process is not accompanied by aggregate statistics such as total sample counts, per-modality breakdowns, annotation length distributions, bounding box area means or variances, or temporal segment duration summaries. No new dataset quality or diversity metrics (e.g., class entropy, per-class balance) are introduced or reported. The only reported formula governs the sampling-based training loss and its mixing coefficient, as described above.
The absence of tabulated statistics suggests a primary emphasis on procedural diversity and qualitative annotation coverage rather than quantified dataset composition. A plausible implication is that researchers will need to inspect the public data and scripts after release to obtain precise details.
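Once the data is available, the missing summaries could be computed directly from the textual annotations. The sketch below assumes the “[x1,y1,x2,y2]” and “{t₁,t₂}” formats described earlier with plain decimal numbers; the function names are hypothetical.

```python
import re
from statistics import mean, pvariance
from typing import Dict, Iterable, List, Optional

NUM = r"(\d+(?:\.\d+)?)"
BOX_RE = re.compile(r"\[\s*" + r"\s*,\s*".join([NUM] * 4) + r"\s*\]")
SPAN_RE = re.compile(r"\{\s*" + r"\s*,\s*".join([NUM] * 2) + r"\s*\}")


def box_areas(texts: Iterable[str]) -> List[float]:
    """Collect the area of every textual bounding box found in texts."""
    areas: List[float] = []
    for text in texts:
        for m in BOX_RE.finditer(text):
            x1, y1, x2, y2 = (float(g) for g in m.groups())
            areas.append(max(0.0, x2 - x1) * max(0.0, y2 - y1))
    return areas


def span_durations(texts: Iterable[str]) -> List[float]:
    """Collect the duration of every textual temporal segment found in texts."""
    durations: List[float] = []
    for text in texts:
        for m in SPAN_RE.finditer(text):
            t1, t2 = (float(g) for g in m.groups())
            durations.append(max(0.0, t2 - t1))
    return durations


def summarize(values: List[float]) -> Dict[str, Optional[float]]:
    """Return count, mean, and population variance (None when empty)."""
    if not values:
        return {"count": 0, "mean": None, "variance": None}
    return {
        "count": len(values),
        "mean": mean(values),
        "variance": pvariance(values),
    }
```

Running `summarize(box_areas(annotations))` over the answer strings would yield the kind of bounding box area mean and variance that the text notes are not reported.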
6. Access, Scope, and Limitations
The dataset, code, and model are to be made available at the GroundingGPT project homepage. The resource is constructed from and relies upon existing public datasets for both captions and explicit grounding tasks. There is no indication of a single, unified dataset file, and no officially named “Grounding-100M” corpus is released or described as such in the literature. No annotation benchmarking, coverage metrics, or summary tables are presented.
In summary, the GroundingGPT dataset pipeline exemplifies an on-the-fly, instruction-driven assembly of multi-modal, multi-granularity corpora aimed at fine-grained visual, temporal, and auditory grounding. Its contributions are procedural and integrative rather than statistical or archival, and its distribution is realized via code release rather than a static large-scale dataset (Li et al., 2024).