GranD: Pixel-Level Vision-Language Dataset
- GranD is a large-scale vision-language dataset featuring 11M images and over 810M pixel-level region annotations, making it one of the largest grounded vision-language resources available.
- The dataset employs a four-stage automated pipeline that extracts object attributes, relationships, scene graphs, and dense captions from each image.
- GranD supports diverse tasks including segmentation, captioning, grounded conversation, and phrase grounding through its robust, high-resolution annotations.
The Grounding-anything Dataset (GranD) is a large-scale, richly annotated vision–language dataset designed to support dense, pixel-level grounding of natural language in natural images. Developed to enable and benchmark models for visually grounded conversation, segmentation, captioning, and other fine-grained multimodal tasks, GranD uses an automated pipeline built on state-of-the-art vision models and large language models (LLMs) to provide region-level annotations, dense captions, relationship graphs, and contextual information for millions of images. Its scale, granularity, and automated construction distinguish it as a foundational resource for research at the intersection of computer vision and language modeling (Rasheed et al., 2023).
1. Dataset Construction Pipeline
GranD is constructed by annotating 11 million images (sourced from the SA-1B "Segment Anything" dataset) through a hierarchical, four-stage pipeline. Each stage employs specialized off-the-shelf models for distinct annotation tasks and uses the output of prior stages as input.
- Level 1: Object Localization & Attributes
- Multiple object detectors (Co-DETR, EVA-02, OWL-ViT, POMP) are run on each image.
- Class-agnostic non-maximum suppression merges the resulting bounding boxes, retaining a box only if at least two detectors agree on it.
- For each retained region, category and attribute labels are generated using GRiT or GPT4RoI, and per-region depth is estimated with MiDaS.
- Each region is matched to a segmentation mask from the SA-1B pool via IoU-based matching.
- Output: a list of detected regions with associated masks, categories, attributes, and depth. (A minimal sketch of the merging step follows this level.)
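To make the merging step concrete, here is a minimal sketch of class-agnostic NMS with a detector-agreement filter. The box format, thresholds, and function names are illustrative assumptions, not the pipeline's actual implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_detections(per_detector: List[List[Tuple[Box, float]]],
                     iou_thr: float = 0.5,
                     min_agree: int = 2) -> List[Box]:
    """Class-agnostic NMS over pooled detector outputs; keep a box only
    if boxes from at least `min_agree` distinct detectors support it."""
    pooled = [(box, score, det_idx)
              for det_idx, dets in enumerate(per_detector)
              for box, score in dets]
    pooled.sort(key=lambda t: t[1], reverse=True)  # highest score first
    kept, suppressed = [], [False] * len(pooled)
    for i, (box, _, det_idx) in enumerate(pooled):
        if suppressed[i]:
            continue
        supporters = {det_idx}  # detectors voting for this region
        for j in range(i + 1, len(pooled)):
            if not suppressed[j] and iou(box, pooled[j][0]) >= iou_thr:
                suppressed[j] = True
                supporters.add(pooled[j][2])
        if len(supporters) >= min_agree:  # consensus filter
            kept.append(box)
    return kept
```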
- Level 2: Relationships & Landmarks
- Scene descriptions are generated using BLIP-2 or LLaVA, producing a set of short captions per image.
- Noun phrases are extracted from these captions via spaCy and grounded to specific regions using MDETR.
- Relationship triplets of the form (subject, predicate, object) are assembled from the grounded phrases.
- LLaVA assigns each image one of four main landmark categories (indoor, outdoor, transportation, sports) plus an associated subcategory. (A phrase-extraction sketch follows this level.)
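A minimal sketch of the noun-phrase extraction step using spaCy's built-in noun_chunks; the downstream grounding call to MDETR is only indicated in a comment, since its interface in the pipeline is not documented here.

```python
import spacy

# Small English pipeline; GranD's exact spaCy model is not specified here.
nlp = spacy.load("en_core_web_sm")

def extract_noun_phrases(captions):
    """Pull candidate groundable noun phrases from scene captions."""
    phrases = []
    for caption in captions:
        doc = nlp(caption)
        phrases.extend(chunk.text for chunk in doc.noun_chunks)
    return phrases

captions = ["A brown dog chases a red ball across the park."]
print(extract_noun_phrases(captions))
# ['A brown dog', 'a red ball', 'the park'] -- each phrase would then be
# passed to MDETR to locate its matching region in the image.
```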
- Level 3: Scene Graph & Dense Captioning
- Hierarchical scene graphs organize objects and relations by depth (foreground → midground → background; see the layering sketch after this level).
- Dense region-level captions are generated by prompting Vicuna-13B with the scene graph, requiring at least three alternating text+mask references per caption.
- Automatic verification: object mentions in captions must align with the scene graph, with prompts iteratively modified until all objects are referenced.
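The depth layering can be sketched as simple binning over per-region depth estimates. The normalization convention and the two cut points below are assumptions; the paper's exact thresholds are not reproduced here.

```python
from collections import defaultdict

def layer_by_depth(regions, near_thr=0.33, far_thr=0.66):
    """Bin regions into foreground/midground/background by depth.
    Assumes depth normalized to [0, 1], smaller = closer to camera."""
    layers = defaultdict(list)
    for region in regions:
        d = region["depth"]
        if d < near_thr:
            layers["foreground"].append(region)
        elif d < far_thr:
            layers["midground"].append(region)
        else:
            layers["background"].append(region)
    return dict(layers)

regions = [
    {"id": 0, "category": "person", "depth": 0.12},
    {"id": 1, "category": "car",    "depth": 0.48},
    {"id": 2, "category": "sky",    "depth": 0.95},
]
print(layer_by_depth(regions))
```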
- Level 4: Extra Context
- Vicuna-13B is prompted with the scene graph and captions to produce a "beyond-the-image" context paragraph per image (e.g., historical facts, usage, or likely events); a prompt-assembly sketch follows.
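A sketch of how the Level-4 prompt might be assembled from the scene graph and captions. The template wording is hypothetical, not GranD's actual prompt.

```python
def build_context_prompt(scene_graph: dict, captions: list) -> str:
    """Compose an instruction asking an LLM for beyond-the-image context.
    The template is a hypothetical stand-in for GranD's actual prompt."""
    objects = ", ".join(obj["category"] for obj in scene_graph["objects"])
    relations = "; ".join(f"{s} {p} {o}"
                          for s, p, o in scene_graph["relations"])
    return (
        f"Scene objects: {objects}\n"
        f"Relations: {relations}\n"
        f"Captions: {' '.join(captions)}\n\n"
        "Write one paragraph of additional context about this scene, such "
        "as historical facts, typical usage, or likely events. Do not "
        "simply restate what is visible."
    )

prompt = build_context_prompt(
    {"objects": [{"category": "dog"}, {"category": "ball"}],
     "relations": [("dog", "chasing", "ball")]},
    ["A dog plays fetch in a park."])
```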
No human-in-the-loop is required beyond prompt engineering; all annotation steps use deterministic rules and model outputs (Rasheed et al., 2023).
2. Coverage, Statistics, and Data Structure
GranD provides annotations at a scale and level of detail summarized below:
| Statistic | Value | Notes |
|---|---|---|
| Images | 11,000,000 | Sourced from SA-1B pool |
| Regions (maskable instances) | 810,000,000 | Pixel-level, each with mask + label |
| Unique concepts | 7,500,000 | Distinct object labels |
| Referring expressions | 84,000,000 | Short queries for instance grounding |
| Short grounded captions | 22,000,000 | Region-level |
| Dense grounded captions | 11,000,000 | Per-image, interleaved with masks |
| Image-caption-mask triads (GCG subset) | 214,000 | For GCG fine-tuning/test/val |
Concept density is high, averaging approximately 73.6 unique concepts per image. Empirically, the region-area distribution is long-tailed: 60% of regions occupy less than 2% of the image area, 20% occupy 2–10%, and the remainder exceed 10%. Spatial density averages 0.015 regions per 1,000 pixels.
The taxonomy follows four main landmark categories (Indoor, Outdoor, Transportation, Sports) and their respective subcategories (e.g., Outdoor: Urban, Rural, Natural landscape).
GranD is released under a permissive academic license. Its directory structure is organized as:
- images/: original .jpg files
- masks/: per-region binary masks (run-length encoded .png)
- annotations/: one .json annotation file per image
Annotation schemas aggregate per-object (ID, category, attributes, depth, mask), relationships, region-level dense captions (with phrase–mask links), and image-level context.
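A sketch of walking this layout programmatically; all file-naming conventions and JSON keys below are assumptions based on the directory description, not the released loader.

```python
import json
from pathlib import Path

def load_annotation(root: Path, image_id: str) -> dict:
    """Pair an image with its per-image annotation JSON, following the
    images/ masks/ annotations/ layout described above."""
    with open(root / "annotations" / f"{image_id}.json") as f:
        ann = json.load(f)
    ann["image_path"] = root / "images" / f"{image_id}.jpg"
    # Each object record is assumed to reference its mask file under masks/.
    for obj in ann.get("objects", []):
        obj["mask_path"] = root / "masks" / obj["mask"]
    return ann
```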
3. Annotation Schema and Example
The annotation schema for a single image is structured as follows:
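Since the released schema is not reproduced here, the following is an illustrative Python rendering of one image's annotation record; every key name is an assumption inferred from the field list in Section 2.

```python
# Hypothetical annotation record for one image. Every key below is an
# assumption inferred from the schema description, not the released format.
annotation = {
    "image_id": "sa_000001",
    "objects": [
        {"id": 0, "category": "dog", "attributes": ["brown", "furry"],
         "depth": 0.21, "mask": "sa_000001_obj0.png"},
        {"id": 1, "category": "ball", "attributes": ["red"],
         "depth": 0.35, "mask": "sa_000001_obj1.png"},
    ],
    "relationships": [(0, "chasing", 1)],  # (subject_id, predicate, object_id)
    "landmark": {"category": "Outdoor", "subcategory": "Urban"},
    "dense_caption": {
        "text": "A brown dog chasing a red ball in the park.",
        "groundings": [  # phrase-to-mask links
            {"phrase": "A brown dog", "object_ids": [0]},
            {"phrase": "a red ball", "object_ids": [1]},
        ],
    },
    "extra_context": "City parks are common venues for off-leash play...",
}
```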
Phrase-to-mask associations enable dense, explicit grounding of language to precise pixel regions. Region relations, landmark tags, and scene-level context provide further multimodal structure.
4. Downstream Tasks and Evaluation
GranD supports four pretraining tasks for the Grounded LMM (GLaMM) model (Rasheed et al., 2023):
- Referring-Expression Segmentation: Given a referring phrase, segment the corresponding region in the image.
- Region-Level Captioning: Produce a caption for a queried bounding box or region.
- Image-Level Captioning: Generate a detailed global image caption.
- Grounded Conversation Generation (GCG): Generate a natural language response interleaved with segmentation masks, i.e., a dense caption string with each phrase tied to a corresponding mask (an output-format sketch follows this list).
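To make the GCG output concrete, the sketch below parses an interleaved response into a plain caption plus an ordered list of grounded phrases. The <p>…</p> phrase tags and [SEG] placeholder follow the grounded-output convention used by GLaMM; treat the exact token strings as illustrative.

```python
import re

# Hypothetical GCG-style response: each groundable phrase is wrapped in
# <p>...</p> and followed by a [SEG] placeholder that maps to one mask.
response = "<p>A man</p> [SEG] rides <p>a horse</p> [SEG] on the beach."

def parse_grounded_response(text: str):
    """Return the plain caption plus the ordered grounded phrases.
    The i-th phrase corresponds to the i-th predicted segmentation mask."""
    phrases = re.findall(r"<p>(.*?)</p>\s*\[SEG\]", text)
    caption = re.sub(r"</?p>|\s*\[SEG\]", "", text)
    return caption.strip(), phrases

caption, phrases = parse_grounded_response(response)
print(caption)  # "A man rides a horse on the beach."
print(phrases)  # ["A man", "a horse"]
```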
For GCG, the evaluation splits comprise 2,500 validation images and 5,000 test images. Metrics include:
- Caption quality: METEOR, CIDEr (n-gram matching)
- Mask quality: AP50 (average precision at an IoU threshold of 0.5), mean IoU (mIoU)
- Mask recall with text match: Recall = correctly grounded phrases / total GT phrases, where a phrase counts as correctly grounded only if its predicted mask overlaps the ground truth and its text matches under a BERTScore threshold (a computation sketch follows this list)
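A sketch of the recall computation, assuming boolean mask arrays and a stand-in for the BERTScore text match; both thresholds below are assumptions.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def text_sim(pred: str, gt: str) -> float:
    """Stand-in for a real BERTScore call (e.g. via the bert-score package)."""
    return 1.0 if pred.lower() == gt.lower() else 0.0

def grounded_recall(predictions, ground_truths,
                    iou_thr=0.5, sim_thr=0.5) -> float:
    """Recall = correctly grounded GT phrases / total GT phrases.
    A GT phrase is correctly grounded if some prediction matches it both
    in text (similarity >= sim_thr) and in mask overlap (IoU >= iou_thr).
    Items are dicts with 'phrase' (str) and 'mask' (bool array) keys."""
    hits = sum(
        any(text_sim(p["phrase"], gt["phrase"]) >= sim_thr
            and mask_iou(p["mask"], gt["mask"]) >= iou_thr
            for p in predictions)
        for gt in ground_truths
    )
    return hits / max(len(ground_truths), 1)
```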
For referring segmentation and captioning tasks, standard benchmarks and metrics (IoU, AP, SPICE) are used, following Visual Genome, Flickr30K, and the RefCOCO variants.
5. Applications, Extensions, and Significance
GranD’s broad and dense annotation base supports a variety of vision–language tasks:
- Referring Expression Segmentation: Segment any instance referenced by a short natural-language query.
- Dense and Region Captioning: Fine-grained image region annotation to enable use cases in retrieval, accessibility, and human-computer interaction.
- Visually Grounded Conversation: Multi-turn, object-referential dialogue where each response is tied to physical image regions.
- Phrase Grounding: Training and evaluating models to map arbitrary text phrases to segmentation masks.
- Generative Vision Tasks: Mask extraction with GLaMM enables conditional inpainting, editing, or diffusion modeling.
The dataset’s scale and annotation richness substantially increase the granularity and coverage available in region-level and pixel-level vision–language pretraining. By releasing 11 million richly annotated images with over 810 million masks, GranD provides a foundation for scalable research into grounded multimodal learning, embodied AI, and interactive image editing tasks (Rasheed et al., 2023).
6. Context and Future Developments
GranD addresses the scarcity of large-scale, pixel-precise grounding datasets necessary for robust vision–LLM pretraining and benchmarking. The automated, model-centric pipeline enables ongoing expansion and adaptation as new detectors and captioning models become available.
A plausible implication is that this approach—using large, diverse synthetic and human-authored detection/caption outputs in an automated framework—may become central in scaling and evaluating future multimodal systems. Ongoing work may extend GranD with more diverse scene types, finer-grained relation annotations, and integration of real-world image sources to further enhance generalization and domain transfer.