MRES-32M: Multi-Granularity RES Benchmark
- MRES-32M is a large-scale unified benchmark for multi-granularity referring expression segmentation, featuring both object- and part-level annotations.
- It employs a sophisticated, model-assisted pipeline using LVLMs, segmenters, and GPT-4 to generate pixel-accurate masks and contextual free-form captions.
- The dataset supports unified evaluation across fine-grained vision-language tasks with high annotation quality ensured by CLIP filtering and rigorous manual audits.
MRES-32M is the name of three distinct large-scale datasets from unrelated research areas: multi-spectral motion estimation, multi-granularity referring expression segmentation (RES), and recommender system evaluation. This entry covers the vision-language RES “MRES-32M,” the largest freely available benchmark for multi-granularity referring expression segmentation, as described in (Wang et al., 2023) and (Liu et al., 2 Apr 2025).
1. Definition and Scope
MRES-32M is a large-scale, unified benchmark for multi-granularity referring expression segmentation (MRES). It comprises 1,000,000 images and 32.2 million pixel-accurate segmentation masks, each linked to a free-form natural language expression. Unlike prior datasets, which provide only object-level masks, MRES-32M offers both object-level and part-level alignments, enabling training and evaluation of models on a spectrum from single-object references to precise part-level language grounding (Wang et al., 2023, Liu et al., 2 Apr 2025).
The dataset was constructed to address the limitations of preceding RES datasets (e.g., RefCOCO, RefCOCOg, PhraseCut), which rely on bounded object taxonomies, lack systematic part annotations, and deliver only tens of thousands of segmented instances. MRES-32M presents:
- 15.3 million object-level masks and captions
- 16.9 million part-level masks and captions
- 365 object categories and 2,299 part categories
- Approximately 32 masks and expressions per image, mean caption length ≈ 4.6 words
This scale—40× the image count and >200× the instance count of any prior public RES dataset—positions MRES-32M as the preeminent resource for developing and benchmarking models for fine-grained, multi-modal vision–language alignment.
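As a quick consistency check on the figures quoted above, the following sketch recomputes the total mask count, the per-image average, and the part-level share. The variable names are illustrative; only the three input numbers come from the text.

```python
# Figures quoted in the text above.
num_images = 1_000_000
object_masks = 15_300_000
part_masks = 16_900_000

# Derived quantities.
total_masks = object_masks + part_masks
masks_per_image = total_masks / num_images
part_share = part_masks / total_masks

print(f"total masks: {total_masks / 1e6:.1f}M")      # 32.2M
print(f"masks per image: {masks_per_image:.1f}")     # 32.2
print(f"part-level share: {part_share:.1%}")         # 52.5%
```

The result agrees with the stated 32.2 million masks and roughly 32 masks per image.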
2. Data Construction and Annotation Pipeline
MRES-32M employs a model-assisted automated pipeline to achieve comprehensive annotation coverage at both object and part levels. The pipeline, as formalized in (Wang et al., 2023, Liu et al., 2 Apr 2025), proceeds in several stages:
2.1 Multi-Grained Dense Captioner
A large vision–language model (LVLM; exemplars: Qwen-VL, MiniGPT-VL) is fine-tuned to operate as a dense captioner across three granularities:
- Image-level (COCO captions)
- Object-level (Visual Genome region-based)
- Part-level (Pascal-Part, PACO, PartImageNet via “PartName_of_ObjectName” templates)
Inputs consist of (image, normalized_bbox) pairs where […]. Hyperparameters typical for LVLM fine-tuning include the AdamW optimizer, a learning rate of […], batch sizes of 64–128, and 10–20 epochs per task with 1–2 warm-up epochs.
2.2 Object-level Mask & Caption Generation
For each Object365 image and annotated object bounding box:
- A promptable segmenter (e.g., SAM) extracts the pixel mask for the bbox.
- The dense captioner generates contextual free-form referring expressions.
- Output tuple: image_id, bbox, mask, caption, object category.
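The object-level stage can be sketched as a loop over annotated boxes. This is a minimal illustration, not the authors' implementation: `segment` and `caption` are hypothetical callables standing in for the promptable segmenter and the dense captioner, and the two toy functions below exist only so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

BBox = Sequence[float]   # [x_min, y_min, w, h]
Mask = List[List[int]]   # binary mask as rows of 0/1


@dataclass
class ObjectRecord:
    image_id: int
    bbox: BBox
    mask: Mask
    caption: str
    category: str


def build_object_records(
    image_id: int,
    image: List[List[int]],
    boxes: Sequence[Tuple[BBox, str]],
    segment: Callable[[List[List[int]], BBox], Mask],
    caption: Callable[[List[List[int]], BBox], str],
) -> List[ObjectRecord]:
    """Produce one record per annotated (bbox, category) pair."""
    records = []
    for bbox, category in boxes:
        mask = segment(image, bbox)   # hypothetical segmenter call
        text = caption(image, bbox)   # hypothetical captioner call
        records.append(ObjectRecord(image_id, bbox, mask, text, category))
    return records


def rectangle_segmenter(image, bbox):
    """Toy stand-in for a segmenter: marks every pixel inside the box."""
    height, width = len(image), len(image[0])
    x0, y0, w, h = (int(v) for v in bbox)
    return [
        [1 if x0 <= x < x0 + w and y0 <= y < y0 + h else 0 for x in range(width)]
        for y in range(height)
    ]


def fixed_captioner(image, bbox):
    """Toy stand-in for the dense captioner."""
    return "the object in the box"


toy_image = [[0] * 4 for _ in range(3)]  # 3 rows x 4 columns
records = build_object_records(
    7, toy_image, [([1, 0, 2, 2], "mug")], rectangle_segmenter, fixed_captioner
)
print(records[0].mask)
```

Passing the models in as callables keeps the record-building logic independent of which segmenter or captioner is used.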
2.3 Part-level Mask & Caption Generation
- An LLM (e.g., GPT-4) is prompted with each object category to enumerate a part vocabulary for that category.
- For every part in this vocabulary, an open-vocabulary segmenter (e.g., OVSeg) detects instances; the captioner then produces corresponding part-level referring expressions.
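The part-level stage adds one step before segmentation: looking up the part vocabulary of each object category. The sketch below assumes three hypothetical callables (`list_parts`, `segment_part`, `caption_part`) in place of the LLM, the open-vocabulary segmenter, and the captioner. The toy vocabulary reuses the "light lens" example from this entry; "housing" is a made-up second part for illustration.

```python
from typing import Callable, Dict, List

Mask = List[List[int]]  # binary mask as rows of 0/1


def build_part_records(
    image_id: int,
    image: object,
    object_category: str,
    list_parts: Callable[[str], List[str]],
    segment_part: Callable[[object, str, str], List[Mask]],
    caption_part: Callable[[object, Mask, str, str], str],
) -> List[Dict]:
    """Enumerate parts for a category, segment each, and caption each instance."""
    records = []
    for part in list_parts(object_category):
        # A part may have zero, one, or several instances in the image.
        for mask in segment_part(image, part, object_category):
            records.append(
                {
                    "image_id": image_id,
                    "object_category": object_category,
                    "part": part,
                    "mask": mask,
                    "caption": caption_part(image, mask, part, object_category),
                    "granularity": "part",
                }
            )
    return records


def toy_list_parts(object_category):
    """Stand-in for the LLM-generated part vocabulary."""
    vocabulary = {"traffic light": ["light lens", "housing"]}
    return vocabulary.get(object_category, [])


def toy_segment_part(image, part, object_category):
    """Stand-in segmenter: finds one instance of 'light lens', nothing else."""
    return [[[0, 1], [0, 0]]] if part == "light lens" else []


def toy_caption_part(image, mask, part, object_category):
    """Stand-in captioner."""
    return f"the {part} of the {object_category}"


part_records = build_part_records(
    1, None, "traffic light", toy_list_parts, toy_segment_part, toy_caption_part
)
print(part_records[0]["caption"])
```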
2.4 Quality Control
- CLIP-based filtering: Every (image crop, caption) pair is encoded with CLIP and retained only if the cosine similarity between the two embeddings meets the filtering threshold.
- Manual audits: Spot checks at the 20%, 50%, and 80% checkpoints of pipeline throughput, with empirical precision measured on a held-out 1% sample.
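The similarity filter can be illustrated on precomputed embeddings. This sketch assumes the image-crop and caption embeddings have already been produced by a CLIP encoder and are available as plain lists of floats. The field names `image_emb` and `text_emb` and the `threshold` argument are hypothetical, and the threshold value used in the demo is arbitrary, since the actual value is not given in this entry.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs: List[Dict], threshold: float) -> List[Dict]:
    """Keep only pairs whose image/text embeddings are similar enough."""
    return [
        pair
        for pair in pairs
        if cosine_similarity(pair["image_emb"], pair["text_emb"]) >= threshold
    ]


pairs = [
    {"caption": "aligned pair", "image_emb": [1.0, 0.0], "text_emb": [2.0, 0.0]},
    {"caption": "orthogonal pair", "image_emb": [1.0, 0.0], "text_emb": [0.0, 3.0]},
]
kept = filter_pairs(pairs, threshold=0.5)  # 0.5 is an arbitrary demo value
print([pair["caption"] for pair in kept])
```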
3. Structure, Schema, and Access
MRES-32M is released in COCO-inspired JSON format, extended for multi-granularity:
| Field | Description | Example/Notes |
|---|---|---|
| image_id | Unique integer ID for image | 12345 |
| file_name | Filename, JPEG format | "000012345.jpg" |
| height, width | Native image dimensions (typically 1024×1024) | – |
| annotations | List of mask records per image | – |
| id | Annotation ID | 5678 |
| bbox | [x_min, y_min, w, h] | – |
| category_id | Object or part category int ID | 42 |
| part_category_id | Nonzero for part; zero for object | 1076 (light lens) |
| segmentation | COCO RLE, binary mask, or polygon | – |
| caption | Free-form natural language expression | “the left grip” |
| granularity | “object” or “part” | “part” |
Auxiliary files: categories.json (365 objects, 2,299 parts), token-level vocabulary statistics, and per-image referencing statistics. Released under CC BY-NC 4.0 via https://github.com/Rubics-Xuan/MRES.
Standard preprocessing recommends pycocotools for RLE-to-mask conversion, CLIP or HuggingFace-based caption tokenization, and mask normalization to model input resolution.
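To make the schema concrete, the sketch below parses a toy image record and decodes one mask. The nesting (annotations stored inside each image record) is an assumption based on the table above, and the record is hand-written for illustration, with a 2×3 image so the mask can be read by eye. Only uncompressed COCO-style RLE is handled here (column-major runs, starting with a run of zeros); compressed RLE strings and polygons would typically go through pycocotools instead.

```python
import json
from typing import Dict, List

SAMPLE_RECORD = """
{
  "image_id": 12345,
  "file_name": "000012345.jpg",
  "height": 2,
  "width": 3,
  "annotations": [
    {"id": 5678, "bbox": [0, 0, 2, 2], "category_id": 42, "part_category_id": 1076,
     "segmentation": {"size": [2, 3], "counts": [1, 2, 3]},
     "caption": "the left grip", "granularity": "part"},
    {"id": 5679, "bbox": [0, 0, 3, 2], "category_id": 42, "part_category_id": 0,
     "segmentation": {"size": [2, 3], "counts": [0, 6]},
     "caption": "the bike", "granularity": "object"}
  ]
}
"""


def decode_uncompressed_rle(rle: Dict) -> List[List[int]]:
    """Decode uncompressed COCO-style RLE into rows of 0/1."""
    height, width = rle["size"]
    flat: List[int] = []
    value = 0  # runs alternate 0, 1, 0, 1, ... starting with zeros
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    if len(flat) != height * width:
        raise ValueError("RLE counts do not match the mask size")
    # Pixels are stored column by column, so pixel (y, x) is at x * height + y.
    return [[flat[x * height + y] for x in range(width)] for y in range(height)]


record = json.loads(SAMPLE_RECORD)
part_annotations = [a for a in record["annotations"] if a["granularity"] == "part"]
part_mask = decode_uncompressed_rle(part_annotations[0]["segmentation"])
print(part_annotations[0]["caption"], part_mask)
```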
4. Statistical Analysis and Dataset Properties
- Masks/image: Mean ≈32; sum across images yields 15.3M objects and 16.9M part masks.
- Caption length: Mean 4.6 words (σ≈1.8); Zipfian term distribution (top 100 words ≈45% of tokens).
- Category coverage: 365 coarse object classes, 2,299 LLM-derived part subclasses (e.g., “light lens” for “traffic light”).
- Mask area distribution: Median normalized area ≈0.04; long right tail indicative of dense part labeling.
- Annotation agreement: an mIoU criterion (value […]) is used for split filtering; the CLIP similarity statistic across the population before thresholding is […].
No explicit inter-annotator agreement, as annotation is model-driven, but manual audits confirm high mask–caption match rates.
5. Supported Tasks, Benchmarks, and Evaluation
MRES-32M enables, for the first time at scale:
- Part-Level RES: Segmentation of fine-grained visual parts given natural language expressions.
- Multi-Object RES: Grouped mask segmentation for expressions referencing multiple entities (“all the mugs on the table”).
- Unified and Omni-Level RES: Training and evaluation of models capable of resolving reference at any granularity.
Evaluation metrics (all standard in RES literature):
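- IoU_j: $\mathrm{IoU}_j = \frac{|P_j \cap G_j|}{|P_j \cup G_j|}$ for prediction $P_j$ and ground-truth mask $G_j$.
- mIoU: $\frac{1}{N}\sum_{j=1}^{N} \mathrm{IoU}_j$, the mean of per-sample IoU over $N$ samples.
- oIoU: $\frac{\sum_{j} |P_j \cap G_j|}{\sum_{j} |P_j \cup G_j|}$, the cumulative intersection over the cumulative union.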
- Generalized RES: Complementary cIoU, gIoU, N-acc.
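The difference between mIoU (mean of per-sample IoU) and oIoU (cumulative intersection over cumulative union) is easiest to see on a small example. The sketch below uses the standard definitions on flattened binary masks. Treating an empty prediction paired with an empty ground truth as IoU 1.0 is an assumption made here for completeness, not something specified in this entry.

```python
from typing import List, Sequence, Tuple

FlatMask = Sequence[int]  # flattened binary mask of 0/1 values


def intersection_and_union(pred: FlatMask, gt: FlatMask) -> Tuple[int, int]:
    """Pixel counts of the intersection and union of two binary masks."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter, union


def mean_iou(preds: List[FlatMask], gts: List[FlatMask]) -> float:
    """mIoU: average the IoU of each sample, so every sample counts equally."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        # Assumption: two empty masks agree perfectly.
        scores.append(inter / union if union > 0 else 1.0)
    return sum(scores) / len(scores)


def overall_iou(preds: List[FlatMask], gts: List[FlatMask]) -> float:
    """oIoU: total intersection over total union, so large masks dominate."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union > 0 else 1.0


preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 1, 1, 1]]
print(mean_iou(preds, gts))     # (1/2 + 1/4) / 2 = 0.375
print(overall_iou(preds, gts))  # (1 + 1) / (2 + 4), about 0.333
```

Because oIoU pools pixels across samples, it weights large masks more heavily, whereas mIoU gives small part masks the same weight as large object masks.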
Baseline benchmarks using UniRES and UniRES++ models on the RefCOCOm val, testA, and testB splits:
| Metric | val (part) | testA (part) | testB (part) | val (obj+part) | testA (obj+part) | testB (obj+part) |
|---|---|---|---|---|---|---|
| mIoU [%] | 19.6 | 16.4 | 25.2 | 34.3 | 27.8 | 41.7 |
Zero-shot generalization to classic RES (RefCOCO val/testA/testB) following MRES-32M pre-training: 71.2%, 74.8%, 66.0% mIoU (Wang et al., 2023, Liu et al., 2 Apr 2025).
6. Comparison to Prior Datasets
MRES-32M surpasses all prior vision–language and RES datasets in terms of granularity, instance scale, and annotation diversity.
| Dataset | Images | ObjInst | PartInst | ObjCats | PartCats | CapLen | Part-Level | Free-Form NL |
|---|---|---|---|---|---|---|---|---|
| ReferIt | 20k | 97k | – | 238 | – | 3.2 | ✗ | ✓ |
| RefCOCO(+/g) | ~25k | ~150k | – | 80 | – | 3.6/8.4 | ✗ | ✓ |
| PhraseCut | 49k | 219k | – | 80 | – | – | ✗ | Templated |
| Pascal-Part | 19k | 40k | 363k | – | 193 | – | ✓ (visual) | – |
| PACO | 20k | 260k | 641k | 75 | 456 | – | ✓ (tags) | – |
| MRES-32M | 1M | 15.3M | 16.9M | 365 | 2,299 | 4.6 | ✓ (V–L) | ✓ |
In both scale and semantic granularity, MRES-32M is the only published benchmark furnishing part masks with free-form, context-rich referring expressions suitable for training and evaluating unified RES models (Liu et al., 2 Apr 2025).
7. Access and Usage
MRES-32M is publicly available for research and non-commercial purposes under the CC BY-NC 4.0 license from https://github.com/Rubics-Xuan/MRES. Download scripts, COCO-format loaders, auxiliary metadata, and sample code (including pycocotools support) are maintained in the repository. Images are subject to licensing inherited from Object365 and COCO.
Preprocessing steps typical in recent work:
- Mask decoding: pycocotools or equivalent for RLE/polygon transformation.
- Caption tokenization: HuggingFace Tokenizers (e.g., CLIP tokenizer).
- Mask normalization: resizing to model input dimensions.
Recommended best practices include reserving 10% of the corpus for validation, CLIP-based simulation for custom splits, and adherence to the supplied evaluation protocols and split definitions for benchmark comparability.
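For custom experiments, a 10% validation holdout can be made reproducible by hashing image IDs rather than sampling at random. The following is one possible sketch, not the repository's split logic: `assign_split` and its `val_percent` parameter are hypothetical names, and published benchmark numbers should still use the supplied split definitions.

```python
import hashlib


def assign_split(image_id: int, val_percent: int = 10) -> str:
    """Deterministically map an image ID to 'train' or 'val'."""
    digest = hashlib.sha256(str(image_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "val" if bucket < val_percent else "train"


splits = [assign_split(image_id) for image_id in range(10_000)]
val_fraction = splits.count("val") / len(splits)
print(f"validation fraction: {val_fraction:.3f}")
```

Splitting by image ID keeps all masks of an image in the same partition and gives the same assignment on every run and machine.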
For information on the unrelated MRES-32M datasets in multi-spectral motion estimation (Dai et al., 2020) and offline recommender evaluation (Smucker et al., 2 Apr 2025), refer to their respective official documentation and publications.