MRES-32M: Multi-Granularity RES Benchmark

Updated 2 April 2026

MRES-32M is a large-scale unified benchmark for multi-granularity referring expression segmentation, featuring both object- and part-level annotations.
It employs a sophisticated, model-assisted pipeline using LVLMs, segmenters, and GPT-4 to generate pixel-accurate masks and contextual free-form captions.
The dataset supports unified evaluation across fine-grained vision-language tasks with high annotation quality ensured by CLIP filtering and rigorous manual audits.

MRES-32M refers to three distinct, large-scale datasets developed for computer vision, vision-language segmentation, and recommender system evaluation. The name appears in three unrelated research areas: multi-spectral motion estimation systems, multi-granularity referring expression segmentation (RES), and recommendation system evaluation. This entry covers the vision-language RES dataset, the largest freely available benchmark for multi-granularity referring expression segmentation, as described in (Wang et al., 2023) and (Liu et al., 2 Apr 2025).

1. Definition and Scope

MRES-32M is a large-scale, unified benchmark for multi-granularity referring expression segmentation (MRES). It comprises 1,000,000 images and 32.2 million pixel-accurate segmentation masks each linked to a free-form natural language expression. Unlike prior datasets which only provide object-level masks, MRES-32M offers both object-level and part-level alignments, enabling training and evaluation of models on a spectrum from single-object references to precise part-level language grounding (Wang et al., 2023, Liu et al., 2 Apr 2025).

The dataset was constructed to address the limitations of preceding RES datasets (e.g., RefCOCO, RefCOCOg, PhraseCut), which rely on bounded object taxonomies, lack systematic part annotations, and deliver only tens of thousands of segmented instances. MRES-32M presents:

15.3 million object-level masks and captions
16.9 million part-level masks and captions
365 object categories and 2,299 part categories
Approximately 32 masks and expressions per image, mean caption length ≈ 4.6 words
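
As a quick consistency check on the figures listed above (using only the numbers stated in this entry), the object-level and part-level counts add up to the headline total, and dividing by the image count gives the per-image density:

$$15.3\,\text{M} + 16.9\,\text{M} = 32.2\,\text{M}, \qquad \frac{32.2 \times 10^{6}\ \text{masks}}{1.0 \times 10^{6}\ \text{images}} \approx 32\ \text{masks per image}.$$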

This scale—40× the image count and >200× the instance count of any prior public RES dataset—positions MRES-32M as the preeminent resource for developing and benchmarking models for fine-grained, multi-modal vision–language alignment.

2. Data Construction and Annotation Pipeline

MRES-32M employs a model-assisted automated pipeline to achieve comprehensive annotation coverage at both object and part levels. The pipeline, as formalized in (Wang et al., 2023, Liu et al., 2 Apr 2025), proceeds in several stages:

2.1 Multi-Grained Dense Captioner

A large vision-language model (LVLM; e.g., Qwen-VL, MiniGPT-VL) is fine-tuned to operate as a dense captioner across three granularities:

Image-level (COCO captions)
Object-level (Visual Genome region-based)
Part-level (Pascal-Part, PACO, PartImageNet via “PartName_of_ObjectName” templates)

Inputs consist of (image, normalized_bbox) pairs where $bbox \in [0,999]^4$. Hyperparameters typical for LVLM fine-tuning include the AdamW optimizer, a learning rate of $1\times10^{-5}$, batch sizes of 64–128, and 10–20 epochs per task with 1–2 warm-up epochs.
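
The normalized box format can be illustrated with a minimal sketch. It assumes pixel-space corner coordinates [x_min, y_min, x_max, y_max] and a simple floor-and-clamp quantization into 1,000 bins; the function name `normalize_bbox`, the corner convention, and the rounding rule are assumptions made for illustration, not details confirmed by the source.

```python
def normalize_bbox(bbox, width, height, bins=1000):
    """Quantize a pixel-space [x_min, y_min, x_max, y_max] box to integers in [0, bins - 1].

    Assumption: corner format and floor rounding; the source only states
    that each coordinate lies in [0, 999].
    """
    x0, y0, x1, y1 = bbox

    def scale(value, size):
        # Clamp so a coordinate on the far image border still maps to bins - 1.
        return min(bins - 1, max(0, int(value / size * bins)))

    return [scale(x0, width), scale(y0, height), scale(x1, width), scale(y1, height)]


# Example: a box covering the lower-right quadrant boundary of a 1024x1024 image.
print(normalize_bbox([512, 256, 768, 512], 1024, 1024))
```

Quantizing to a fixed integer range lets the box be written into the text prompt as four short tokens, independent of the native image resolution.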

2.2 Object-level Mask & Caption Generation

For each Objects365 image and annotated object bounding box:

A promptable segmenter (e.g., SAM) extracts the pixel mask for the bbox.
The dense captioner generates contextual free-form referring expressions.
Output tuple: {image_id, bbox, mask, caption, object category}.

2.3 Part-level Mask & Caption Generation

An LLM (e.g., GPT-4) is prompted with each object category $X$ to enumerate a part vocabulary $P_X$.
For every part $p \in P_X$, an open-vocabulary segmenter (e.g., OVSeg) detects instances; the captioner then produces corresponding part-level referring expressions.
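
The control flow of the part-level stage can be sketched as a loop over the part vocabulary. Everything model-specific is passed in as a callable: `list_parts`, `segment_parts`, and `caption_region` are hypothetical stand-ins for the LLM, the open-vocabulary segmenter, and the dense captioner, and the "part of object" query string is an assumed prompt format rather than the authors' documented one.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class PartRecord:
    image_id: int
    object_category: str
    part_name: str
    mask: Any
    caption: str


def generate_part_records(
    image_id: int,
    image: Any,
    object_category: str,
    list_parts: Callable[[str], Iterable[str]],         # hypothetical LLM wrapper: category -> part names
    segment_parts: Callable[[Any, str], Iterable[Any]],  # hypothetical segmenter wrapper: (image, query) -> masks
    caption_region: Callable[[Any, Any], str],           # hypothetical captioner wrapper: (image, mask) -> text
) -> List[PartRecord]:
    records = []
    for part in list_parts(object_category):
        # Assumed query format; the real prompt template may differ.
        query = f"{part} of {object_category}"
        for mask in segment_parts(image, query):
            caption = caption_region(image, mask)
            records.append(PartRecord(image_id, object_category, part, mask, caption))
    return records
```

Passing the three models in as callables keeps the sketch independent of any particular LLM or segmenter API, and makes it easy to exercise the loop with stubs.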

2.4 Quality Control

CLIP-based filtering: Every (image crop, caption) pair is encoded and retained only if cosine similarity $s \geq 0.5$ .
Manual audits: performed at the 20%, 50%, and 80% checkpoints of pipeline throughput, with empirical precision $>92\%$ on a held-out 1% sample.
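
The CLIP-based filter reduces to a cosine-similarity threshold once embeddings are available. The sketch below assumes the image-crop and caption embeddings have already been computed by some CLIP encoder and are supplied as row-aligned arrays; the function name `clip_filter` and the array layout are illustrative assumptions, and only the $s \geq 0.5$ threshold comes from the text.

```python
import numpy as np


def clip_filter(image_emb, text_emb, threshold=0.5):
    """Return (keep_mask, similarities) for row-aligned embedding pairs.

    Assumption: row i of image_emb and row i of text_emb belong to the same
    (image crop, caption) pair and come from a CLIP-style encoder.
    """
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    # L2-normalize each row so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = np.sum(image_emb * text_emb, axis=1)
    return sims >= threshold, sims


# Toy 2-D embeddings: identical, orthogonal, and 45 degrees apart.
keep, sims = clip_filter([[1, 0], [1, 0], [1, 1]], [[1, 0], [0, 1], [1, 0]])
print(keep.tolist())
```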

3. Structure, Schema, and Access

MRES-32M is released in COCO-inspired JSON format, extended for multi-granularity:

Field	Description	Example/Notes
image_id	Unique integer ID for image	12345
file_name	Filename, JPEG format	"000012345.jpg"
height, width	Native image dimensions (typically 1024×1024)	–
annotations	List of mask records per image	–
id	Annotation ID	5678
bbox	[x_min, y_min, w, h]	–
category_id	Object or part category int ID	42
part_category_id	Nonzero for part; zero for object	1076 (light lens)
segmentation	COCO RLE, binary mask, or polygon	–
caption	Free-form natural language expression	“the left grip”
granularity	“object” or “part”	“part”

Auxiliary files: categories.json (365 objects, 2,299 parts), token-level vocabulary statistics, and per-image referencing statistics. Released under CC BY-NC 4.0 via https://github.com/Rubics-Xuan/MRES.
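
To make the schema concrete, the sketch below parses one image record and separates object-level from part-level annotations. The nesting of `annotations` inside the image record follows the table above, but the exact file layout is an assumption; the bbox values and captions are invented for illustration, and the `segmentation` field is omitted for brevity.

```python
import json
from collections import defaultdict

# Illustrative record; field names follow the schema table, values are made up.
SAMPLE = """
{
  "image_id": 12345,
  "file_name": "000012345.jpg",
  "height": 1024,
  "width": 1024,
  "annotations": [
    {"id": 5678, "bbox": [100, 200, 50, 80], "category_id": 42,
     "part_category_id": 0, "caption": "the traffic light on the left",
     "granularity": "object"},
    {"id": 5679, "bbox": [110, 210, 20, 20], "category_id": 42,
     "part_category_id": 1076, "caption": "the top light lens",
     "granularity": "part"}
  ]
}
"""


def split_by_granularity(record):
    """Group an image's annotations by their 'granularity' field."""
    groups = defaultdict(list)
    for ann in record["annotations"]:
        groups[ann["granularity"]].append(ann)
    return dict(groups)


def xywh_to_xyxy(bbox):
    """Convert [x_min, y_min, w, h] to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


record = json.loads(SAMPLE)
groups = split_by_granularity(record)
for ann in groups["part"]:
    print(ann["caption"], xywh_to_xyxy(ann["bbox"]))
```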

Standard preprocessing recommends pycocotools for RLE-to-mask conversion, CLIP or HuggingFace-based caption tokenization, and mask normalization to model input resolution.

4. Statistical Analysis and Dataset Properties

Masks/image: Mean ≈32; sum across images yields 15.3M objects and 16.9M part masks.
Caption length: Mean 4.6 words (σ≈1.8); Zipfian term distribution (top 100 words ≈45% of tokens).
Category coverage: 365 coarse object classes, 2,299 LLM-derived part subclasses (e.g., “light lens” for “traffic light”).
Mask area distribution: Median normalized area ≈0.04; long right tail indicative of dense part labeling.
Annotation agreement: mIoU $= (1/N) \sum_n |P_n \cap G_n| / |P_n \cup G_n|$ is used for split filtering; CLIP similarity is computed across the whole population before thresholding.

No explicit inter-annotator agreement, as annotation is model-driven, but manual audits confirm high mask–caption match rates.

5. Supported Tasks, Benchmarks, and Evaluation

MRES-32M enables, for the first time at scale:

Part-Level RES: Segmentation of fine-grained visual parts given natural language expressions.
Multi-Object RES: Grouped mask segmentation for expressions referencing multiple entities (“all the mugs on the table”).
Unified and Omni-Level RES: Training and evaluation of models capable of resolving reference at any granularity.

Evaluation metrics (all standard in RES literature):

IoU_j: $|P_j \cap G_j| / |P_j \cup G_j|$ for prediction $P_j$ and ground truth $G_j$.
mIoU: $(1/N) \sum_j \mathrm{IoU}_j$.
oIoU: $\sum_j |P_j \cap G_j| / \sum_j |P_j \cup G_j|$.
Generalized RES: Complementary cIoU, gIoU, N-acc.
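
The difference between the two main metrics is easiest to see in code: mIoU averages per-sample IoU, while oIoU divides the cumulative intersection by the cumulative union, so large masks weigh more. This is a minimal NumPy sketch of the usual RES definitions; the function names are illustrative, and returning 0.0 when both masks are empty is an assumed convention, not one stated in the source.

```python
import numpy as np


def iou(pred, gt):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # assumed convention for two empty masks
    return float(np.logical_and(pred, gt).sum() / union)


def mean_iou(preds, gts):
    """mIoU: average of per-sample IoU values."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))


def overall_iou(preds, gts):
    """oIoU: total intersection area divided by total union area."""
    inter = sum(np.logical_and(np.asarray(p, bool), np.asarray(g, bool)).sum()
                for p, g in zip(preds, gts))
    union = sum(np.logical_or(np.asarray(p, bool), np.asarray(g, bool)).sum()
                for p, g in zip(preds, gts))
    return float(inter / union) if union else 0.0


preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
print(mean_iou(preds, gts), overall_iou(preds, gts))
```

In the toy example the small mask scores 0.5 and the large one 1.0; the larger mask dominates oIoU but counts equally in mIoU.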

Baseline results for the UniRES and UniRES++ models on the RefCOCOm val, testA, and testB splits:

Metric	val (part)	testA (part)	testB (part)	val (obj+part)	testA (obj+part)	testB (obj+part)
mIoU [%]	19.6	16.4	25.2	34.3	27.8	41.7

Zero-shot generalization to classic RES (RefCOCO val/testA/testB) following MRES-32M pre-training: 71.2%, 74.8%, 66.0% mIoU (Wang et al., 2023, Liu et al., 2 Apr 2025).

6. Comparison to Prior Datasets

MRES-32M surpasses all prior vision–language and RES datasets in terms of granularity, instance scale, and annotation diversity.

Dataset	Images	ObjInst	PartInst	ObjCats	PartCats	CapLen	Part-Level	Free-Form NL
ReferIt	20k	97k	–	238	–	3.2	✗	✓
RefCOCO(+/g)	~25k	~150k	–	80	–	3.6/8.4	✗	✓
PhraseCut	49k	219k	–	80	–	–	✗	Templated
Pascal-Part	19k	40k	363k	–	193	–	✓ (visual)	–
PACO	20k	260k	641k	75	456	–	✓ (tags)	–
MRES-32M	1M	15.3M	16.9M	365	2,299	4.6	✓ (V–L)	✓

In both scale and semantic granularity, MRES-32M is the only published benchmark furnishing part masks with free-form, context-rich referring expressions suitable for training and evaluating unified RES models (Liu et al., 2 Apr 2025).

7. Access and Usage

MRES-32M is publicly available for research and non-commercial purposes under the CC BY-NC 4.0 license from https://github.com/Rubics-Xuan/MRES. Download scripts, COCO-format loaders, auxiliary metadata, and sample code (including pycocotools support) are maintained in the repository. Images are subject to licensing inherited from Objects365 and COCO.

Preprocessing steps typical in recent work:

Mask decoding: pycocotools or equivalent for RLE/polygon transformation.
Caption tokenization: HuggingFace Tokenizers (e.g., CLIP tokenizer).
Mask normalization: resizing to model input dimensions.
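
For the mask-normalization step, one common choice is nearest-neighbor resampling, which keeps the mask strictly binary (bilinear interpolation would introduce fractional values along the edges). The sketch below is a plain NumPy version under that assumption; the function name `resize_mask_nearest` is illustrative, and the repository's own loaders may resize differently.

```python
import numpy as np


def resize_mask_nearest(mask, out_h, out_w):
    """Resize a 2-D binary mask to (out_h, out_w) with nearest-neighbor sampling."""
    mask = np.asarray(mask)
    in_h, in_w = mask.shape
    # For each output row/column, pick the source index it maps back to.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows[:, None], cols[None, :]]


small = np.array([[1, 0], [0, 1]], dtype=np.uint8)
print(resize_mask_nearest(small, 4, 4))
```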

Recommended best practices include reserving 10% of the corpus for validation, CLIP-based simulation for custom splits, and adherence to the supplied evaluation protocols and split definitions for benchmark comparability.
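
Where a custom 10% validation hold-out is needed, one reproducible approach is to assign each image by hashing its ID, so the split does not depend on file order or random seeds and all masks of an image stay in the same split. The sketch below is one such scheme; the function name `assign_split` and the use of SHA-256 are assumptions for illustration, and it is not a substitute for the supplied split definitions when reporting benchmark numbers.

```python
import hashlib


def assign_split(image_id, val_fraction=0.10):
    """Deterministically map an image ID to 'val' or 'train'."""
    digest = hashlib.sha256(str(image_id).encode("utf-8")).digest()
    # Interpret the first 8 bytes as a uniform number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "val" if bucket < val_fraction else "train"


counts = {"train": 0, "val": 0}
for image_id in range(10000):
    counts[assign_split(image_id)] += 1
print(counts)
```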

For information on the unrelated MRES-32M datasets in multi-spectral motion estimation (Dai et al., 2020) and offline recommender evaluation (Smucker et al., 2 Apr 2025), refer to their respective official documentation and publications.