Refine-30K: Dataset for Local Image Refinement
- Refine-30K is a comprehensive dataset that provides finely annotated samples for region-specific image refinement with both reference-based and reference-free modalities.
- It supports local detail restoration through precise segmentation masks, synthesized instructions, and high-resolution imagery for rigorous model training.
- The dataset underpins robust benchmarking with objective metrics and subjective evaluations, advancing fine-grained local image restoration research.
Refine-30K is a large-scale, supervised dataset specifically designed for training and evaluating region-specific image refinement models. Its primary focus is supporting the restoration of fine-grained local details within user-indicated regions while strictly preserving the background in all unedited areas. Refine-30K introduces both reference-based and reference-free modalities with high-resolution, multimodal data and finely annotated object and scribble masks, accompanied by explicit task instructions. It supports systematic benchmarking of fine-grained local image restoration tasks, providing a foundation for precise local refinement performance measurement and region-aware model development (Zhou et al., 8 Apr 2026).
1. Dataset Structure and Modalities
Refine-30K is deliberately constructed to address region-specific image refinement in both reference-guided and reference-free contexts. It contains 30,000 labeled samples split as follows:
| Subset | #Samples | Reference Image | Mask Type | Instruction Source |
|---|---|---|---|---|
| Reference-based | 20,000 | Yes | Scribble mask M | VLM description |
| Reference-free | 10,000 | No | Scribble mask M | VLM description |
Each sample is multi-component:
- Reference-based: each sample is a tuple of a degraded input image, a reference image depicting the same subject, the clean ground truth, a binary scribble mask M, and a VLM-generated text instruction.
- Reference-free: the same tuple without the external reference image, retaining all other modalities.
Masks strictly delimit the editable region. A bounding box around M, expanded by a fixed margin, is used for on-the-fly crop-and-resize during model training and inference, but the raw dataset contains full-image masks only.
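The crop-and-resize step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the margin of 8 pixels, and the 32-pixel output size are hypothetical, and nearest-neighbour resizing stands in for whatever interpolation the actual pipeline uses.

```python
import numpy as np


def mask_bbox(mask: np.ndarray, margin: int) -> tuple[int, int, int, int]:
    """Bounding box (top, left, bottom, right) of the nonzero mask pixels,
    expanded by `margin` and clipped to the image. Bottom/right are exclusive."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    h, w = mask.shape
    top = max(int(ys.min()) - margin, 0)
    left = max(int(xs.min()) - margin, 0)
    bottom = min(int(ys.max()) + 1 + margin, h)
    right = min(int(xs.max()) + 1 + margin, w)
    return top, left, bottom, right


def resize_nearest(arr: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize over the first two axes (keeps masks binary)."""
    h, w = arr.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return arr[rows][:, cols]


def crop_and_resize(image: np.ndarray, mask: np.ndarray, margin: int, size: int):
    """Crop image and mask to the expanded box, then resize both to size x size."""
    top, left, bottom, right = mask_bbox(mask, margin)
    image_crop = image[top:bottom, left:right]
    mask_crop = mask[top:bottom, left:right]
    box = (top, left, bottom, right)
    return resize_nearest(image_crop, size, size), resize_nearest(mask_crop, size, size), box


# Usage with a synthetic full-image mask (hypothetical margin and size).
mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:30, 40:50] = 1
image = np.random.default_rng(0).random((64, 64, 3))
crop, mask_crop, box = crop_and_resize(image, mask, margin=8, size=32)
```

Because the dataset stores full-image masks only, the box is derived at load time; returning it alongside the crop lets a later step place the refined crop back at the same location.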
2. Data Collection and Annotation Pipeline
Refine-30K employs distinct but principled pipelines for reference-based and reference-free samples:
- Reference-based (20K):
- Cross-image grounding: Given pairs of a reference image and a clean target image, the Gemini3 vision-language model (VLM) identifies the principal object in the reference image and grounds its location in the clean image via a bounding box.
- Segmentation: SAM3, guided by the Gemini3 object description and bounding box, produces an object mask in the clean image.
- Scribble-based degradation: Free-form scribbles are sampled inside a dilation of the object mask, creating the scribble mask M. Inpainting the clean image inside M yields the degraded input, ensuring clean/degraded pairs that differ only inside M.
- Instruction synthesis: The Gemini3 VLM produces region-specific instructions tied to M and reference features (e.g., “Restore the object in the scribble area to match the reference.”).
- Reference-free (10K):
- Object localization: Gemini3 proposes candidate salient boxes and captions in a single image; one is sampled at random.
- Segmentation and degradation as above: M is defined using SAM3 and scribble sampling, then the degraded input is generated by inpainting inside M.
- VLM-based validation: Gemini3 ensures the synthetic degradation is both perceptible and semantically valid; faulty instances are discarded.
- Instruction generation: Directly tied to the object description, e.g., “Restore the missing text strokes on the book cover.”
Region sizes vary from highly localized logos covering a small fraction of the image area to large objects. Exact source dataset names and region area distributions are not reported.
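The scribble-based degradation step can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the random-walk stroke model, the square dilation, all parameter values, and the `inpaint_fn` callback, which stands in for the real inpainting models. The one property taken from the description above is that the degraded image differs from the clean one only inside the scribble mask M.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def sample_scribble_mask(object_mask, rng, n_strokes=3, n_steps=40, brush=2, dilation=4):
    """Draw random-walk strokes and keep only the part inside the dilated object mask."""
    region = dilate(object_mask, dilation)
    ys, xs = np.nonzero(region)
    if ys.size == 0:
        raise ValueError("object mask is empty")
    h, w = object_mask.shape
    strokes = np.zeros((h, w), dtype=bool)
    for _ in range(n_strokes):
        i = rng.integers(ys.size)  # each stroke starts inside the region
        y, x = int(ys[i]), int(xs[i])
        for _ in range(n_steps):
            strokes[y, x] = True
            dy, dx = rng.integers(-2, 3, size=2)
            y = int(np.clip(y + dy, 0, h - 1))
            x = int(np.clip(x + dx, 0, w - 1))
    # Thicken the strokes, then restrict them to the dilated object region.
    return (dilate(strokes, brush) & region).astype(np.uint8)


def degrade(clean, scribble, inpaint_fn):
    """Replace pixels inside the scribble mask with inpainted content only."""
    filled = inpaint_fn(clean, scribble)
    inside = scribble.astype(bool)[..., None]
    return np.where(inside, filled, clean)


# Usage: a flat-colour fill is a stand-in for a real inpainting model.
rng = np.random.default_rng(0)
clean = rng.random((48, 48, 3))
object_mask = np.zeros((48, 48), dtype=np.uint8)
object_mask[16:32, 16:32] = 1
M = sample_scribble_mask(object_mask, rng)
degraded = degrade(clean, M, lambda img, m: np.full_like(img, img.mean()))
```

Compositing with `np.where` rather than trusting the inpainter's output outside M is what guarantees pixel-identical backgrounds in the clean/degraded pair.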
3. Data Formats, Resolutions, and Region Encoding
All images (degraded input, reference, and ground truth) are stored at the model’s native resolution, as dictated by the VAE encoder. Masks match this resolution exactly. During Focus-and-Refine training and inference, a bounding box around M with a fixed margin is extracted, and both the crop and its mask are resized to the model’s input resolution before encoding.
Region encoding adopts binary masks M generated from SAM3 object segments, augmented with free-form scribble strokes. This design supports flexible, non-rectangular localizations well suited for thin structures, text, logos, or irregular object boundaries.
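Since refinement operates on a crop while the background must stay untouched, the refined crop has to be written back into the full image under the mask. The source does not describe how this is done, so the following is only a sketch of one simple option: a hard, mask-gated paste. The function name `paste_back` and the `(top, left, bottom, right)` box convention are hypothetical, and the crop is assumed to have already been resized back to the box size.

```python
import numpy as np


def paste_back(full_image, refined_crop, mask, box):
    """Write refined pixels into the full image, but only where the mask is set.

    `box` is (top, left, bottom, right) with exclusive bottom/right, and
    `refined_crop` must already have the box's height and width.
    """
    top, left, bottom, right = box
    if refined_crop.shape[:2] != (bottom - top, right - left):
        raise ValueError("refined crop does not match the box size")
    out = full_image.copy()  # never modify the caller's image
    inside = mask[top:bottom, left:right].astype(bool)[..., None]
    out[top:bottom, left:right] = np.where(inside, refined_crop, out[top:bottom, left:right])
    return out


# Usage: pixels outside the mask are copied from the input unchanged.
full = np.zeros((16, 16, 3))
mask = np.zeros((16, 16), dtype=np.uint8)
mask[4:8, 4:8] = 1
refined = np.ones((8, 8, 3))
result = paste_back(full, refined, mask, box=(2, 2, 10, 10))
```

A hard paste makes background preservation exact by construction, at the cost of possible seams at the mask boundary; a feathered mask would trade some of that exactness for smoother transitions.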
4. Benchmarks and Performance Evaluation: RefineEval
Refine-30K is released alongside RefineEval, a 67-case benchmark suite (31 reference-based, 36 reference-free) supporting rigorous, multifaceted evaluation:
- For each case, six degraded variants are generated using three inpainting backbones (Flux-fill, SDXL, Qwen-Edit) and two mask permutations, yielding 402 evaluation inputs.
- Reference-based metrics (computed over the masked region M):
- MSE: pixelwise mean squared error on the region-of-interest.
- LPIPS: learned perceptual similarity metric.
- VGG: distance in VGG feature space.
- DINO and CLIP: cosine similarity in feature/embedding space.
- SSIM: structural similarity index.
- Background consistency (the complement of M): MSE, LPIPS, and SSIM between the output and the original input outside M.
- Reference-free metrics:
- Subjective LLM-based assessment (Gemini2.5-Pro) across five axes: Visual Quality, Naturalness, Aesthetics, Fine-detail fidelity, and Instruction faithfulness (mean ratings).
Numerical results are averaged over all 402 evaluation inputs. Subjective and objective metrics together probe both restoration fidelity inside M and preservation of the background outside M.
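The split between edited-region fidelity and background consistency can be made concrete with MSE, the simplest of the listed metrics. The sketch below is illustrative only: the function names are hypothetical, and LPIPS, SSIM, and the feature-space metrics would need their own libraries and are omitted. It scores the output against the ground truth inside M and against the input outside M.

```python
import numpy as np


def masked_mse(a: np.ndarray, b: np.ndarray, region: np.ndarray) -> float:
    """Mean squared error over the pixels where `region` is nonzero."""
    region = region.astype(bool)
    if not region.any():
        raise ValueError("region is empty")
    diff = (a.astype(np.float64) - b.astype(np.float64)) ** 2
    return float(diff[region].mean())


def region_scores(output, ground_truth, degraded_input, mask):
    """Edited-region error vs. ground truth and background error vs. the input."""
    inside = mask.astype(bool)
    return {
        "edit_mse": masked_mse(output, ground_truth, inside),
        "background_mse": masked_mse(output, degraded_input, ~inside),
    }


# Usage: a "model" that returns its input unchanged keeps the background
# intact but leaves the full degradation inside the mask.
ground_truth = np.zeros((8, 8, 3))
degraded = ground_truth.copy()
degraded[2:4, 2:4] = 1.0
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:4, 2:4] = 1
scores = region_scores(degraded, ground_truth, degraded, mask)
```

Reporting the two numbers separately is what allows a benchmark to distinguish a model that restores the region but drifts elsewhere from one that preserves the background but fails to restore.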
5. Dataset Examples and Region Types
Examples provided in the dataset documentation illustrate typical region-localized refinements:
- Reference-based: Restoring a scratched product logo, where M targets only the damaged logo region, the degraded input is generated by introducing blur/artifacts, and guidance comes from a true reference image.
- Reference-free: Sharpening a blurred street sign, using only in-image semantically grounded instructions.
Masks correspond to natural object shapes blended with scribble strokes. Segmentations often capture structurally meaningful boundaries, e.g., text, logos, or thin structures. Region types thus include broad objects, fine strokes, and semantically complex areas.
6. Applications, Licensing, and Limitations
Refine-30K is intended for:
- Training region-aware diffusion/refinement models requiring strict invariance in non-edited regions.
- Benchmarking and comparing local refinement capabilities, especially for models supporting instruction-driven editing and/or cross-image guidance.
- Extension to video or multi-step refinement by combining framewise Refine-30K labels with temporal consistency constraints.
- Research into advanced region encoding/disambiguation (scribbles, segmentation rather than bounding boxes) and seamless region paste-back.
The overall dataset license is not specified; compliance with upstream image and model licenses is required. The paper does not specify explicit statistics for mask area or object class distribution.
7. Context and Significance in Local Image Editing Research
Refine-30K operationalizes the region-specific image refinement paradigm, a setting contrasted with prior instruction-driven editing which typically targets global or coarse-grained edits and often disrupts unedited regions. Refine-30K’s emphasis on strict background preservation, diverse region scales, and support for both reference-based and open-ended, instruction-only scenarios sets it apart from existing datasets. Its deployment with RefineEval provides, for the first time, benchmarks that decouple edited-region fidelity from background drift.
A plausible implication is that Refine-30K will catalyze advances in high-precision, fine-grained modeling. The benchmarks in "RefineAnything" (Zhou et al., 8 Apr 2026) point in this direction: the Focus-and-Refine method builds directly on this dataset, reports strong improvements over prior baselines, and demonstrates near-perfect background preservation. The dataset is also suitable for broader use in local image restoration tasks, including structured object repair, fine-detail restoration, and region-aware generative refinement.