
Refine-30K: Dataset for Local Image Refinement

Updated 14 April 2026
  • Refine-30K is a comprehensive dataset that provides finely annotated samples for region-specific image refinement with both reference-based and reference-free modalities.
  • It supports local detail restoration through precise segmentation masks, synthesized instructions, and high-resolution imagery for rigorous model training.
  • The dataset underpins robust benchmarking with objective metrics and subjective evaluations, advancing fine-grained local image restoration research.

Refine-30K is a large-scale, supervised dataset specifically designed for training and evaluating region-specific image refinement models. Its primary focus is supporting the restoration of fine-grained local details within user-indicated regions while strictly preserving the background in all unedited areas. Refine-30K introduces both reference-based and reference-free modalities with high-resolution, multimodal data and finely annotated object and scribble masks, accompanied by explicit task instructions. It supports systematic benchmarking of fine-grained local image restoration tasks, providing a foundation for precise local refinement performance measurement and region-aware model development (Zhou et al., 8 Apr 2026).

1. Dataset Structure and Modalities

Refine-30K is deliberately constructed to address region-specific image refinement in both reference-guided and reference-free contexts. It contains 30,000 labeled samples split as follows:

| Subset | #Samples | Reference Image | Mask Type | Instruction Source |
|---|---|---|---|---|
| Reference-based | 20,000 | Yes | Scribble mask $M$ | VLM description |
| Reference-free | 10,000 | No | Scribble mask $M$ | VLM description |

Each sample is multi-component:

  • Reference-based: $(I, I^{ref}, I^*, M, y)$, where $I$ is a degraded input image, $I^{ref}$ a reference image depicting the same subject, $I^*$ the clean ground truth, $M \in \{0,1\}^{H\times W}$ a binary scribble mask, and $y$ a VLM-generated text instruction.
  • Reference-free: $(I, I^*, M, y)$, omitting the external reference image but retaining all other modalities.
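To make the two tuple layouts concrete, the following is a minimal sketch of an in-memory sample container. The class name, field names, and validation rules are illustrative assumptions; the source does not describe the dataset's on-disk schema.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class RefineSample:
    """One sample. All names here are hypothetical, not the dataset's schema."""

    degraded: np.ndarray                    # I, shape (H, W, 3)
    target: np.ndarray                      # I^*, shape (H, W, 3)
    mask: np.ndarray                        # M, shape (H, W), values in {0, 1}
    instruction: str                        # y
    reference: Optional[np.ndarray] = None  # I^ref, reference-based subset only

    def __post_init__(self) -> None:
        if self.degraded.shape != self.target.shape:
            raise ValueError("I and I^* must share a shape")
        if self.mask.shape != self.degraded.shape[:2]:
            raise ValueError("M must match the image height and width")
        if not np.isin(self.mask, (0, 1)).all():
            raise ValueError("M must be binary")

    @property
    def is_reference_based(self) -> bool:
        # A sample counts as reference-based when a reference image is attached.
        return self.reference is not None
```

Treating the reference image as an optional field lets one loader serve both subsets, at the cost of a runtime check wherever the reference is required.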

Masks $M$ strictly delimit the editable region. A bounding box $B = \text{BBox}(M)$, expanded by a fixed margin, is used for on-the-fly crop-and-resize during model training and inference, but the raw dataset contains full-image masks only.
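The bounding-box step can be sketched in a few lines of NumPy. The function name is hypothetical, and the margin is treated here as an absolute pixel count clipped to the image bounds, which is an assumption rather than the documented convention.

```python
import numpy as np


def expanded_bbox(mask, margin):
    """Return (top, left, bottom, right), half-open, for BBox(M) grown by `margin`.

    Assumption: `margin` is a pixel count and the box is clipped to the image.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    h, w = mask.shape
    top = max(int(ys.min()) - margin, 0)
    left = max(int(xs.min()) - margin, 0)
    bottom = min(int(ys.max()) + 1 + margin, h)
    right = min(int(xs.max()) + 1 + margin, w)
    return top, left, bottom, right
```

Because the raw dataset stores full-image masks, a helper of this kind would run at load time rather than being baked into the stored samples.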

2. Data Collection and Annotation Pipeline

Refine-30K employs distinct but principled pipelines for reference-based and reference-free samples:

  • Reference-based (20K):
    1. Cross-image grounding: Given $(I^{ref}, I^*)$ pairs, the Gemini3 vision-language model (VLM) identifies the principal object in $I^{ref}$ and grounds its location in $I^*$ via a bounding box.
    2. Segmentation: SAM3, guided by the Gemini3 object description and bounding box, produces an object mask in $I^*$.
    3. Scribble-based degradation: Free-form scribbles are sampled inside a dilation of the object mask, creating mask $M$. Inpainting $I^*$ inside $M$ yields the degraded input $I$, ensuring clean/degraded pairs that differ only inside $M$.
    4. Instruction synthesis: The Gemini3 VLM produces region-specific instructions tied to $M$ and reference features (e.g., “Restore the object in the scribble area to match the reference.”).
  • Reference-free (10K):
    1. Object localization: Gemini3 proposes candidate salient boxes and captions in a single image $I^*$; one is sampled at random.
    2. Segmentation and degradation as above: $M$ is defined using SAM3 and scribble sampling, then $I$ is generated by inpainting inside $M$.
    3. VLM-based validation: Gemini3 checks that the synthetic degradation is both perceptible and semantically valid; faulty instances are discarded.
    4. Instruction generation: Directly tied to the object description, e.g., “Restore the missing text strokes on the book cover.”
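The scribble-sampling step can be illustrated with a small NumPy sketch. It assumes an object mask is already available (the SAM3 and inpainting calls are not shown), and the random-walk stroke model, the 3x3 cross dilation, and every parameter name are illustrative assumptions rather than the procedure used to build the dataset.

```python
import numpy as np


def dilate(mask, iterations):
    """Binary dilation with a 3x3 cross, applied `iterations` times."""
    out = mask.astype(bool)
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        out = (
            padded[1:-1, 1:-1]
            | padded[:-2, 1:-1] | padded[2:, 1:-1]
            | padded[1:-1, :-2] | padded[1:-1, 2:]
        )
    return out


def sample_scribble_mask(object_mask, rng, dilation=3, strokes=4, steps=40, radius=1):
    """Draw random-walk strokes confined to a dilation of the object mask."""
    allowed = dilate(object_mask, dilation)
    ys, xs = np.nonzero(allowed)
    if ys.size == 0:
        raise ValueError("object mask is empty")
    h, w = allowed.shape
    scribble = np.zeros((h, w), dtype=bool)
    for _ in range(strokes):
        start = rng.integers(ys.size)          # random start inside the allowed area
        y, x = int(ys[start]), int(xs[start])
        for _ in range(steps):
            # Stamp a small square brush at the current position.
            scribble[max(y - radius, 0):y + radius + 1,
                     max(x - radius, 0):x + radius + 1] = True
            dy, dx = rng.integers(-2, 3, size=2)
            ny, nx = y + int(dy), x + int(dx)
            # Only move if the step stays inside the dilated object region.
            if 0 <= ny < h and 0 <= nx < w and allowed[ny, nx]:
                y, x = ny, nx
    # Clip brush overshoot so the scribble never leaves the allowed area.
    return (scribble & allowed).astype(np.uint8)
```

A call such as `sample_scribble_mask(obj, np.random.default_rng(0))` returns a binary mask with the same shape as `obj`; an inpainting model would then be applied inside that mask to produce the degraded input.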

Region sizes vary from highly localized logos, which occupy a small share of the image area, to large objects. Exact source dataset names and region area distributions are not reported.

3. Data Formats, Resolutions, and Region Encoding

All images ($I$, $I^{ref}$, $I^*$) are stored at the model’s native resolution, as dictated by the VAE encoder. Masks match this resolution exactly. During Focus-and-Refine training and inference, the bounding box $B$, expanded by a fixed margin, is extracted, and both the crop and its mask are resized before encoding.
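The crop-and-resize step might look like the sketch below. The function names, the square target `size`, and the use of nearest-neighbour interpolation are assumptions made to keep the example dependency-free; a real pipeline would more likely use a smoother filter for the image and keep nearest-neighbour for the mask only.

```python
import numpy as np


def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    h, w = arr.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return arr[rows[:, None], cols[None, :]]


def focus_crop(image, mask, box, size):
    """Crop image and mask to `box` = (top, left, bottom, right), then resize both."""
    top, left, bottom, right = box
    crop = image[top:bottom, left:right]
    crop_mask = mask[top:bottom, left:right]
    return resize_nearest(crop, size, size), resize_nearest(crop_mask, size, size)
```

Resizing the mask with nearest-neighbour sampling keeps it strictly binary, which matters if the mask is later used to gate which pixels may change.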

Region encoding adopts binary masks $M \in \{0,1\}^{H\times W}$ generated from SAM3 object segments, augmented with free-form scribble strokes. This design supports flexible, non-rectangular localizations well suited for thin structures, text, logos, or irregular object boundaries.

4. Benchmarks and Performance Evaluation: RefineEval

Refine-30K is released alongside RefineEval, a 67-case benchmark suite (31 reference-based, 36 reference-free) supporting rigorous, multifaceted evaluation:

  • For each case, six degraded variants are generated using three inpainting backbones (Flux-fill, SDXL, Qwen-Edit) and two mask permutations, yielding 402 evaluation inputs.
  • Reference-based metrics (computed over the region $M$):
    • MSE: pixelwise mean squared error on the region of interest.
    • LPIPS: learned perceptual similarity metric.
    • VGG: distance in VGG feature space.
    • DINO and CLIP: cosine similarity in feature/embedding space.
    • SSIM: structural similarity index.
    • Background consistency (region outside $M$): MSE, LPIPS, and SSIM comparing the output and the original input outside $M$.
  • Reference-free metrics:
    • Subjective LLM-based assessment (Gemini2.5-Pro) across five axes: Visual Quality, Naturalness, Aesthetics, Fine-detail fidelity, and Instruction faithfulness (mean ratings).
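As a rough illustration of how region fidelity and background consistency can be separated, the sketch below computes only the MSE variants. It assumes the edited region is scored against the clean ground truth $I^*$ and the background against the degraded input $I$; the function names are hypothetical, and the learned metrics (LPIPS, VGG, DINO, CLIP) are omitted because they require pretrained networks.

```python
import numpy as np


def masked_mse(a, b, region):
    """Mean squared error over the pixels where `region` is true."""
    region = region.astype(bool)
    if not region.any():
        raise ValueError("region is empty")
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.mean(diff[region] ** 2))


def region_and_background_mse(output, target, degraded, mask):
    """Score the edit inside M against I^*, and the background against I."""
    m = mask.astype(bool)
    return {
        "region_mse": masked_mse(output, target, m),
        "background_mse": masked_mse(output, degraded, ~m),
    }
```

A model that leaves the background untouched scores exactly zero on `background_mse`, regardless of how well it restores the masked region.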

Numerical results are averaged over all 402 evaluation inputs. Subjective and objective metrics together probe both restoration fidelity inside $M$ and preservation of the background outside $M$.
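The count of 402 follows from 67 cases × 3 backbones × 2 mask permutations. The enumeration below is a sketch of that arithmetic; the mask labels and case indices are placeholders, not identifiers from the benchmark.

```python
from itertools import product

BACKBONES = ("Flux-fill", "SDXL", "Qwen-Edit")
MASK_VARIANTS = ("mask_a", "mask_b")  # hypothetical labels for the two permutations
CASES = (
    [("reference-based", i) for i in range(31)]
    + [("reference-free", i) for i in range(36)]
)


def evaluation_inputs():
    """Enumerate every (subset, case index, backbone, mask variant) combination."""
    return [
        (subset, idx, backbone, variant)
        for (subset, idx), backbone, variant in product(CASES, BACKBONES, MASK_VARIANTS)
    ]


def mean_over_inputs(scores):
    """Unweighted mean, matching 'averaged over all evaluation inputs'."""
    return sum(scores) / len(scores)
```

Here `len(evaluation_inputs())` is 402, of which 186 entries belong to the reference-based cases and 216 to the reference-free cases.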

5. Dataset Examples and Region Types

Examples provided in the dataset documentation illustrate typical region-localized refinements:

  • Reference-based: Restoring a scratched product logo, where $M$ targets only the damaged logo region, the degraded input $I$ is generated by introducing blur/artifacts, and guidance comes from a true reference image $I^{ref}$.
  • Reference-free: Sharpening a blurred street sign, using only in-image semantically grounded instructions.

Masks correspond to natural object shapes blended with scribble strokes. Segmentations often capture structurally meaningful boundaries, e.g., text, logos, or thin structures. Region types thus include broad objects, fine strokes, and semantically complex areas.

6. Applications, Licensing, and Limitations

Refine-30K is intended for:

  • Training region-aware diffusion/refinement models requiring strict invariance in non-edited regions.
  • Benchmarking and comparing local refinement capabilities, especially for models supporting instruction-driven editing and/or cross-image guidance.
  • Extension to video or multi-step refinement by combining framewise Refine-30K labels with temporal consistency constraints.
  • Research into advanced region encoding/disambiguation (scribbles, segmentation rather than bounding boxes) and seamless region paste-back.
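One simple way to enforce invariance outside the edited region is a hard-mask paste-back, sketched below. It assumes the refined crop has already been resized to the box size; the function name is hypothetical, and a seamless variant would blend near the mask boundary, which this sketch does not attempt.

```python
import numpy as np


def paste_back(image, refined_crop, crop_mask, box):
    """Copy refined pixels into `image` only where the crop mask is set.

    `box` is (top, left, bottom, right), half-open; `refined_crop` and
    `crop_mask` must already have the box's height and width.
    """
    top, left, bottom, right = box
    out = image.copy()
    window = out[top:bottom, left:right]   # a view into `out`
    m = crop_mask.astype(bool)
    window[m] = refined_crop[m]            # pixels outside the mask stay untouched
    return out
```

Because only masked pixels are written, every pixel outside $M$ is bit-identical to the input, so any background metric computed against the input is zero by construction.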

The overall dataset license is not specified; compliance with upstream image and model licenses is required. The paper does not specify explicit statistics for mask area or object class distribution.

7. Context and Significance in Local Image Editing Research

Refine-30K operationalizes the region-specific image refinement paradigm, a setting contrasted with prior instruction-driven editing which typically targets global or coarse-grained edits and often disrupts unedited regions. Refine-30K’s emphasis on strict background preservation, diverse region scales, and support for both reference-based and open-ended, instruction-only scenarios sets it apart from existing datasets. Its deployment with RefineEval provides, for the first time, benchmarks that decouple edited-region fidelity from background drift.

A plausible implication is that Refine-30K will catalyze advances in high-precision, fine-grained modeling. The benchmarks in "RefineAnything" (Zhou et al., 8 Apr 2026) support this: the Focus-and-Refine method builds directly on the dataset, reports strong improvements over prior baselines, and demonstrates near-perfect background preservation. The dataset is also suitable for broader use in local image restoration tasks, including structured object repair, fine-detail restoration, and region-aware generative refinement.

References (1)
