PAL-Set: Perceptual Artifact Localization Benchmark
- PAL-Set is a comprehensive dataset featuring 10,168 generated images with precise, per-pixel binary masks to benchmark artifact localization.
- The dataset employs a rigorous annotation protocol with high inter-annotator agreement (κ≈0.82) across diverse generative tasks and models.
- It supports various applications including automatic inpainting, image quality assessment, and cross-model artifact detection for generative systems.
PAL-Set Dataset
PAL-Set refers to multiple large-scale, rigorously annotated datasets in machine learning, each serving as a benchmark in distinct domains: (1) Perceptual artifact localization in image synthesis, (2) Memory-based personalization for dialogue systems, and (3) Robotics pallet detection via 2D rangefinder scans. The most widely referenced instance is the “Perceptual Artifacts Localization Set” (Zhang et al., 2023), designed for fine-grained, per-pixel localization of image synthesis artifacts across generative models and tasks. Separate unrelated datasets using the PAL-Set acronym have also appeared in robotics (Mohamed et al., 2018) and dialogue personalization (Huang et al., 17 Nov 2025). This entry primarily concerns the image synthesis corpus, but contrasts and cross-references all three for unambiguous identification.
1. Conceptual Scope and Purpose
PAL-Set (“Perceptual Artifacts Localization Set”) (Zhang et al., 2023) is constructed to address the need for quantitative, per-region evaluation of image synthesis artifacts. It comprises 10,168 generated images paired with dense, pixel-level binary masks demarcating all “perceptual artifacts”—regions that are visually implausible, unpleasant, or inconsistent with correct real-world content, as identified by expert human raters. The dataset spans ten diverse synthesis tasks, encapsulating both unconditional and complex conditional generation. Its primary aim is to standardize artifact localization training and benchmarking, enabling robust segmentation model development, cross-task analysis, and research into artifact remediation.
2. Composition, Acquisition, and Annotation
Image Sourcing and Task Breakdown
PAL-Set comprises images synthesized by mainstream generative methods, including but not limited to: StyleGAN2, Latent Diffusion Model (LDM), Anyres GAN, Real-ESRGAN, PITI (for inpainting, edge-to-image, and mask-to-image), latent space composition, C-VTON (virtual try-on), and portrait shadow removal pipelines. Each task contributes between 1,005 and 1,025 images (about 1,017 on average), for a total of 10,168 examples.
| Task (Model) | Images | Type |
|---|---|---|
| StyleGAN2 | 1,020 | Unconditional |
| LDM (LSUN) | 1,005 | Unconditional |
| Anyres GAN | 1,010 | Unconditional |
| Real-ESRGAN | 1,020 | Super-Resolution |
| PITI (Inpaint) | 1,018 | Inpainting |
| PITI (E2I) | 1,022 | Edge-to-Image |
| PITI (M2I) | 1,023 | Mask-to-Image |
| Latent Comp. | 1,017 | Composition |
| C-VTON | 1,008 | Virtual Try-on |
| Portrait Shadow | 1,025 | Shadow Removal |
Generated images are kept at their native resolutions (either 512×512 or 1024×1024), with each sample accompanied by a precisely registered binary annotation mask.
Annotation Protocols and Quality Control
A “perceptual artifact” is operationalized as any region judged either (a) implausible or visually unpleasant, or (b) readily correctible by an ideal inpainting oracle. Expert annotators employ free-form mask painting tools to segment all such regions, erring on the side of generous coverage. Each pixel is given a binary label (artifact vs. non-artifact).
Quality assurance includes double annotation of a random subset of about 500 images. Inter-annotator agreement on this subset, measured with Cohen’s κ, is κ ≈ 0.82, which indicates strong annotation consistency and suggests that the masks are reliable enough for segmentation evaluation.
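The agreement statistic can be illustrated with a minimal sketch of Cohen's κ for two annotators' binary masks. It assumes each mask has been flattened to a sequence of 0/1 pixel labels; the function name `cohens_kappa` is illustrative and not part of any PAL-Set release, and how pixels were pooled across the double-annotated subset is not specified here.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two binary (0/1) label sequences of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of pixels where both annotators agree.
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n
    # Marginal rate at which each annotator marks a pixel as artifact.
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    # Agreement expected by chance if the two annotators were independent.
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_e == 1.0:
        # Both annotators used one identical label everywhere (assumed -> 1.0).
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

For example, two annotators who agree on 6 of 8 pixels while each marking 3 of 8 as artifact obtain κ = 7/15 ≈ 0.47, noticeably lower than the raw 75% agreement.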
Parallel datasets, notably the “PAL-Set” for personalized dialogue (Huang et al., 17 Nov 2025) and 2D pallet scanning (Mohamed et al., 2018), involve distinct generation and annotation workflows (LLM-driven synthesis or manual scan labeling) and do not overlap with the scope or protocol of the image synthesis artifact dataset.
3. Dataset Format, Structure, and Access
PAL-Set is formatted as a hierarchical file system:
    /PAL-Set/
        /train/
            /stylegan2/
                stylegan2_0001.png
                stylegan2_0001_mask.png
                ...
            /ldm/
            /anyres/
            ...
        /val/
        /test/
        metadata.json
- Image–mask pairs: Each sample comprises an image (“<task>_<####>.png”) and a binary mask (“<task>_<####>_mask.png”), for example stylegan2_0001.png and stylegan2_0001_mask.png.
- Splits: Random 80%/10%/10% division per task into train, val, and test (approx. 8,134/1,017/1,017 samples respectively).
- Metadata: Task-specific JSON files provide model checkpoints, random seeds, and generation parameters. All images are unprocessed prior to annotation.
- Pixel labeling: Masks use the convention 0 (background), 255 (artifact), stored as PNG.
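The layout described above can be traversed with a short helper. The following is a sketch only: it assumes the split/task directory structure and the `_mask.png` naming convention described in this section, and the function name `collect_pairs` is hypothetical rather than part of any released tooling.

```python
from pathlib import Path
from typing import Dict, List, Tuple

MASK_SUFFIX = "_mask.png"


def collect_pairs(root: Path, split: str) -> Dict[str, List[Tuple[Path, Path]]]:
    """Map each task directory under root/split to its (image, mask) pairs."""
    pairs: Dict[str, List[Tuple[Path, Path]]] = {}
    task_dirs = sorted(p for p in (root / split).iterdir() if p.is_dir())
    for task_dir in task_dirs:
        task_pairs = []
        for mask in sorted(task_dir.glob("*" + MASK_SUFFIX)):
            # stylegan2_0001_mask.png -> stylegan2_0001.png
            image = mask.with_name(mask.name[: -len(MASK_SUFFIX)] + ".png")
            if image.is_file():
                # Masks without a matching image are skipped, not reported.
                task_pairs.append((image, mask))
        pairs[task_dir.name] = task_pairs
    return pairs
```

Calling `collect_pairs(Path("/PAL-Set"), "train")` would return a dictionary keyed by task directory name (such as `stylegan2`), which can then be wrapped in a PyTorch or TensorFlow dataset class. When decoding masks, pixel value 255 denotes artifact and 0 denotes background.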
Synthetic dialogue (Huang et al., 17 Nov 2025) and LiDAR scan (Mohamed et al., 2018) PAL-Set instances adopt hierarchical directories with JSONL (logs/dialogues) and MAT/TXT (scans, images, ROIs) formats, respectively—detailed schemas ensure ease of programmatic access.
4. Statistical Characterization and Evaluation Benchmarks
PAL-Set supports both per-image and aggregate statistical analyses:
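- Perceptual Artifacts Ratio (PAR): the fraction of an image's pixels labeled as artifact. For a binary mask $M$ of size $H \times W$ with $M_{ij} \in \{0, 1\}$,
$$\mathrm{PAR} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} M_{ij}.$$
Mean PAR for tasks varies, e.g., Inpaint (PITI) 0.20±0.09, StyleGAN2 0.12±0.05, Edge-to-Image (PITI) 0.22±0.10.
- Segmentation metrics: Given $P$ (predicted artifact pixels) and $G$ (ground truth),
$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}.$$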
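Both statistics are simple to compute from masks. The sketch below assumes PAR is the fraction of pixels labeled as artifact and that IoU is the intersection over union of the predicted and ground-truth artifact regions; masks are represented as rows of 0/255 values, and the helper names `par` and `iou` are illustrative. Pure Python is used for self-containment; array libraries such as NumPy would be the usual choice in practice.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # rows of pixel values: 0 (background) or 255 (artifact)


def par(mask: Mask) -> float:
    """Perceptual Artifacts Ratio: artifact pixels divided by all pixels."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("empty mask")
    artifact = sum(1 for row in mask for value in row if value > 0)
    return artifact / total


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of the artifact regions of two same-sized masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p > 0, g > 0
            inter += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0
```

How per-image IoU values are averaged into the reported mIoU (per image, per class, or over pooled pixels) is not specified here, so that step is deliberately left out.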
Benchmark mean IoU (mIoU) for representative models on the test split:
| Model | StyleGAN2 | LDM | SR | Comp. |
|---|---|---|---|---|
| Patch-Forensics | 9.08% | 1.34% | 9.63% | 2.14% |
| PAL4Inpaint | 0.98% | 0.81% | 14.42% | 15.94% |
| Specialist (ours) | 35.39% | 14.41% | 37.44% | 25.31% |
| Unified (ours) | 30.86% | 11.92% | 38.07% | 29.53% |
Specialist models (trained per task) outperform the unified model on StyleGAN2 and LDM, while the unified model scores higher on super-resolution (SR) and composition.
Generalization trials on unseen models (e.g., StyleGAN3, BlobGAN, VersaDiffusion) report zero-shot mIoU in the 6–25% range, with fine-tuning on 10 images rapidly boosting performance to 20–35% mIoU.
The dialogue and LiDAR PAL-Set instances provide correlational and coverage statistics (dialogue turns/session, scan diversity), but not image segmentation metrics.
5. Algorithmic Applications and Benchmarks
Core use cases demonstrated with PAL-Set include:
- Artifact Segmentation: Supervised training of deep segmentation architectures (Swin-T + UPerNet + FCN) for per-pixel artifact localization.
- Image Restoration: Automated inpainting (LaMa, CoMod-GAN, DALL·E 2) after mask-based artifact localization. A “zoom-in” padding/inpainting/blending pipeline leverages PAL-Set masks. User studies show that retouched outputs are preferred in 6/10 generative tasks (p < 0.05).
- No-reference Image Quality Assessment: The PAR serves as an interpretable scalar IQA score; user agreement with PAR ordering reaches 74.5% (StyleGAN2) and 63.9% (Stable Diffusion), exceeding SOTA blind IQA (58–61%).
- Abnormal Region Detection: Artifact segmenters, trained on PAL-Set, flag rare distractors (e.g., watermarks) in real images with low false-positive rates.
- Cross-Model Transfer: Models trained on PAL-Set demonstrate credible artifact localization on previously unseen architectures and image domains.
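The cropping step of a “zoom-in” restoration pipeline can be sketched as follows. The details of the actual pipeline are not given here, so this only illustrates one plausible first step: finding the bounding box of the artifact mask, growing it by a padding margin, and clipping it to the image. The function name `padded_artifact_box` and the `pad` parameter are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def padded_artifact_box(mask: Sequence[Sequence[int]], pad: int) -> Optional[Box]:
    """Bounding box of artifact pixels (value > 0), grown by `pad` and clipped."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    # Indices of rows and columns that contain at least one artifact pixel.
    rows = [r for r, row in enumerate(mask) if any(v > 0 for v in row)]
    cols = [c for c in range(width) if any(row[c] > 0 for row in mask)]
    if not rows:
        return None  # no artifact pixels, so nothing to retouch
    top = max(rows[0] - pad, 0)
    left = max(cols[0] - pad, 0)
    bottom = min(rows[-1] + 1 + pad, height)
    right = min(cols[-1] + 1 + pad, width)
    return top, left, bottom, right
```

The returned box could then be used to crop the image and mask, pass the crop to an inpainting model, and blend the result back into the original at the same coordinates.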
In dialogue (Huang et al., 17 Nov 2025), PAL-Set underpins memory-augmented benchmarking with BLEU/GPT-4 metrics, solution selection scores, and Win–Tie–Lose head-to-heads, supporting research on retrieval-augmented systems. In robotics (Mohamed et al., 2018), PAL-Set enables framewise classification, ROI localization, and online tracking.
6. Limitations and Prospective Extensions
PAL-Set’s image synthesis variant provides only binary “artifact” masks, with no classification of artifact type. Annotator subjectivity, though mitigated by double annotation and a high κ, may still influence mask boundaries, especially in ambiguous cases. Training detailed artifact taxonomies or multi-class segmentation models would therefore likely require additional labeling effort.
Future extensions could include multi-class or severity ranking masks, expansion to higher-resolution and video synthesis, inclusion of per-pixel confidence, and richer metadata for provenance and synthesis conditions.
Synthetic dialogue PAL-Set (Huang et al., 17 Nov 2025) is limited by lack of real-world user data and the finite diversity of LLM-synthesized personas, while robotics PAL-Set (Mohamed et al., 2018) includes only single-pallet 2D LiDAR scenes and omits raw reflectivity, 3D, or multi-modal cues.
7. Access, Licensing, and Interoperability
PAL-Set images and masks are made available in standard PNG and JSON formats, suitable for direct integration with PyTorch/TensorFlow data pipelines (Zhang et al., 2023). The data and associated code are publicly released; the license terms are stated in the repository's LICENSE file (commonly cited as CC BY 4.0) and should be checked there before use. Complete citation information and download links are provided in the official repositories or associated papers.
Distinct PAL-Set datasets in dialogue (Huang et al., 17 Nov 2025) and robotics (Mohamed et al., 2018) are likewise released with open access and detailed format documentation, enabling further benchmarking across vision, interaction, and embodiment research contexts.