PartImageNet Dataset

Updated 1 April 2026

PartImageNet is a large-scale dataset that provides dense, per-pixel part segmentation annotations for 158 object classes, enhancing part-aware visual recognition.
It features a balanced distribution across 11 super-categories with detailed part labels to mitigate long-tail effects and support robust transfer learning.
Baseline experiments show improved semantic segmentation, few-shot learning, and syn-to-real transfer performance, illustrating its practical impact in visual recognition.

PartImageNet is a large-scale, high-quality dataset providing per-pixel part segmentation annotations across a broad spectrum of rigid and non-rigid object classes drawn from the ILSVRC-12 (ImageNet) taxonomy. It was designed to address the shortage of part-annotated data for generic objects, which was previously limited to human-centric datasets and a small set of animals, and it enables part-aware recognition, segmentation, and low-data learning at a scale and diversity not previously available for generic objects (He et al., 2021).

1. Dataset Scope and Structure

Classes and Super-Categories

PartImageNet consists of 158 object classes selected from ILSVRC-12, mapped into 11 super-categories following WordNet: Quadruped (46), Biped (17), Fish (10), Bird (14), Snake (15), Reptile (20), Car (23), Bicycle (6), Boat (4), Aeroplane (2), and Bottle (5). Of these, 118 are non-rigid or articulated (primarily animals) and 40 are rigid (vehicles, bottles, etc.).

Image Composition

After filtering out images with occlusion, scene clutter, multiple targets, or suboptimal viewpoints, 24,095 images were retained, an average of about 152 per class. The class frequency distribution is flat: the most populous class contains only 1.25× as many instances as the 50th-ranked class, which minimizes long-tail effects. Images per class range from several dozen in rare synsets up to a few hundred.

Part Granularity

PartImageNet emphasizes densely labeled part segmentation:

111,960 annotated segment instances (mean 4.65 parts/image)
Annotation density breakdown: 22% of images feature 1–2 visible parts; 58% have 3–6; 19% have 7–9; ≈2% have ≥10
Parts are defined compactly per super-category (e.g., Quadruped: head, body, foot, tail), with rare or optional parts merged to avoid zero-counts in minority synsets
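
The density breakdown above can be recomputed from per-image part counts. The following is a minimal sketch under stated assumptions: the function names and the sample input are illustrative and are not part of any released PartImageNet tooling.

```python
from collections import Counter

# Bucket edges follow the breakdown quoted above: 1-2, 3-6, 7-9, >=10 parts.
BUCKETS = [("1-2", 1, 2), ("3-6", 3, 6), ("7-9", 7, 9), (">=10", 10, None)]


def bucket_for(count):
    """Return the bucket label for one image's number of annotated parts."""
    for label, low, high in BUCKETS:
        if count >= low and (high is None or count <= high):
            return label
    raise ValueError(f"an image must have at least one part, got {count}")


def density_breakdown(part_counts):
    """Map each bucket label to the fraction of images that fall in it."""
    if not part_counts:
        raise ValueError("part_counts is empty")
    tally = Counter(bucket_for(c) for c in part_counts)
    return {label: tally[label] / len(part_counts) for label, _, _ in BUCKETS}


if __name__ == "__main__":
    counts = [2, 4, 5, 3, 8, 11, 1, 6]  # hypothetical per-image part counts
    print(density_breakdown(counts))
    print(sum(counts) / len(counts))    # mean parts per image for this sample
```

The same arithmetic applied to the dataset totals gives 111,960 / 24,095 ≈ 4.65 parts per image, matching the figure above.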

Dataset Splits

For single-object segmentation, class images are randomly split: 85% training (20,481), 5% validation (1,206), and 10% test (2,408). Few-shot experiments use a non-overlapping synset split: 109 train, 19 validation, and 30 test classes (He et al., 2021).
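
A random 85/5/10 split of one class's images might look like the sketch below. The source states only that class images are split randomly, so the per-class application, the rounding rule, the seed handling, and the function name `split_class_images` are all assumptions.

```python
import random


def split_class_images(image_ids, seed=0, train_frac=0.85, val_frac=0.05):
    """Randomly split one class's image IDs into train/val/test lists."""
    ids = sorted(image_ids)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # the remainder, about 10%
    }


if __name__ == "__main__":
    ids = [f"img_{i:04d}" for i in range(200)]  # hypothetical image IDs
    parts = split_class_images(ids)
    print({name: len(members) for name, members in parts.items()})
```

Assigning the remainder to the test split guarantees that every image lands in exactly one split, whatever the class size.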

2. Annotation Pipeline and Quality Control

Workflow

The annotation protocol is a three-stage process:

Annotation: annotators work under super-category-specific instructions in a web segmentation interface; only the relevant part labels are visible.
Inspection: a random sample of each annotator’s output is checked; if the defect rate exceeds a threshold, the batch is reassigned.
Examination: examiners (domain experts) define part boundaries, handle edge cases (e.g., occlusion treatment), perform final audits, and adjudicate disputes.
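
The inspection stage amounts to a sampled spot check. The sketch below illustrates the idea only: the sample size, the threshold value, and the `is_defective` callback are placeholders, since the source does not state the actual values or tooling.

```python
import random


def inspect_batch(batch, is_defective, sample_size=20, max_defect_rate=0.1, seed=0):
    """Spot-check a random sample of one annotator's batch.

    Returns (defect_rate, accepted). A batch that is not accepted would be
    reassigned. sample_size and max_defect_rate are placeholder values.
    """
    if not batch:
        raise ValueError("batch is empty")
    rng = random.Random(seed)
    sample = rng.sample(batch, min(sample_size, len(batch)))
    defects = sum(1 for item in sample if is_defective(item))
    defect_rate = defects / len(sample)
    # The batch passes only if the defect rate does not exceed the threshold.
    return defect_rate, defect_rate <= max_defect_rate


if __name__ == "__main__":
    batch = list(range(100))            # hypothetical annotation IDs
    flagged = {3, 17, 42}               # hypothetical defective annotations
    print(inspect_batch(batch, lambda item: item in flagged))
```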

Definitions and Consistency

Parts are defined at the super-category level for semantic salience and cross-class comparability. Annotators must label all visible, distinguishable parts with tight mask boundaries. Each super-category has a dedicated annotation group to prevent drift in part definitions.

Quality Criteria

Maximum information: exhaustive labeling of visible, separable parts
Precise boundaries: masks must accurately follow object edges
Stringent rejection: any image failing inspection/examination is fully re-annotated

3. Format, Access, and Parsing

Data Organization

Images: JPEG/PNG, path pattern PartImageNet/images/{split}/{class_synset}/{image_id}.jpg
Masks: 8-bit indexed PNGs, pattern PartImageNet/masks/{split}/{class_synset}/{image_id}.png; pixel value 0 denotes background, 1..K signify part IDs per part_mapping.json
Metadata: JSON file records synset, split, image dimensions, and part-ID→RGB color mapping

This structure can be read with standard Python image IO libraries (e.g., PIL, OpenCV). Parsing needs only an image library and Python's built-in json module (He et al., 2021).
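
A minimal parsing sketch for the layout described above follows. It assumes Pillow and NumPy are installed, that the dataset root is the `PartImageNet` directory from the path patterns, and that `part_mapping.json` is a flat object such as `{"1": "head"}`. The JSON schema, the helper names, and the placeholder synset and image IDs are assumptions rather than documented facts.

```python
import json
from pathlib import Path


def image_path(root, split, synset, image_id):
    """Build the image path from the pattern given above."""
    return Path(root) / "images" / split / synset / f"{image_id}.jpg"


def mask_path(root, split, synset, image_id):
    """Build the mask path from the pattern given above."""
    return Path(root) / "masks" / split / synset / f"{image_id}.png"


def load_mask(path):
    """Load an indexed PNG mask as a 2-D array of part IDs (0 = background)."""
    import numpy as np      # imported lazily so the path helpers work without it
    from PIL import Image
    with Image.open(path) as img:
        return np.array(img)  # for palette ("P") images this yields the indices


def load_part_names(path):
    """Read part_mapping.json, ASSUMED to be a flat {"1": "head", ...} object."""
    with open(path, encoding="utf-8") as f:
        return {int(key): name for key, name in json.load(f).items()}


if __name__ == "__main__":
    # Placeholder identifiers; substitute a real synset and image ID.
    print(mask_path("PartImageNet", "train", "n00000000", "n00000000_1"))
```

With a mask loaded, `numpy.unique(mask, return_counts=True)` lists the part IDs present in the image together with their pixel counts.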

4. Dataset Statistics and Analysis

Per-Class and Per-Part Analysis

Each synset: mean ≈152 images and ≈709 part instances (24,095 images and 111,960 part instances over 158 classes)
Rigid object classes generally have fewer distinct part types than non-rigid counterparts but similar total annotation density
The average object occupies ~40% of the image area (σ ≈ 15%). Part area distribution is log-normal: large parts (e.g., body) account for ~70% of foreground pixels; small parts (e.g., fins) often <5%.
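
The area statistics quoted above can be derived from per-part pixel counts of a single mask. This is a sketch under the assumption that ID 0 is background, as in the mask format described earlier; the function name and the sample counts are illustrative.

```python
def area_statistics(pixel_counts):
    """Compute object area fraction and per-part foreground shares.

    pixel_counts maps part ID -> pixel count, with part ID 0 = background.
    Returns (object_fraction, part_shares), where object_fraction is the
    foreground area over the image area and part_shares maps each part ID
    to its share of the foreground pixels.
    """
    total = sum(pixel_counts.values())
    foreground = total - pixel_counts.get(0, 0)
    if total == 0 or foreground == 0:
        raise ValueError("mask has no foreground pixels")
    shares = {pid: n / foreground for pid, n in pixel_counts.items() if pid != 0}
    return foreground / total, shares


if __name__ == "__main__":
    # Hypothetical counts for one image: background plus four parts.
    counts = {0: 600, 1: 40, 2: 280, 3: 60, 4: 20}
    print(area_statistics(counts))
```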

Balance and Diversity

PartImageNet has a more gradual per-class instance tail than PASCAL-Part; annotation diversity captures articulated, deformable (limbs, wings), and rigid subparts (vehicle tires, mirrors). Notable challenges include non-rigid poses, occlusions, and fine-grained small parts.

Example: Quadruped Part Ontology

For the quadruped subset (46 species: tiger, lion, dog, etc.), the part-label ontology is consistent: {head, torso, leg, tail, background}. This structure supports transfer learning and domain adaptation benchmarking, as in Syn-to-Real adaptation scenarios (Peng et al., 2023).

5. Baseline Experiments

Semantic Part Segmentation

Three models—Semantic FPN (ResNet-50), DeepLabV3+ (ResNet-50), and SegFormer (MiT-B2)—were trained with 512×512 crops. mIoU (mean Intersection-over-Union) across all part classes:

Semantic FPN: val 56.8%, test 54.6%
DeepLabV3+: val 60.6%, test 58.7%
SegFormer: val 62.0%, test 61.5%

Failures are concentrated in imprecise boundary localization, confusion among similar parts (e.g., limbs), and small part omission.
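
For reference, mIoU can be computed from a confusion matrix as in the sketch below. Whether the reported numbers include the background class, and whether statistics are accumulated over the whole split or per image, is not stated here, so this follows the common dataset-level convention as an assumption.

```python
def mean_iou(confusion):
    """Mean Intersection-over-Union from a square confusion matrix.

    confusion[t][p] is the number of pixels with true class t that were
    predicted as class p.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp     # wrongly predicted as c
        denom = tp + fp + fn
        if denom > 0:        # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    print(mean_iou([[8, 2], [1, 9]]))  # toy two-class example
```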

Whole-Object Segmentation

Aggregating part masks produces an object segmentation benchmark, with fine-grained mIoU on the test set:

Semantic FPN: 60.1%
DeepLabV3+: 64.0%
SegFormer: 71.1%

Adding deep supervision, i.e., injecting part-mask objectives at intermediate encoder stages (e.g., stage 4 in DeepLabV3+), increases object mIoU by 1.05%.
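
The aggregation of part masks into object masks described in this subsection can be sketched as follows. It assumes single-object images, so every non-background pixel takes the image's object label; the function name and the `object_label` parameter are illustrative.

```python
def parts_to_object_mask(part_mask, object_label=1):
    """Collapse a part-ID mask into an object mask.

    part_mask is a 2-D list of part IDs (0 = background, 1..K = parts).
    Every non-background pixel is relabeled with object_label.
    """
    return [[object_label if pid != 0 else 0 for pid in row] for row in part_mask]


if __name__ == "__main__":
    toy = [[0, 1, 1],
           [0, 2, 3]]  # hypothetical 2x3 part mask
    print(parts_to_object_mask(toy))
```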

Few-Shot Learning

PartImageNet supports a 5-way few-shot protocol (84×84 inputs). Reported 1-shot/5-shot results:

MAML: 46.9% / 58.1%
ProtoNet: 50.0% / 65.4%
RFS: 66.8% / 81.7%
Meta-Baseline: 68.0% / 82.7%
COMPAS (no parts): 67.1% / 82.3%
DeepEMD (no parts): 67.3% / 82.7%
COMPAS+parts: 68.0% (+0.9) / 82.9% (+0.6)
DeepEMD+parts: 68.5% (+1.2) / 83.6% (+0.9)

Supervised part cues offer measurable few-shot gains (He et al., 2021).
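
Episodes for the 5-way protocol can be sampled as in the sketch below. The number of query images per class (15 here) and the function name are assumptions; the source specifies only the 5-way, 1-shot and 5-shot settings.

```python
import random


def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=15, seed=None):
    """Sample one N-way K-shot episode.

    images_by_class maps class name -> list of image IDs for one class split.
    Returns (support, query), each a list of (image_id, episode_label) pairs.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw support and query images together so they never overlap.
        picked = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in picked[:k_shot]]
        query += [(img, label) for img in picked[k_shot:]]
    return support, query


if __name__ == "__main__":
    # Hypothetical split: 8 classes with 30 images each.
    data = {f"c{i}": [f"c{i}_{j}" for j in range(30)] for i in range(8)}
    support, query = sample_episode(data, k_shot=5, seed=0)
    print(len(support), len(query))
```

Because the few-shot split uses non-overlapping synsets, test episodes would be drawn only from the 30 test classes.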

Syn-to-Real Transfer

PartImageNet underpins adaptation studies from synthetic animal parts (e.g., SAP) to real quadruped segmentation. In unsupervised domain adaptation (UDA) from synthetic to real images, mIoU across quadruped part labels improved from 42.21% to 52.58% with advanced UDA methods such as SePiCo, and CB-FDM yielded further gains; see Peng et al. (2023) for the table of part-wise IoU.

6. Applications, Limitations, and Future Directions

Applications

Training and benchmarking part-aware architectures for semantic and instance segmentation
Assessing part grouping’s contribution to panoptic segmentation
Integrating part hierarchies in few-shot and transfer learning pipelines
Facilitating unsupervised part discovery via extensive ground-truth part corpus

Limitations

Mid-level granularity: parts are defined per super-category, not at the fine-grained class level; class-specific, rare, or non-standard parts (e.g., bicycle chain) are not annotated. Extending to deeper hierarchies or class-specific annotations would expand coverage.
Contextual limitation: multi-object scenes were filtered out, so the dataset lacks crowded or highly interactive environments.
Missing modalities: keypoint and bounding-box annotations are absent and could be added for tasks favoring those modalities.

A plausible implication is that PartImageNet occupies an intermediate regime between small-scale, human-focused part datasets and large-scale, generic object segmentation corpora—uniquely supporting systematic research into non-human, non-rigid, and pan-object part reasoning at scale.

7. Succession: PartImageNet++ (“PIN++”)

PartImageNet++ extends the paradigm, providing 100,000 fully part-annotated images across all 1,000 ImageNet-1K classes, with a part vocabulary of 3,308 labels and a total of over 400,000 masks (Li et al., 2026; Li et al., 2024). Annotations follow strict protocols: 3–8 semantic parts per object (sourced from Wikidata and volunteer surveys), explicit inclusion/hierarchy relations, and uniform partitioning of object foregrounds. The dataset organization enables efficient integration with Mask R-CNN and ViT-Det architectures for mask prediction, supporting a suite of downstream tasks (segmentation, recognition robustness, few-shot learning). Benchmarks show that exploiting part annotations yields notable gains in adversarial robustness, semantic accuracy, and transfer performance.

PIN++ maintains consistent annotation density and strong quality control, and it is released with code and scripts for reproducibility.

In summary, PartImageNet and its successors are central resources for the empirical study and benchmarking of part-based visual recognition, providing scale, diversity, and annotation quality previously unavailable for generic objects (He et al., 2021; Li et al., 2026; Li et al., 2024; Peng et al., 2023).

References

He et al. (2021). PartImageNet: A Large, High-Quality Dataset of Parts.

Peng et al. (2023). Learning Part Segmentation from Synthetic Animals.

Li et al. (2026). PartImageNet++ Dataset: Enhancing Visual Models with High-Quality Part Annotations.

Li et al. (2024). PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition.
