
Multi-Attribute Description Datasets

Updated 23 February 2026
  • Multi-attribute description datasets are structured resources that annotate entities with diverse categorical, binary, ordinal, or continuous attributes across images, videos, text, and mixed modalities.
  • They combine manual and automated annotation methods—often supported by AI-driven validation—to achieve high-quality, multi-label annotations with rigorous error controls and metric-based evaluations.
  • These datasets enable fine-grained tasks such as multi-label classification, conditional generation, and compositional recognition in domains like e-commerce, biomedical imaging, and video analysis.

A multi-attribute description dataset is a structured resource in which entities—spanning images, videos, text, or mixed modalities—are richly annotated with multiple, often interdependent, attribute categories. Each entity instance may have several attribute labels (or attribute–value pairs), potentially spanning diverse taxonomies: categorical (e.g., “color”), binary (“hat: yes/no”), ordinal, or even continuous-valued (e.g., “aesthetic score”). Such datasets are foundational for multi-label classification, attribute extraction, conditional generation, and open-vocabulary recognition, enabling fine-grained reasoning beyond single-class assignment and supporting compositional semantic tasks. The following survey situates canonical and state-of-the-art multi-attribute description datasets, detailing their taxonomies, annotation methodologies, dataset organization, and downstream impact.

1. Core Taxonomic Structures

Multi-attribute datasets are defined by the breadth, granularity, and interrelationships of their attribute schemas. For example, iFashion-Attribute employs an 8-group, 228-attribute taxonomy, spanning Category (105 classes), Color (21), Pattern (28), Neckline, Sleeve, Style, Gender, and Material—each image annotated with an average of ≈5.8–8.0 attributes (Guo et al., 2019). In contrast, Objects365-Attr extends the Objects365 detection corpus to include 39 subcategories grouped under Color, Material, State, Texture, and Tone, ensuring fine-grained box-level attribute coverage across 364 object classes and 1.4M bounding boxes (Qi et al., 2024).
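A grouped schema of this kind can be sketched as a mapping from attribute group to label inventory. The group names and the Category/Color/Pattern counts below follow iFashion-Attribute as described above; the remaining per-group counts and the example record are hypothetical, chosen only so the groups sum to 228.

```python
# Hypothetical sketch of a grouped attribute taxonomy in the style of
# iFashion-Attribute: disjoint groups, multiple labels per image.
TAXONOMY = {
    "Category": 105, "Color": 21, "Pattern": 28,   # counts from the paper
    "Neckline": 12, "Sleeve": 10, "Style": 25,     # illustrative counts
    "Gender": 3, "Material": 24,                   # illustrative counts
}  # group -> number of labels in that group

total_labels = sum(TAXONOMY.values())
print(total_labels)  # 228 by construction here

# One annotated instance: labels drawn from several groups at once.
example_record = {
    "image_id": "img_000001",        # hypothetical identifier
    "attributes": {                  # group -> chosen label(s)
        "Category": ["dress"],
        "Color": ["red", "white"],
        "Pattern": ["floral"],
        "Sleeve": ["short-sleeve"],
    },
}
n_attrs = sum(len(v) for v in example_record["attributes"].values())
```

A record like this carries five attribute labels, in line with the reported per-image average of roughly 5.8–8.0.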

In the attribute extraction domain, resources such as MAE (IV et al., 2017) and MAVE (Yang et al., 2021) offer open-vocabulary schemas: MAE includes ≈2,100 attribute types and ≈23,600 values, with free-text keys per record; MAVE explicitly defines 705 attributes spanning 1,257 product categories, yielding extensive category–attribute pairings (≈2,535 observed). Video-based datasets such as VideoAVE enumerate 172 unique e-commerce attributes over 14 domains, while ImplicitAVE targets 25 attributes across 5 domains (clothing, footwear, jewelry, food, home product) and focuses exclusively on implicit attribute value inference (Cheng et al., 15 Aug 2025, Zou et al., 2024).

Annotation schemas also span multi-label binary (e.g., presence/absence), multi-class, hierarchical, and multi-source value extraction (free-form, BIO tagging, continuous regression), reflecting the growing complexity and specificity of modern multi-attribute corpora.
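For the BIO-tagging style of value extraction mentioned above, a minimal sketch (with an invented product title and illustrative tags, not drawn from any of the cited corpora) looks like this:

```python
# Minimal sketch of span-level attribute-value extraction with BIO tags:
# one tag sequence per target attribute, O = outside any value span.
tokens = ["Long", "sleeve", "cotton", "shirt", "in", "navy", "blue"]
bio = {
    "Sleeve":   ["B", "I", "O", "O", "O", "O", "O"],
    "Material": ["O", "O", "B", "O", "O", "O", "O"],
    "Color":    ["O", "O", "O", "O", "O", "B", "I"],
}

def decode_spans(tags, tokens):
    """Recover (start, end, text) value spans from a BIO sequence."""
    spans, start = [], None
    for i, t in enumerate(tags + ["O"]):   # sentinel closes an open span
        if t == "B":
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif t == "O" and start is not None:
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    return spans

print(decode_spans(bio["Color"], tokens))  # [(5, 7, 'navy blue')]
```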

2. Annotation Methodologies and Quality Assurance

Dataset construction combines manual and automated strategies, with increasing use of human–AI hybrid annotation and adversarial validation. In iFashion-Attribute, initial attribute assignments were made by expert staff, then refined through deduplication, automated cross-referencing against product metadata, and adjudicated manual review—ensuring non-overlapping groups and a final set of 228 attributes. Validation and test images were double-annotated, with disputes resolved or filtered by a third annotator (Guo et al., 2019).

Objects365-Attr introduces a three-stage pipeline: (1) a 4,000-instance human-labeled seed dataset; (2) fine-tuning of the LLaVA-13B vision–language model and inference to predict attributes on box crops; and (3) iterative manual verification until the residual error rate falls below 2%, blending high-throughput automation with targeted human oversight (Qi et al., 2024). Similarly, MEPAVE engaged e-commerce–fluent annotators to reconcile image–text congruence and label value spans, reaching inter-annotator agreement of 92.83% (Zhu et al., 2020).

ImplicitAVE removes all explicit references to ground-truth attributes in text, forcing annotation to rely on images or indirect cues, with two-stage PhD-level and senior review yielding 86.4% raw agreement and estimated κ≈0.83 (Zou et al., 2024).

Quality control often incorporates frequency filtering, inconsistency pruning, adversarial “incongruent image” splits, and redundancy for complex domains or long-tail classes.

3. Representative Datasets: Composition and Statistics

Selected exemplars illustrate the diversity and depth of available multi-attribute resources:

Dataset              Entities           Attribute schema              #Instances                  Modality
iFashion-Attribute   Fashion images     8 groups, 228 labels          1,062,550 images            Image
Objects365-Attr      Detected objects   5 groups, 39 labels           1.4M boxes, 5.6M labels     Image (bboxes)
MALS                 Persons            27 binary attributes          1,510,330 pairs             Image–text
MAE                  Products           ~2,100 attributes             2.2M records                Text + image
MAVE                 Products           705 attrs, 1,257 categories   2.2M records                Multi-source text
VideoAVE             Products           172 attrs, 14 domains         248,800 video–title pairs   Video + title
AMD-A                General images     3 low-level attributes        16,924 images               Image
WBCAtt               Cell microscopy    11 morphological attributes   10,298 images               Image
ImplicitAVE          E-commerce items   25 attrs, 5 domains           70,214 pairs                Image–text

Attribute distributions are typically long-tailed: in iFashion-Attribute, per-attribute image counts range from under 500 (31 classes) to over 10,000 (88 classes); MAVE has “head” attributes with millions of examples and a long tail with only hundreds. In the video setting, VideoAVE reports 18–71 attributes per domain, with a mean of 3.43 attributes per video–title pair, while AMD-A measures continuous attribute distributions for light, color, and composition (means ≈0.54, standard deviations ≈0.13–0.15).
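The head/tail skew described above can be made concrete by counting examples per attribute and bucketing by frequency; the Zipf-like counts below are synthetic, not taken from any of the cited datasets.

```python
# Sketch of a long-tailed attribute distribution: synthetic Zipf-like
# counts for 200 hypothetical attributes, bucketed into head and tail.
counts = {f"attr_{i}": int(20000 / (i + 1)) for i in range(200)}

head = [a for a, c in counts.items() if c > 10000]   # frequent "head" classes
tail = [a for a, c in counts.items() if c < 500]     # rare "tail" classes
print(len(head), len(tail))                          # 1 160
```

With such skew, uniform sampling overfits the head; this is why the quality-control practices above include frequency filtering and redundant annotation for long-tail classes.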

4. Evaluation Protocols, Metrics, and Benchmarks

Evaluation is tightly coupled to the multi-label structure. Canonical metrics include:

  • Micro-Precision, Recall, F₁: Aggregated over all labels or images (as in iFashion-Attribute: top-8 predictions per image) (Guo et al., 2019).
  • Mean Average Precision (mAP): Per-attribute average for retrieval/recognition tasks (iFashion-Attribute, MALS).
  • Instance- and Label-Based Metrics: Mean Accuracy (mA), instance-level F₁, and balanced accuracy (used in PETA, RAP, PA-100K for pedestrian attributes) (Tang et al., 2019).
  • Span-Level Extraction: For value extraction, F₁ is computed over extracted vs. gold spans (MAVE, MEPAVE).
  • Open-Set and Zero-Shot Recalls: Especially for MAVE (zero-shot attributes) (Yang et al., 2021), and for generative/implicit AVE benchmarks (e.g., ImplicitAVE (Zou et al., 2024)).
  • Temporal–Fuzzy Matching: In VideoAVE, attribute-level F₁ is based on fuzzy substring match for open value extraction (Cheng et al., 15 Aug 2025).
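The micro-averaged metrics above pool true positives over all images before computing precision and recall. A minimal sketch under a fixed top-k protocol (label sets below are illustrative, not from iFashion-Attribute):

```python
# Sketch of micro-averaged precision/recall/F1 over per-image label sets,
# as used with fixed top-k prediction protocols.
def micro_prf(predictions, gold):
    """predictions/gold: parallel lists of label sets, one per image."""
    tp = sum(len(p & g) for p, g in zip(predictions, gold))
    n_pred = sum(len(p) for p in predictions)   # all predicted labels
    n_gold = sum(len(g) for g in gold)          # all gold labels
    precision = tp / n_pred
    recall = tp / n_gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

preds = [{"red", "dress", "floral"}, {"blue", "shirt"}]
gold  = [{"red", "dress"},           {"blue", "shirt", "cotton"}]
p, r, f1 = micro_prf(preds, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.8 0.8
```

Micro averaging weights every label equally, so head attributes dominate; the per-attribute mAP listed above is the usual counterweight for long-tailed schemas.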

Standardized train/val/test splits, fixed top-k prediction protocols, and cross-dataset transfer (e.g., MALS–APTM pre-training boosting Recall@1 on multiple PEDES corpora (Yang et al., 2023)) are key for reproducible benchmarks.

5. Domain-Specific Variants and Modalities

Beyond conventional images and text, multi-attribute description datasets have expanded into new modalities and settings:

  • Video: VideoAVE introduces extraction from product demonstration videos, using Mixture-of-Experts filtering and evaluating on both attribute-conditioned and open-pair settings (Cheng et al., 15 Aug 2025).
  • Aesthetics: AMD-A offers continuous scoring over three compositional attributes and overall aesthetic value, supporting both regression and classification (Jin et al., 2022).
  • Biomedical: WBCAtt annotates 10K+ blood cell images with 11 morphological features, supporting hybrid CNN/ViT models and achieving >94% global average accuracy (Houmaidi et al., 30 Sep 2025).
  • Implicit Reasoning: ImplicitAVE measures model ability to infer attributes not found in text, with high-precision human validation to ensure that ground-truth is only accessible via visual or indirect cues (Zou et al., 2024).
  • Synthetic Data: The MALS corpus uses photorealistic diffusion-generated images and auto-extracted attributes to achieve privacy, scale, and new pre-training opportunities (Yang et al., 2023).

6. Impact, Applications, and Open Research Challenges

Multi-attribute datasets underpin progress in open-vocabulary detection, fine-grained recognition, compositional generation, and multi-modal attribute extraction:

  • Objects365-Attr demonstrates that integrating multi-attribute annotations improves rare-class recall (ΔAPr up to +3.1), open-vocabulary detection mAP, and interpretability/grounding in referential tasks (Qi et al., 2024).
  • MALS and APTM show that multi-attribute pre-training delivers significant gains in cross-domain retrieval and text-image matching (Yang et al., 2023).
  • In e-commerce, datasets such as MAE, MEPAVE, MAVE promote structured value extraction, product QA, and downstream tasks like recommendation and search (IV et al., 2017, Zhu et al., 2020, Yang et al., 2021).
  • Continuous attribute regression (e.g., in AMD-A) enables nuanced aesthetic assessment and attribute-driven ranking (Jin et al., 2022).
  • Biomedical annotation automation (AttriGen) offers order-of-magnitude annotation acceleration with minimal performance degradation (Houmaidi et al., 30 Sep 2025).

Current challenges include head-tail skew in attribute distributions, zero-shot generalization for new attributes, robust multi-modal and video fusion, reliable implicit inference, and semantic schema alignment across cultures and annotation layers. Best practices emphasize balanced taxonomies, rigorous QC pipelines, periodic attribute set revision, and strong baseline benchmarks.

7. Future Directions

Extending multi-attribute description datasets to encompass more domains (medical, industrial, scientific), richer attribute vocabularies (including verbs/adjectives, composites), greater label interdependency modeling, and multi-scene/multi-entity compositions remains an open challenge. Automated pipeline enhancements (e.g., active/human-in-the-loop annotation, domain-adaptive augmentation), zero-shot/meta-learning regimes, and knowledge-driven fusion with large vision–language models are promising research directions. Maintaining open-source releases, public benchmarks, and reproducible scripts (as in Objects365-Attr, MAVE, WBCAtt, VideoAVE, ImplicitAVE) is essential for the continued progress and comparative empirical rigor of the multi-attribute learning community.


