Fashion-Gen: High-Res Fashion Dataset

Updated 25 November 2025
  • Fashion-Gen is a large-scale, high-resolution dataset of 325,536 studio-quality images, 293,008 of which are paired with detailed stylist captions.
  • It supports both conditional and unconditional generative modeling with standardized benchmarks and multi-view photography for robust evaluation.
  • The dataset's rich metadata, hierarchical categorization, and expert annotations enable advanced research in text-to-image synthesis and multi-view garment modeling.

Fashion-Gen is a large-scale, high-resolution dataset and benchmarking infrastructure designed for conditional and unconditional generative modeling in the fashion domain. As introduced by Rostamzadeh et al., it comprises 325,536 studio-quality images—including 293,008 images paired with detailed paragraph-level captions by professional stylists—alongside rich meta-annotations, multi-view product photography, and a standardized challenge framework for text-to-image synthesis evaluation (Rostamzadeh et al., 2018). The dataset and associated challenge target text-conditioned image generation, multi-view modeling, and robust benchmarking under scalable human and automated evaluation protocols.

1. Dataset Composition and Annotation Structure

Fashion-Gen contains 325,536 images at 1360×1360 resolution, divided into 260,480 training, 32,528 validation, and 32,528 test images. Of these, 293,008 images are paired with stylist-written captions (260,480 training and 32,528 validation). Each product (“item”) is photographed from one to six canonical, consistently lit studio angles, without automatic cropping or resizing, on a uniform white background.

A two-level category hierarchy is employed, consisting of 48 main categories (e.g., “Tops,” “Shoes,” “Bags”) and 121 fine-grained subcategories (e.g., “Crewnecks,” “Sandals,” “Messenger Bags”). Each item is annotated with designer/brand, fashion season, and stylist-composed recommended matches (complementary items). Color attributes are automatically parsed from the textual descriptions (e.g., “navy,” “fuchsia,” “cream”). No pixel-level segmentations or bounding boxes are provided.

Stylist captions are written in paragraph form (typically 10–100 words, peaking at 30–50 words) and enumerate detailed garment descriptors—cut, fabric, fit, closures, pockets, stitching, color, and more. The authors recommend lowercasing, NLTK word tokenization, and stop-word removal during preprocessing.
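
A minimal preprocessing sketch following these recommendations, using NLTK; the paper does not prescribe an exact pipeline, so the filtering of punctuation-only tokens here is an added assumption:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time resource downloads (newer NLTK versions may also need "punkt_tab").
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def preprocess_caption(caption: str) -> list[str]:
    """Lowercase, word-tokenize, and drop stop words, per the authors'
    preprocessing recommendation. Punctuation-only tokens are also removed
    (an assumption, not a stated requirement)."""
    tokens = word_tokenize(caption.lower())
    return [t for t in tokens
            if t not in STOP_WORDS and any(c.isalpha() for c in t)]

print(preprocess_caption(
    "Long-sleeve cotton crewneck in navy with ribbed cuffs and hem."))
# e.g. ['long-sleeve', 'cotton', 'crewneck', 'navy', 'ribbed', 'cuffs', 'hem']
```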

Dataset releases are provided per split (train/validation/test), with each archive containing JPEG image folders and a mapping file (JSON/CSV) linking image IDs to textual and categorical metadata. Test-set imagery and captions remain private for the challenge.
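
The exact schema of the mapping files is defined by the release itself; the loader below is a hypothetical sketch assuming a JSON mapping with image_id, caption, category, and subcategory fields (all field names and the directory layout are illustrative):

```python
import json
from pathlib import Path

def load_split(root: str, split: str):
    """Pair each image path with its caption and category metadata.

    Field names (image_id, caption, category, subcategory) and the
    directory layout are illustrative assumptions; check the mapping
    file shipped with each split for the actual schema.
    """
    base = Path(root) / split
    with open(base / f"{split}_map.json") as f:
        records = json.load(f)
    return [
        (base / "images" / f"{rec['image_id']}.jpg",
         rec.get("caption"),       # absent for uncaptioned items
         rec.get("category"),      # one of the 48 main categories
         rec.get("subcategory"))   # one of the 121 fine-grained labels
        for rec in records
    ]

# pairs = load_split("fashion-gen", "train")   # 260,480 training images
```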

2. Baseline Generative Modeling Benchmarks

Two families of generative baselines are established, one unconditional and one text-conditional:

A. Unconditional High-Resolution Image Generation (Progressive GANs)

  • Architecture: Progressive Growing of GANs (P-GAN), following Karras et al. (2017), with symmetric layer addition in the generator and discriminator from 4×4 to 1024×1024 (outputs downsampled to 256×256 for evaluation). The default TensorFlow code and hyperparameters are used.
  • Input: random noise (no text).
  • Evaluation: Inception Score (IS) on 50,000 samples downsampled to 256×256. Real data: 9.71 ± 2.14; P-GAN samples: 7.91 ± 0.15 (see the metric sketch after this list).
  • Qualitative output: strong global coherence in shape, color, pose.
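
These figures use the standard Inception Score, IS = exp(E_x[KL(p(y|x) ‖ p(y))]). A minimal NumPy sketch, assuming the class posteriors have already been computed by an Inception network (the challenge uses its own fixed model, so exact values will differ):

```python
import numpy as np

def inception_score(probs: np.ndarray, n_splits: int = 10) -> tuple[float, float]:
    """Inception Score from class posteriors p(y|x).

    ``probs`` is an (N, C) array of softmax outputs over N generated
    samples (N = 50,000 in the benchmark). Returns mean and standard
    deviation over ``n_splits`` splits, matching the usual mean ± std
    reporting convention.
    """
    scores = []
    for chunk in np.array_split(probs, n_splits):
        p_y = chunk.mean(axis=0, keepdims=True)        # marginal p(y)
        kl = chunk * (np.log(chunk + 1e-12) - np.log(p_y + 1e-12))
        scores.append(np.exp(kl.sum(axis=1).mean()))   # exp E[KL(p(y|x) || p(y))]
    return float(np.mean(scores)), float(np.std(scores))
```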

B. Text-to-Image Synthesis

  • Two models: StackGAN-v1 and StackGAN-v2 (Zhang et al., 2017a/b).

    1. StackGAN-v1: Stage-I generates a 64×64 image from text + noise; Stage-II produces a 256×256 refinement. Training: 80 epochs (Stage-I), 185 epochs (Stage-II).
    2. StackGAN-v2: Joint cascade for 64×64, 128×128, and 256×256 outputs, using the default PyTorch release configuration.
  • Text encoders: averaged word2vec, a Transformer-based encoder, char-CNN-RNN (Reed et al.), and a bidirectional LSTM were explored. The bi-LSTM (last hidden state, size 1024, over the first 15 tokens) achieves the best category-prediction accuracy and is used for GAN conditioning (a sketch follows this list).

  • Evaluation: Inception Score on 50,000 validation images. StackGAN-v1: 6.50 ± 0.05; StackGAN-v2: 5.54 ± 0.07 (mode collapse noted).
  • Qualitative observations: facial details are characteristically blurry or missing (the networks attend to clothing). StackGAN-v2 is susceptible to mode collapse under default settings.
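
A PyTorch sketch of the bi-LSTM conditioning encoder described above; the vocabulary size and embedding dimension are illustrative choices, not values from the paper:

```python
import torch
import torch.nn as nn

class CaptionEncoder(nn.Module):
    """Bidirectional LSTM caption encoder pretrained on category
    classification: the last hidden state (both directions concatenated,
    size 1024) over the first 15 tokens conditions the GAN."""

    def __init__(self, vocab_size: int = 30000, embed_dim: int = 300,
                 hidden_dim: int = 512, n_categories: int = 48):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # hidden_dim=512 per direction -> 1024-dim sentence embedding.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, n_categories)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, 15) integer ids for the first 15 caption tokens.
        _, (h_n, _) = self.lstm(self.embed(tokens))
        sent = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 1024)
        return sent, self.classifier(sent)           # embedding + category logits
```

Pretrain with cross-entropy on the 48 main categories, then freeze the encoder and feed the sentence embedding to the GAN's conditioning pathway.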

3. Metadata, Category Structure, and Release Format

Each item is annotated not only with its primary description and categories but also with designer/brand, season, algorithmically extracted color, and recommended stylistic matches, which establish explicit inter-item relations. The hierarchical label taxonomy enables both coarse (e.g., “Dresses”) and fine-grained (e.g., “Cocktail Dresses”) conditional generation.

Data is distributed as per-split archives: JPEG images organized by item, with structured mapping files (JSON/CSV) connecting images, captions, semantic categories, and all available metadata. All public data is available through https://fashion-gen.com/. Test split ground truth remains undisclosed, reserved for challenge leaderboard evaluation.

4. The Fashion-Gen Challenge: Benchmarking and Evaluation

The Fashion-Gen challenge, launched at the ECCV Fashion, Art & Design workshop, targets text-to-image synthesis at up to 1360×1360 resolution as its primary task; unconditional generation and the use of multi-view imagery or metadata are encouraged as auxiliary benchmarks.

Participants submit code as Docker containers following a standard template. At evaluation time, the platform provides the withheld test captions; participant code must generate the corresponding images for automated and human evaluation. A hypothetical sketch of this interface follows.
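
The actual container template is defined by the challenge organizers and is not reproduced here; the entrypoint below only illustrates the general contract, under the assumptions that captions arrive as a JSON file of id-to-text pairs and that one image per caption is written to an output directory (all paths and names hypothetical):

```python
import json
import sys
from pathlib import Path

from PIL import Image

def generate_image(caption: str) -> Image.Image:
    """Placeholder for the participant's text-to-image model; returns a
    blank canvas at the dataset's native resolution."""
    return Image.new("RGB", (1360, 1360), "white")

def main(captions_file: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(captions_file) as f:
        captions = json.load(f)        # assumed: {caption_id: caption_text}
    for cid, text in captions.items():
        generate_image(text).save(out / f"{cid}.png")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```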

Leaderboard updates are determined by Inception Score, computed by a fixed Inception model trained on the train+val splits. Final ranking follows a two-pass protocol: first by Inception Score, then by human annotation in case of ties. For human evaluation, each of N fixed test descriptions is associated with five images (one per top-5 submission) and ranked by ten independent annotators; the aggregated result defines the winner (one possible aggregation is sketched below).
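
The aggregation rule for the human pass is not spelled out; averaging ranks per submission is one simple choice, sketched here with shapes following the protocol (N descriptions, ten annotators, five submissions):

```python
import numpy as np

def aggregate_ranks(ranks: np.ndarray) -> np.ndarray:
    """Tie-break by human ranking. ``ranks`` has shape
    (n_descriptions, n_annotators, 5), where ranks[d, a, s] is the rank
    (1 = best) that annotator ``a`` gave submission ``s`` on description
    ``d``. Mean rank per submission is an assumed aggregation rule, not
    one stated in the source."""
    return ranks.mean(axis=(0, 1))   # lower mean rank = better

# winner = int(np.argmin(aggregate_ranks(ranks)))
```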

No monetary incentive is specified. Winners receive recognition through workshop presentations and publication opportunities.

5. Technical Challenges, Limitations, and Recommendations

Key strengths of Fashion-Gen include its unprecedented scale in paired text-image data, studio-quality, multi-view photography, and breadth of expert-annotated metadata. This enables new research lines in text-to-image generation, multi-view and 3D garment reconstruction, and controllable attribute synthesis.

Limitations of the dataset are:

  • No pixel-level segmentation or pose keypoints; pose is inferred only via known camera angles.
  • Text captions emphasize detailed garment descriptors but provide minimal facial or body attribute annotation.
  • Inception Score is inadequate as a standalone text–image alignment evaluation, especially in fashion.
  • StackGAN-v2 demonstrates mode collapse, despite dataset cleanliness and scale.

Recommended research strategies include:

  • Pre-train and freeze robust text encoders (e.g. bi-LSTM, Transformer) on category classification before joint GAN training.
  • Employ multi-view consistency losses (e.g., cycle-consistency across the available canonical views; a sketch follows this list).
  • Integrate explicit pose/landmark information to improve garment-to-model alignment.
  • Investigate attention-based generative frameworks (e.g. AttnGAN) for token–region alignment.
  • Explore retrieval-augmented generative approaches, leveraging nearest-neighbor patches for enhanced texture fidelity.
  • Incorporate human-in-the-loop evaluation early, mitigating potential misalignment undetected by Inception Score.
  • Explore progressive-growing architectures for text-conditioned high-resolution generation, synthesizing the advantages of P-GAN and StackGAN frameworks.
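
One way to realize the multi-view consistency suggestion above, sketched in PyTorch under the assumption that view-translation generators (here g_ab and g_ba) are trained alongside the usual adversarial losses; this is illustrative, not part of the Fashion-Gen baselines:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cycle_consistency_loss(g_ab: nn.Module, g_ba: nn.Module,
                           view_a: torch.Tensor,
                           view_b: torch.Tensor) -> torch.Tensor:
    """Cycle-consistency between two canonical studio views of the same
    item: translating A -> B -> A (and B -> A -> B) should reconstruct
    the input view. Added to the adversarial objective, this encourages
    generated garments to agree across views."""
    loss_a = F.l1_loss(g_ba(g_ab(view_a)), view_a)
    loss_b = F.l1_loss(g_ab(g_ba(view_b)), view_b)
    return loss_a + loss_b
```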

6. Research Applications and Prospects

Fashion-Gen’s scale, annotation quality, and release format enable a spectrum of research directions: high-resolution text-to-image synthesis, attribute-guided clothing generation, multi-view and 3D garment modeling, and benchmarking of robust, scalable image generation architectures.

The challenge infrastructure—comprising containerized evaluation, documented baselines, and systematic human+automated assessment—establishes a reproducible platform for method comparison, facilitating progress in both conditional and unconditional generative modeling. The dataset is positioned as an open invitation to improve upon initial StackGAN and P-GAN benchmarks and to leverage comprehensive metadata for novel multi-modal and multi-task objectives (Rostamzadeh et al., 2018).

References

  1. Rostamzadeh, N., Hosseini, S., Boquet, T., Stokowiec, W., Zhang, Y., Jauvin, C., & Pal, C. (2018). Fashion-Gen: The Generative Fashion Dataset and Challenge. arXiv:1806.08317.