MegaStyle-1.4M: Large-Scale Style Dataset
- MegaStyle-1.4M is a large-scale, high-fidelity style dataset with 1.4M images organized into 170K fine-grained groups, ensuring both intra-style consistency and inter-style diversity.
- It employs an automated curation pipeline and two-stage deduplication with hierarchical clustering to balance a long-tail distribution of natural and artistic styles.
- The dataset underpins advanced models like MegaStyle-Encoder and MegaStyle-FLUX, setting new evaluation benchmarks in style retrieval and style transfer research.
MegaStyle-1.4M is a large-scale, high-fidelity style dataset constructed via an automated curation pipeline designed to maximize intra-style consistency and inter-style diversity in image stylization tasks. It enables style-supervised contrastive learning for representation extraction as well as training of generalizable, style-accurate transfer models. The dataset consists of 1.4 million images organized under 170,000 fine-grained style descriptions. Style and content prompts are decoupled using a vision-language model, and the stylized images are then synthesized with a diffusion-based text-to-image generator. MegaStyle-1.4M establishes new evaluation standards for style retrieval and transfer, and underpins two novel models: the MegaStyle-Encoder and MegaStyle-FLUX.
1. Dataset Construction and Design Principles
The primary goals for MegaStyle-1.4M are threefold: ensuring images sharing a style description exhibit high intra-style consistency, maximizing coverage of the long-tail distribution of natural and artistic styles (inter-style diversity), and maintaining photorealistic, artifact-free image quality (Gao et al., 9 Apr 2026).
Image Pool Construction
- Style-pool (approximately 2 million images):
  - 1 million deduplicated images from JourneyDB (Midjourney-derived).
  - 80,000 paintings from WikiArt.
  - 1 million synthetically stylized images filtered from LAION-Aesthetics using WikiArt tags.
- Content-pool (2 million images):
  - Drawn from LAION-Aesthetics and explicitly filtered to exclude style-pool overlaps.
Prompt Generation and Balancing
Prompts are generated using Qwen3-VL (30B) under specialized instructions:
- Style prompts (max 32 words): Describe overall artistic style, color composition, lighting, medium, surface texture, and brushwork.
- Content prompts (max 64 words): Focus strictly on depicted objects and their visual relations, excluding all stylistic terms.
A two-stage deduplication and balance process, combining exact/fuzzy/semantic deduplication with hierarchical k-means (using mpnet embeddings), yields:
- 170,000 unique style prompts.
- 400,000 unique content prompts.
Long-tail style coverage is achieved: the prompts span 8,355 overall style descriptors, and the 30 most frequent styles account for only about 33% of occurrences.
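The exact and fuzzy deduplication stages can be illustrated with a minimal sketch. It assumes prompts are plain strings and uses character-level similarity from the standard library as a stand-in for whatever fuzzy matcher the pipeline actually uses; the function names and the 0.9 threshold are hypothetical. The semantic deduplication and hierarchical k-means balancing stages need sentence embeddings and are not shown.

```python
import re
from difflib import SequenceMatcher


def normalize(prompt: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", prompt.lower())
    return " ".join(cleaned.split())


def dedup_prompts(prompts, fuzzy_threshold=0.9):
    """Keep the first prompt of every exact or near-duplicate cluster.

    fuzzy_threshold is a hypothetical value, not one taken from the paper.
    """
    kept, kept_norm, seen = [], [], set()
    for prompt in prompts:
        norm = normalize(prompt)
        if norm in seen:  # exact duplicate after normalization
            continue
        # Fuzzy duplicate: high character-level similarity to a kept prompt.
        if any(SequenceMatcher(None, norm, other).ratio() >= fuzzy_threshold
               for other in kept_norm):
            continue
        seen.add(norm)
        kept.append(prompt)
        kept_norm.append(norm)
    return kept


if __name__ == "__main__":
    demo = [
        "Oil painting, thick impasto brushwork",
        "oil painting  thick impasto brushwork!",
        "Oil painting, thick impasto brushworks",
        "Flat vector art with pastel colors",
    ]
    print(dedup_prompts(demo))
```

The pairwise comparison is quadratic in the number of kept prompts, so at the scale of millions of prompts a real pipeline would more likely rely on hashing or approximate nearest-neighbor search.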
Stylized Image Generation
For each style prompt, eight distinct content prompts are sampled, generating paired content-style combinations. Qwen-Image with FlowMatch scheduler (40 steps, cfg=4.0) synthesizes stylized outputs. The final dataset consists of:
- MegaStyle-1.4M: 1.4 million images in 170,000 fine-grained style groups, each pairing one style prompt with 8 content prompts (170,000 × 8 = 1.36 million, reported as 1.4M).
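The pairing step can be sketched as follows, assuming content prompts are sampled uniformly without replacement for each style prompt (the actual sampling strategy is not specified here). The function name and seed are hypothetical, and how the two prompts are merged into the generator input is left out.

```python
import random


def build_pairs(style_prompts, content_prompts, per_style=8, seed=0):
    """Pair each style prompt with `per_style` distinct content prompts."""
    rng = random.Random(seed)
    pairs = []
    for style in style_prompts:
        # rng.sample draws without replacement, so contents are distinct per style.
        for content in rng.sample(content_prompts, per_style):
            pairs.append((style, content))
    return pairs


if __name__ == "__main__":
    styles = [f"style-{i}" for i in range(3)]
    contents = [f"content-{i}" for i in range(20)]
    print(len(build_pairs(styles, contents)))  # 3 styles x 8 contents
    print(170_000 * 8)  # image count implied by the dataset design
```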
Comparative Dataset Analysis
| Dataset | Intra-style Consistency | #Overall Styles | #Fine-grained Styles | #Style Images |
|---|---|---|---|---|
| WikiArt | No | 27 | – | 80K |
| JourneyDB | No | – | 300K | 4.4M |
| Style30K | No | – | 1K | 30K |
| IMAGStyle | Yes | 14 | 15K | 210K |
| OmniStyle-150K | Yes | – | 1K | 150K |
| MegaStyle-1.4M | Yes | 8,355 | 170K | 1.4M |
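MegaStyle-1.4M has by far the most overall styles in the comparison (8,355, versus 27 for WikiArt and 14 for IMAGStyle) and, among the datasets with intra-style consistency, the most fine-grained styles and images (170K and 1.4M, versus at most 15K and 210K).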
2. Style-Supervised Contrastive Learning and MegaStyle-Encoder
MegaStyle-Encoder is a vision encoder fine-tuned to extract style-specific representations, prioritizing stylistic attributes over content.
Contrastive Objective
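Given style–image pairs $(T_i, I_i)$, features are extracted as unit-normalized embeddings:
- $z_i = f(I_i)/\lVert f(I_i)\rVert$ (image; $f$ = SigLIP image encoder)
- $w_i = g(T_i)/\lVert g(T_i)\rVert$ (text; $g$ = SigLIP text encoder)
The Style-Supervised Contrastive Learning (SSCL) objective:
- Supervised contrastive loss (SCL), written here in its standard form:
$$\mathcal{L}_{\mathrm{SCL}} = -\frac{1}{B}\sum_{i=1}^{B}\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp(z_i^\top z_p/\tau)}{\sum_{a\in P(i)\cup N(i)}\exp(z_i^\top z_a/\tau)}$$
where $P(i)$ are indices with the same style as $i$, $N(i)$ are all others, and $\tau$ is a temperature.
- Image–text contrastive loss (ITC): computed over the image–text similarities $z_i^\top w_j$, with a label $y_{ij}$ marking whether $(I_i, T_j)$ is a true pair.
- Total objective: $\mathcal{L}_{\mathrm{SSCL}}$ combines $\mathcal{L}_{\mathrm{SCL}}$ and $\mathcal{L}_{\mathrm{ITC}}$.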
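For intuition, the supervised contrastive term can be sketched in pure Python. This follows the standard supervised contrastive formulation, which may differ in detail from the paper's implementation. It assumes the embeddings are already unit-normalized; the function name and the temperature of 0.07 are hypothetical choices, not values from the source.

```python
import math


def supcon_loss(embeddings, style_ids, temperature=0.07):
    """Standard supervised contrastive loss, averaged over valid anchors.

    embeddings: list of unit-normalized vectors (lists of floats).
    style_ids:  style group label for each embedding.
    temperature: hypothetical default, not taken from the paper.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n)
                     if p != i and style_ids[p] == style_ids[i]]
        if not positives:
            continue  # an anchor with no same-style partner contributes nothing
        logits = {
            a: sum(x * y for x, y in zip(embeddings[i], embeddings[a])) / temperature
            for a in range(n) if a != i
        }
        # Log-sum-exp over all other samples, stabilized by the max logit.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(supcon_loss(vectors, [0, 0, 1, 1]))  # styles match the clusters
    print(supcon_loss(vectors, [0, 1, 0, 1]))  # styles cut across the clusters
```

The loss is near zero when same-style embeddings coincide and other styles are orthogonal, and it grows when the style labels cut across the embedding clusters.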
Training Protocol
- Encoders: SigLIP image encoder (SoViT-400M or ViT-L); the text encoder and projector are frozen, and only the image encoder is updated.
- Hyperparameters: batch size 8,192, 30 epochs, weight decay 0.01; the learning rate and one further setting are specified in the source paper (Gao et al., 9 Apr 2026).
MegaStyle-Encoder achieves robust style discrimination and enables scalable style-based retrieval and alignment.
3. MegaStyle-FLUX: Paired-Supervision Style Transfer with FLUX
MegaStyle-FLUX adapts the FLUX text-to-image diffusion-transformer (MM-DiT) architecture for style transfer under strict paired supervision.
Model Architecture
- Inputs (conditioning):
  - Noisy image tokens (VAE-encoded target).
  - Reference style image tokens (VAE-encoded, prepended).
  - Content prompt tokens.
- Positional Encoding: A “shifted RoPE” encoding applied to style tokens prevents position collisions and content leakage across input types.
- Backbone: FLUX’s multi-modal DiT.
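One way to picture the position shift is through the 2D position ids that RoPE consumes. The sketch below is an illustrative assumption, not the documented MegaStyle-FLUX scheme: it offsets the style tokens by one grid width along the column axis so that no style token shares a position with a target token. The function names and the choice of axis are hypothetical.

```python
def grid_positions(height, width, row_offset=0, col_offset=0):
    """Row-major (row, col) position ids for a height x width token grid."""
    return [(r + row_offset, c + col_offset)
            for r in range(height) for c in range(width)]


def build_position_ids(height, width):
    """Position ids for [style tokens, target tokens], style tokens first.

    Assumption: style tokens are shifted right by `width` columns so their
    ids never coincide with those of the noisy target tokens.
    """
    style = grid_positions(height, width, col_offset=width)
    target = grid_positions(height, width)
    return style + target


if __name__ == "__main__":
    ids = build_position_ids(2, 3)
    print(ids[:6])  # shifted style positions
    print(ids[6:])  # target positions
```

Without the shift, a style token and a target token at the same grid cell would receive identical rotary phases, which is the position collision the text refers to.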
Training Dynamics
- Training data: MegaStyle-1.4M style image pairs.
- Styling pairs: For each batch, two images from the same style group are sampled—one as the reference for conditioning, the other as the stylization target.
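- Loss: Standard denoising diffusion objective:
$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\left[\lVert \epsilon - \epsilon_\theta(x_t, t, c)\rVert_2^2\right]$$
where $x_t$ is the noised target latent at timestep $t$ and the conditioning $c$ includes both style and content tokens.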
- Parameter updates: Only transformer parameters are fine-tuned; all others frozen.
- Training hyperparameters: 30,000 steps, batch size 8, LoRA rank 128; the learning rate and image size are specified in the source paper (Gao et al., 9 Apr 2026).
MegaStyle-FLUX facilitates style transfer that reliably respects both input style and semantic content.
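The reference/target pairing used for paired supervision can be sketched as a small sampler. It assumes each record carries an image id, a style group id, and the content prompt used to generate the image; the record layout, function name, and dictionary keys are hypothetical.

```python
import random
from collections import defaultdict


def sample_style_pairs(records, batch_size, seed=0):
    """Sample (reference, target) pairs that share a style group.

    records: iterable of (image_id, style_id, content_prompt) tuples.
    Returns dicts holding the reference image, the target image, and the
    target's content prompt used as text conditioning.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for image_id, style_id, content_prompt in records:
        groups[style_id].append((image_id, content_prompt))
    # A pair needs two different images, so skip single-image groups.
    eligible = [members for members in groups.values() if len(members) >= 2]
    batch = []
    for _ in range(batch_size):
        group = rng.choice(eligible)
        (ref_id, _), (target_id, target_prompt) = rng.sample(group, 2)
        batch.append({"reference": ref_id, "target": target_id,
                      "prompt": target_prompt})
    return batch


if __name__ == "__main__":
    data = [(f"img-{s}-{k}", s, f"content {k}") for s in range(3) for k in range(8)]
    for item in sample_style_pairs(data, batch_size=4):
        print(item)
```

Because the reference and target share a style but were generated from different content prompts, the model cannot solve the task by copying the reference content.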
4. Empirical Analysis and Benchmarks
MegaStyle-1.4M, alongside MegaStyle-Encoder and MegaStyle-FLUX, establishes new performance standards across intra-style consistency, diversity, retrieval, and transfer tasks.
Intra-style Consistency and Inter-style Diversity
Qualitative analyses demonstrate that all images in a given style group maintain near-identical characteristics in color, texture, and brushwork across varied content, substantiating intra-style consistency. The dataset spans 8,355 overall style descriptors and 170,000 fine-grained classes, achieving long-tail diversity.
Style Similarity Retrieval
Style retrieval benchmarks (e.g., StyleRetrieval—2,400 styles × 32 contents) employ mAP@1, mAP@10, and Recall metrics. MegaStyle-Encoder exhibits substantial gains:
| Encoder | Backbone | mAP@1 | mAP@10 | R@1/R@10 |
|---|---|---|---|---|
| CLIP | ViT-L | 9.29 | 6.46 | 9.29/31.56 |
| CSD | ViT-L | 45.60 | 37.78 | 45.60/79.18 |
| MegaStyle-Encoder | ViT-L | 87.26 | 85.98 | 87.26/97.61 |
| SigLIP | SoViT | 10.43 | 7.83 | 10.43/36.32 |
| MegaStyle-Encoder | SoViT | 88.46 | 86.77 | 88.46/97.66 |
MegaStyle-Encoder extends this lead by 10–20 points on StyleBench, FLUX-Retrieval, and OmniStyle-150K.
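The retrieval metrics can be sketched as below. Conventions for mAP@k differ between papers: this version divides by the number of relevant items found in the top k, and treats recall at k as the fraction of queries with at least one same-style item in the top k. Both choices are assumptions rather than the benchmark's documented definitions, and the function names are hypothetical.

```python
def average_precision_at_k(ranked_styles, query_style, k):
    """Mean of precision@rank over the relevant items within the top k."""
    hits, precision_sum = 0, 0.0
    for rank, style in enumerate(ranked_styles[:k], start=1):
        if style == query_style:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def hit_at_k(ranked_styles, query_style, k):
    """1.0 if any of the top-k results shares the query's style."""
    return 1.0 if query_style in ranked_styles[:k] else 0.0


def evaluate(queries, k):
    """queries: list of (query_style, ranked_styles), query image excluded."""
    n = len(queries)
    mean_ap = sum(average_precision_at_k(r, q, k) for q, r in queries) / n
    recall = sum(hit_at_k(r, q, k) for q, r in queries) / n
    return mean_ap, recall


if __name__ == "__main__":
    demo = [("a", ["a", "b", "a", "c"]), ("a", ["b", "a"])]
    print(evaluate(demo, k=1))
    print(evaluate(demo, k=3))
```

Under these conventions mAP@1 and R@1 coincide by construction, which is consistent with the identical values in the mAP@1 and R@1 columns of the table.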
Style Transfer Evaluation
Multiple baselines (StyleCrafter, DEADiff, Attn-Distill, InstantStyle, CSGO, StyleAligned, StyleShot) are evaluated on both style and text alignment (cosine similarity in MegaStyle-Encoder/CLIP space) as well as human preferences:
| Method | Style ↑ | Text ↑ | HumanStyle ↑ | HumanText ↑ |
|---|---|---|---|---|
| StyleCrafter | 48.59 | 21.39 | 3.41 | 8.87 |
| DEADiff | 51.34 | 23.13 | 3.05 | 11.13 |
| Attn-Distill | 85.59 | 20.29 | 13.97 | 6.31 |
| InstantStyle | 71.41 | 20.77 | 18.19 | 10.98 |
| CSGO | 55.02 | 23.05 | 7.34 | 16.18 |
| StyleAligned | 59.80 | 21.31 | 7.46 | 4.12 |
| StyleShot | 63.42 | 21.79 | 15.21 | 13.69 |
| MegaStyle-FLUX | 76.16 | 23.20 | 31.37 | 28.72 |
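MegaStyle-FLUX attains the highest text alignment (23.20) and the highest human preference scores for both style (31.37) and text (28.72). Its automatic style score (76.16) is second to Attn-Distill (85.59), which in turn has the lowest text alignment of all methods (20.29).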
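The automatic alignment scores reduce to a cosine similarity between two embeddings. A minimal sketch follows, assuming the embeddings have already been produced by the relevant encoder (MegaStyle-Encoder for style, CLIP for text). The scaling by 100 is an assumption made to match the magnitude of the table values, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def alignment_score(reference_embedding, output_embedding):
    """Cosine similarity multiplied by 100 (the scaling is an assumption)."""
    return 100.0 * cosine_similarity(reference_embedding, output_embedding)


if __name__ == "__main__":
    print(alignment_score([3.0, 4.0], [3.0, 4.0]))
    print(alignment_score([1.0, 0.0], [0.0, 1.0]))
```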
5. Ablation Studies and Robustness Assessments
Training Set Effects
| Training Set | Style↑ | Text↑ |
|---|---|---|
| JourneyDB | 34.56 | 21.12 |
| OmniStyle-150K | 51.49 | 23.02 |
| MegaStyle-1.4M | 76.16 | 23.20 |
Training on JourneyDB leads to poor intra-style consistency (often missing even color alignment), while OmniStyle-150K achieves color transfer but fails at higher-level stylistics. MegaStyle-1.4M enables both strong style and semantic alignment.
Encoder Generalization
On StyleBench (275 real paintings, 40 styles), FLUX-Retrieval, and OmniStyle-150K retrieval, MegaStyle-Encoder outperforms CLIP/CSD by 10–30 points in mAP/Recall, indicating robust generalization beyond the distribution of the generator (Qwen-Image).
Model Fine-tuning
Fine-tuning baseline models on MegaStyle-1.4M yields significant improvements:
- StyleShot-FLUX (trained on StyleGallery): Style 57.06, Text 21.86.
- StyleShot-FLUX-Mega (re-trained on MegaStyle-1.4M): Style 67.73, Text 23.27.
- MegaStyle-FLUX: Style 76.16, Text 23.20.
A plausible implication is that style-paired supervision from MegaStyle-1.4M is critical for style transfer models to encode both low-level and structural stylistic features.
Qualitative Outcomes
Visualizations demonstrate that MegaStyle-FLUX uniquely captures and preserves intricate style elements—color, lighting, surface texture, and brushstroke—while maintaining content alignment with prompts.
6. Significance and Implications for Style Transfer Research
MegaStyle-1.4M, through its systematic pairing of style and content and its breadth of stylization, transforms both quantitative and qualitative evaluations in style-centric computer vision. The dataset provides a rigorous foundation for style similarity retrieval and robust, generalizable style transfer. The results imply a paradigm shift away from ad hoc style curation towards large-scale, automatically balanced style datasets, and suggest that modern diffusion models and contrastive objectives are synergistic when combined with high-coverage, high-consistency data. MegaStyle-1.4M and its associated benchmarks are poised to serve as definitive standards in style transfer and retrieval research (Gao et al., 9 Apr 2026).