MegaStyle-1.4M: Large-Scale Style Dataset

Updated 3 July 2026

MegaStyle-1.4M is a large-scale, high-fidelity style dataset with 1.4M images organized into 170K fine-grained groups, ensuring both intra-style consistency and inter-style diversity.
It employs an automated curation pipeline and two-stage deduplication with hierarchical clustering to balance a long-tail distribution of natural and artistic styles.
The dataset underpins advanced models like MegaStyle-Encoder and MegaStyle-FLUX, setting new evaluation benchmarks in style retrieval and style transfer research.

MegaStyle-1.4M is a large-scale, high-fidelity style dataset constructed via an automated curation pipeline designed to maximize intra-style consistency and inter-style diversity in image stylization tasks. It enables style-supervised contrastive learning for representation extraction as well as training of generalizable, style-accurate transfer models. The dataset consists of 1.4 million images organized under 170,000 fine-grained style descriptions. It is built by decoupling style prompts from content prompts and then synthesizing a stylized image for each style-content pairing with a text-to-image diffusion model. MegaStyle-1.4M establishes new evaluation standards for style retrieval and transfer, and underpins two novel models: MegaStyle-Encoder and MegaStyle-FLUX.

1. Dataset Construction and Design Principles

The primary goals for MegaStyle-1.4M are threefold: ensuring images sharing a style description exhibit high intra-style consistency, maximizing coverage of the long-tail distribution of natural and artistic styles (inter-style diversity), and maintaining photorealistic, artifact-free image quality (Gao et al., 9 Apr 2026).

Image Pool Construction

Style-pool (2 million images):
- 1 million deduplicated images from JourneyDB (Midjourney-derived).
- 80,000 paintings from WikiArt.
- 1 million synthetically stylized images filtered from LAION-Aesthetics using WikiArt tags.
Content-pool (2 million images):
- Drawn from LAION-Aesthetics and explicitly filtered to exclude style-pool overlaps.

Prompt Generation and Balancing

Prompts are generated using Qwen3-VL (30B) under specialized instructions:

Style prompts (max 32 words): Describe overall artistic style, color composition, lighting, medium, surface texture, and brushwork.
Content prompts (max 64 words): Focus strictly on depicted objects and their visual relations, excluding all stylistic terms.

A two-stage deduplication and balance process, combining exact/fuzzy/semantic deduplication with hierarchical k-means (using mpnet embeddings), yields:

170,000 unique style prompts.
400,000 unique content prompts.

Long-tail style coverage is achieved: the prompts span 8,355 overall style descriptors, and the 30 most frequent styles account for only about 33% of occurrences.
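The exact and fuzzy deduplication passes can be illustrated with a minimal sketch. It is not the pipeline's actual code: the normalization rule, the `fuzzy_threshold` value, and the use of `difflib` are assumptions made for illustration. The semantic deduplication and hierarchical k-means balancing over mpnet embeddings are left out.

```python
import re
from difflib import SequenceMatcher


def normalize(prompt: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", prompt.lower())
    return " ".join(cleaned.split())


def dedup_prompts(prompts, fuzzy_threshold=0.9):
    """Keep the first prompt of every exact or near-duplicate cluster.

    fuzzy_threshold is a hypothetical cutoff on difflib's similarity ratio.
    """
    kept, kept_norm, seen = [], [], set()
    for prompt in prompts:
        norm = normalize(prompt)
        if norm in seen:  # pass 1: exact match after normalization
            continue
        # pass 2: fuzzy match against every prompt kept so far (quadratic cost)
        if any(SequenceMatcher(None, norm, other).ratio() >= fuzzy_threshold
               for other in kept_norm):
            continue
        seen.add(norm)
        kept.append(prompt)
        kept_norm.append(norm)
    return kept


prompts = [
    "Loose watercolor wash, muted pastel palette, soft diffuse light",
    "loose watercolor wash; muted pastel palette; soft diffuse light",
    "Loose watercolor washes, muted pastel palette, soft diffuse light",
    "Bold linocut print, high-contrast black ink, rough carved texture",
]
unique = dedup_prompts(prompts)
```

The pairwise fuzzy pass is quadratic in the number of kept prompts, so at the scale of millions of prompts an approximate method would be needed instead; the sketch only shows the logic.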

Stylized Image Generation

For each style prompt, eight distinct content prompts are sampled, generating paired content-style combinations. Qwen-Image with FlowMatch scheduler (40 steps, cfg=4.0) synthesizes stylized outputs. The final dataset consists of:

MegaStyle-1.4M: 1.4 million images, each organized into one of 170,000 fine-grained style groups × 8 content pairings.
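The pairing step and the resulting dataset size can be sketched as follows. Uniform sampling without replacement, the fixed seed, and the name `build_pairs` are assumptions; the text only states that eight distinct content prompts are sampled per style prompt.

```python
import random


def build_pairs(style_prompts, content_prompts, per_style=8, seed=0):
    """Pair every style prompt with `per_style` distinct content prompts."""
    rng = random.Random(seed)
    pairs = []
    for style in style_prompts:
        # random.sample draws without replacement, so contents are distinct
        for content in rng.sample(content_prompts, per_style):
            pairs.append((style, content))
    return pairs


# Toy-sized stand-ins for the 170,000 style and 400,000 content prompts.
styles = [f"style-{i}" for i in range(5)]
contents = [f"content-{i}" for i in range(20)]
pairs = build_pairs(styles, contents)

# Size check at full scale: 170,000 style prompts x 8 content prompts each.
full_scale_images = 170_000 * 8
```

At full scale this gives 1,360,000 style-content pairings, which is consistent with the rounded figure of 1.4 million images.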

Comparative Dataset Analysis

Dataset	Intra-style Consistency	#Overall Styles	#Fine-grained Styles	#Style Images
WikiArt	No	27	–	80K
JourneyDB	No	–	300K	4.4M
Style30K	No	–	1K	30K
IMAGStyle	Yes	14	15K	210K
OmniStyle-150K	Yes	–	1K	150K
MegaStyle-1.4M	Yes	8,355	170K	1.4M

MegaStyle-1.4M exhibits a highly granular style organization and excels in both intra-style consistency and inter-style diversity.

2. Style-Supervised Contrastive Learning and MegaStyle-Encoder

MegaStyle-Encoder is a vision encoder fine-tuned to extract style-specific representations, prioritizing stylistic attributes over content.

Contrastive Objective

Given style–image pairs $\{(x_i, s_i)\}$, features are extracted as unit-normalized embeddings:

$z_i = E_\theta(x_i)/\|E_\theta(x_i)\|_2$ (image; $E_\theta$ = SigLIP encoder)
$t_j = \phi(s_j)/\|\phi(s_j)\|_2$ (text; $\phi$ = SigLIP text encoder)

The Style-Supervised Contrastive Learning (SSCL) objective:

Supervised contrastive loss (SCL):

$\mathcal{L}_{\text{scl}} = \frac{1}{MN} \sum_i \left[ -\frac{1}{|P(i)|} \sum_{p\in P(i)} \log\frac{\exp(z_i\cdot z_p/\tau)}{ \sum_{a\in A(i)}\exp(z_i\cdot z_a/\tau)} \right]$

where $P(i)$ is the set of indices of the other samples that share the style of sample $i$, $A(i)$ is the set of all indices other than $i$, and $\tau=0.07$ is the temperature.
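A minimal pure-Python sketch of the supervised contrastive term is given below. It reads $A(i)$ as every batch index except $i$ and averages over the anchors that have at least one positive; the exact meaning of the $\frac{1}{MN}$ normalization is not defined in the text, so that averaging is an assumption. A practical implementation would use a tensor library.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def supcon_loss(embeddings, style_ids, tau=0.07):
    """Supervised contrastive loss, averaged over anchors with positives."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        others = [a for a in range(n) if a != i]  # A(i)
        positives = [p for p in others if style_ids[p] == style_ids[i]]  # P(i)
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = [dot(z[i], z[a]) / tau for a in others]
        m = max(logits)
        # log-sum-exp with the max subtracted for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        log_probs = [dot(z[i], z[p]) / tau - log_denom for p in positives]
        total += -sum(log_probs) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_grouped = supcon_loss(emb, [0, 0, 1, 1])  # labels agree with geometry
loss_mixed = supcon_loss(emb, [0, 1, 0, 1])    # labels contradict geometry
```

In this toy batch the loss is close to zero when same-style embeddings are neighbours and large when the labels contradict the geometry, which is the behaviour the objective rewards.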

Image–text contrastive loss (ITC):

$\mathcal{L}_{\text{itc}} = \frac{1}{MN^2}\sum_{i,j} \log\left(1+\exp(-y_{ij} z_i\cdot t_j)\right)$
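The pairwise sigmoid form of this loss can be sketched in pure Python. The label convention ($y_{ij}=+1$ when image $i$ and text $j$ describe the same style, $-1$ otherwise) is an assumption, since the text does not define $y_{ij}$. The sketch averages over all image-text pairs rather than applying the $\frac{1}{MN^2}$ factor, and it expects embeddings that are already unit-normalized. The original SigLIP loss also carries a learnable temperature and bias, which do not appear in the formula above and are omitted here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_sigmoid_loss(image_emb, text_emb, image_style, text_style):
    """Mean of log(1 + exp(-y_ij * z_i . t_j)) over all image-text pairs."""
    total = 0.0
    for z, s_img in zip(image_emb, image_style):
        for t, s_txt in zip(text_emb, text_style):
            y = 1.0 if s_img == s_txt else -1.0  # assumed label convention
            sim = sum(a * b for a, b in zip(z, t))
            total += softplus(-y * sim)
    return total / (len(image_emb) * len(text_emb))


images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = pairwise_sigmoid_loss(images, texts, [0, 1], [0, 1])
loss_swapped = pairwise_sigmoid_loss(images, texts, [0, 1], [1, 0])
```

Because every image-text pair is scored independently, the loss needs no softmax over the batch, which is the usual argument for the sigmoid formulation at large batch sizes.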

Total objective: the SCL and ITC terms are combined into a single training loss.

Training Protocol

Encoders: SigLIP image encoder (SoViT-400M or ViT-L); the text encoder and projector are frozen, and only the image encoder $E_\theta$ is updated.
Hyperparameters: Batch size of 8,192, 30 epochs, learning rate […], weight decay 0.01, […].

MegaStyle-Encoder achieves robust style discrimination and enables scalable style-based retrieval and alignment.

3. MegaStyle-FLUX: Paired-Supervision Style Transfer with FLUX

MegaStyle-FLUX adapts the FLUX text-to-image diffusion-transformer (MM-DiT) architecture for style transfer under strict paired supervision.

Model Architecture

Inputs (conditioning):
- Noisy image tokens (VAE-encoded target).
- Reference style image tokens (VAE-encoded, prepended).
- Content prompt tokens.
Positional Encoding: A “shifted RoPE” encoding applied to style tokens prevents position collisions and content leakage across input types.
Backbone: FLUX’s multi-modal DiT.
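The idea behind shifting the positions of the style tokens can be illustrated with a toy sketch. The text does not give the actual shift scheme, so everything here is an assumption: the two-dimensional (row, column) ids, the shift along the column axis by one grid width, and the helper names.

```python
def grid_position_ids(height, width, col_offset=0):
    """(row, col) position ids for a height x width grid of latent tokens."""
    return [(r, c + col_offset) for r in range(height) for c in range(width)]


def build_position_ids(height, width):
    """Position ids for the sequence [style tokens, noisy target tokens]."""
    target_ids = grid_position_ids(height, width)
    # Hypothetical shift: move the style grid past the target grid so that
    # no (row, col) id is shared between the two token types.
    style_ids = grid_position_ids(height, width, col_offset=width)
    return style_ids + target_ids


shifted = build_position_ids(4, 4)
unshifted = grid_position_ids(4, 4) + grid_position_ids(4, 4)
```

Without the shift, each style token shares a position id with the target token at the same grid location, which is the kind of collision the shifted encoding is described as preventing.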

Training Dynamics

Training data: MegaStyle-1.4M style image pairs.
Styling pairs: For each batch, two images from the same style group are sampled—one as the reference for conditioning, the other as the stylization target.
Loss: Standard denoising diffusion objective, where the conditioning includes both style and content tokens.

Parameter updates: Only transformer parameters are fine-tuned; all others frozen.
Training hyperparameters: 30,000 steps, batch size 8, learning rate […], LoRA rank 128, image size […].
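The pairing of reference and target images described above can be sketched as follows. The record layout `(image_id, style_id)` and the function names are hypothetical, and uniform sampling over style groups is an assumption.

```python
import random
from collections import defaultdict


def group_by_style(records):
    """Map style id -> list of image ids; records are (image_id, style_id)."""
    groups = defaultdict(list)
    for image_id, style_id in records:
        groups[style_id].append(image_id)
    return groups


def sample_training_pair(groups, rng):
    """Pick a style group, then two different images: (reference, target)."""
    eligible = [s for s, imgs in groups.items() if len(imgs) >= 2]
    style_id = rng.choice(eligible)
    reference, target = rng.sample(groups[style_id], 2)
    return style_id, reference, target


# Toy data: 3 style groups with 8 images each.
records = [(f"img-{s}-{k}", s) for s in range(3) for k in range(8)]
groups = group_by_style(records)
style_id, reference, target = sample_training_pair(groups, random.Random(0))
```

Because the two images share a style but come from different content prompts, the model cannot reach the target by copying the reference's content; it has to take the style from the reference and the content from the text prompt.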

MegaStyle-FLUX facilitates style transfer that reliably respects both input style and semantic content.

4. Empirical Analysis and Benchmarks

MegaStyle-1.4M, alongside MegaStyle-Encoder and MegaStyle-FLUX, establishes new performance standards across intra-style consistency, diversity, retrieval, and transfer tasks.

Intra-style Consistency and Inter-style Diversity

Qualitative analyses demonstrate that all images in a given style group maintain near-identical characteristics in color, texture, and brushwork across varied content, substantiating intra-style consistency. The dataset spans 8,355 overall style descriptors and 170,000 fine-grained classes, achieving long-tail diversity.

Style Similarity Retrieval

Style retrieval benchmarks (e.g., StyleRetrieval—2,400 styles × 32 contents) employ mAP@1, mAP@10, and Recall metrics. MegaStyle-Encoder exhibits substantial gains:

Encoder	Backbone	mAP@1	mAP@10	R@1/R@10
CLIP	ViT-L	9.29	6.46	9.29/31.56
CSD	ViT-L	45.60	37.78	45.60/79.18
MegaStyle-Encoder	ViT-L	87.26	85.98	87.26/97.61
SigLIP	SoViT	10.43	7.83	10.43/36.32
MegaStyle-Encoder	SoViT	88.46	86.77	88.46/97.66

MegaStyle-Encoder extends this lead by 10–20 points on StyleBench, FLUX-Retrieval, and OmniStyle-150K.
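The retrieval metrics used in this subsection can be sketched in pure Python. The benchmark's exact mAP@k definition is not given in the text, so the version below (mean of the precision values at the ranks where a hit occurs, within the top k) is an assumption. Averaging these per-query values over all queries would give mAP@k and R@k.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Gallery indices sorted by descending cosine similarity to the query."""
    return sorted(range(len(gallery)),
                  key=lambda i: cosine(query, gallery[i]), reverse=True)


def recall_at_k(ranked_labels, query_label, k):
    """1.0 if any of the top-k results shares the query's style, else 0.0."""
    return float(query_label in ranked_labels[:k])


def average_precision_at_k(ranked_labels, query_label, k):
    """Mean of precision@r over the ranks r <= k where a hit occurs."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
order = rank_gallery([1.0, 0.05], gallery)
ranked = [labels[i] for i in order]
```

Under this definition AP@1 equals Recall@1 for every query, which agrees with the identical mAP@1 and R@1 values in the table above.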

Style Transfer Evaluation

Multiple baselines (StyleCrafter, DEADiff, Attn-Distill, InstantStyle, CSGO, StyleAligned, StyleShot) are evaluated on both style and text alignment (cosine similarity in MegaStyle-Encoder/CLIP space) as well as human preferences:

Method	Style ↑	Text ↑	HumanStyle ↑	HumanText ↑
StyleCrafter	48.59	21.39	3.41	8.87
DEADiff	51.34	23.13	3.05	11.13
Attn-Distill	85.59	20.29	13.97	6.31
InstantStyle	71.41	20.77	18.19	10.98
CSGO	55.02	23.05	7.34	16.18
StyleAligned	59.80	21.31	7.46	4.12
StyleShot	63.42	21.79	15.21	13.69
MegaStyle-FLUX	76.16	23.20	31.37	28.72

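MegaStyle-FLUX attains the highest text alignment (23.20) and the highest human preference for both style (31.37) and text (28.72). Its automatic style score (76.16) is second to Attn-Distill (85.59), which in turn has the lowest text alignment of all methods (20.29).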

5. Ablation Studies and Robustness Assessments

Training Set Effects

Training Set	Style↑	Text↑
JourneyDB	34.56	21.12
OmniStyle-150K	51.49	23.02
MegaStyle-1.4M	76.16	23.20

Training on JourneyDB leads to poor intra-style consistency (often missing even color alignment), while OmniStyle-150K achieves color transfer but fails at higher-level stylistics. MegaStyle-1.4M enables both strong style and semantic alignment.

Encoder Generalization

On StyleBench (275 real paintings, 40 styles), FLUX-Retrieval, and OmniStyle-150K retrieval, MegaStyle-Encoder outperforms CLIP/CSD by 10–30 points in mAP/Recall, indicating robust generalization beyond the distribution of the generator (Qwen-Image).

Model Fine-tuning

Fine-tuning baseline models on MegaStyle-1.4M yields significant improvements:

StyleShot-FLUX (trained on StyleGallery): Style 57.06, Text 21.86.
StyleShot-FLUX-Mega (re-trained on MegaStyle-1.4M): Style 67.73, Text 23.27.
MegaStyle-FLUX: Style 76.16, Text 23.20.

A plausible implication is that style-paired supervision from MegaStyle-1.4M is critical for style transfer models to encode both low-level and structural stylistic features.

Qualitative Outcomes

Visualizations demonstrate that MegaStyle-FLUX uniquely captures and preserves intricate style elements—color, lighting, surface texture, and brushstroke—while maintaining content alignment with prompts.

6. Significance and Implications for Style Transfer Research

MegaStyle-1.4M, through its systematic pairing of style and content and its breadth of stylization, transforms both quantitative and qualitative evaluations in style-centric computer vision. The dataset provides a rigorous foundation for style similarity retrieval and robust, generalizable style transfer. The results imply a paradigm shift away from ad hoc style curation towards large-scale, automatically balanced style datasets, and suggest that modern diffusion models and contrastive objectives are synergistic when combined with high-coverage, high-consistency data. MegaStyle-1.4M and its associated benchmarks are poised to serve as definitive standards in style transfer and retrieval research (Gao et al., 9 Apr 2026).

References (1)

Gao et al. (2026). MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping.
