MegaStyle-1.4M: Large-Scale Style Dataset

Updated 3 July 2026
  • MegaStyle-1.4M is a large-scale, high-fidelity style dataset with 1.4M images organized into 170K fine-grained groups, ensuring both intra-style consistency and inter-style diversity.
  • It employs an automated curation pipeline and two-stage deduplication with hierarchical clustering to balance a long-tail distribution of natural and artistic styles.
  • The dataset underpins advanced models like MegaStyle-Encoder and MegaStyle-FLUX, setting new evaluation benchmarks in style retrieval and style transfer research.

MegaStyle-1.4M is a large-scale, high-fidelity style dataset constructed via an automated curation pipeline designed to maximize intra-style consistency and inter-style diversity in image stylization tasks. It enables style-supervised contrastive learning for representation extraction as well as training of generalizable, style-accurate transfer models. The dataset consists of 1.4 million images organized under 170,000 fine-grained style descriptions. Style and content prompts are generated separately by a vision-language model (Qwen3-VL), and each style-content combination is rendered by a text-to-image model (Qwen-Image). MegaStyle-1.4M establishes new evaluation standards for style retrieval and transfer, and underpins two models: the MegaStyle-Encoder and MegaStyle-FLUX.

1. Dataset Construction and Design Principles

The primary goals for MegaStyle-1.4M are threefold: ensuring images sharing a style description exhibit high intra-style consistency, maximizing coverage of the long-tail distribution of natural and artistic styles (inter-style diversity), and maintaining photorealistic, artifact-free image quality (Gao et al., 9 Apr 2026).

Image Pool Construction

  • Style-pool (2 million images):
    • 1 million deduplicated images from JourneyDB (Midjourney-derived).
    • 80,000 paintings from WikiArt.
    • 1 million synthetically stylized images filtered from LAION-Aesthetics using WikiArt tags.
  • Content-pool (2 million images):
    • Drawn from LAION-Aesthetics and explicitly filtered to exclude style-pool overlaps.

Prompt Generation and Balancing

Prompts are generated using Qwen3-VL (30B) under specialized instructions:

  • Style prompts (max 32 words): Describe overall artistic style, color composition, lighting, medium, surface texture, and brushwork.
  • Content prompts (max 64 words): Focus strictly on depicted objects and their visual relations, excluding all stylistic terms.

A two-stage deduplication and balance process, combining exact/fuzzy/semantic deduplication with hierarchical k-means (using mpnet embeddings), yields:

  • 170,000 unique style prompts.
  • 400,000 unique content prompts.

The resulting style prompts cover a long tail: they span 8,355 overall style descriptors, and the 30 most frequent styles account for only about 33% of occurrences.
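The exact and fuzzy stages of the deduplication described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the whitespace/case normalization, and the 0.9 similarity threshold are assumptions, and the semantic stage (mpnet embeddings with hierarchical k-means) is omitted.

```python
from difflib import SequenceMatcher


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivially different prompts compare equal."""
    return " ".join(prompt.lower().split())


def deduplicate(prompts, fuzzy_threshold=0.9):
    """Keep the first occurrence of each prompt, dropping exact and near duplicates.

    fuzzy_threshold is a hypothetical setting; the source does not state one.
    """
    kept, kept_keys, seen = [], [], set()
    for prompt in prompts:
        key = normalize(prompt)
        if key in seen:  # exact duplicate after normalization
            continue
        # Fuzzy duplicate: character-level similarity to any prompt already kept.
        if any(SequenceMatcher(None, key, other).ratio() >= fuzzy_threshold
               for other in kept_keys):
            continue
        seen.add(key)
        kept_keys.append(key)
        kept.append(prompt)
    return kept


prompts = [
    "Oil painting with thick impasto brushwork",
    "oil painting  with thick impasto brushwork",   # exact duplicate after normalization
    "Oil painting with thick impasto brushworks",   # near duplicate
    "Flat vector illustration with pastel colors",
]
print(deduplicate(prompts))
```

The pairwise comparison is quadratic in the number of kept prompts, so at the scale of millions of prompts a hashing-based near-duplicate method would likely be needed instead.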

Stylized Image Generation

For each style prompt, eight distinct content prompts are sampled, generating paired content-style combinations. Qwen-Image with FlowMatch scheduler (40 steps, cfg=4.0) synthesizes stylized outputs. The final dataset consists of:

  • MegaStyle-1.4M: 1.4 million images, each organized into one of 170,000 fine-grained style groups × 8 content pairings.
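The pairing step can be sketched as below, assuming uniform sampling without replacement from the content pool; the sampling strategy and the function name `build_pairs` are illustrative assumptions, since the text only states that eight distinct content prompts are drawn per style prompt.

```python
import random


def build_pairs(style_prompts, content_prompts, per_style=8, seed=0):
    """Pair every style prompt with `per_style` distinct content prompts."""
    rng = random.Random(seed)
    pairs = []
    for style in style_prompts:
        # rng.sample draws without replacement, so the contents are distinct.
        for content in rng.sample(content_prompts, per_style):
            pairs.append((style, content))
    return pairs


styles = [f"style {i}" for i in range(3)]
contents = [f"content {j}" for j in range(20)]
pairs = build_pairs(styles, contents)
print(len(pairs))      # 3 styles x 8 contents = 24
print(170_000 * 8)     # 1360000, i.e. roughly 1.4 million images
```

At full scale, 170,000 style prompts × 8 content prompts gives 1.36 million pairs, which matches the rounded figure of 1.4 million images.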

Comparative Dataset Analysis

Dataset Intra-style Consistency #Overall Styles #Fine-grained Styles #Style Images
WikiArt No 27 80K
JourneyDB No 300K 4.4M
Style30K No 1K 30K
IMAGStyle Yes 14 15K 210K
OmniStyle-150K Yes 1K 150K
MegaStyle-1.4M Yes 8,355 170K 1.4M

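Among the datasets that provide intra-style consistency, MegaStyle-1.4M is the largest (1.4M style images versus 210K for IMAGStyle and 150K for OmniStyle-150K) and the most finely organized (170K fine-grained styles versus 15K for IMAGStyle).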

2. Style-Supervised Contrastive Learning and MegaStyle-Encoder

MegaStyle-Encoder is a vision encoder fine-tuned to extract style-specific representations, prioritizing stylistic attributes over content.

Contrastive Objective

Given style–image pairs $\{(x_i, s_i)\}$, features are extracted as unit-normalized embeddings:

  • $z_i = E_\theta(x_i)/\|E_\theta(x_i)\|_2$ (image; $E_\theta$ = SigLIP image encoder)
  • $t_j = \phi(s_j)/\|\phi(s_j)\|_2$ (text; $\phi$ = SigLIP text encoder)

The Style-Supervised Contrastive Learning (SSCL) objective:

$$\mathcal{L}_{\text{scl}} = \frac{1}{MN} \sum_i \left[ -\frac{1}{|P(i)|} \sum_{p\in P(i)} \log\frac{\exp(z_i\cdot z_p/\tau)}{\sum_{a\in A(i)}\exp(z_i\cdot z_a/\tau)} \right]$$

where $P(i)$ are indices with the same style, $A(i)$ are all others, and $\tau=0.07$.

The image–text contrastive (ITC) term:

$$\mathcal{L}_{\text{itc}} = \frac{1}{MN^2}\sum_{i,j} \log\left(1+\exp(-y_{ij}\, z_i\cdot t_j)\right)$$

with $y_{ij} = 1$ if $(x_i, s_j)$ is a true pair, else $y_{ij} = -1$.

  • Total objective: combines $\mathcal{L}_{\text{scl}}$ and $\mathcal{L}_{\text{itc}}$.
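As a rough illustration of the two terms, the sketch below evaluates them on plain Python lists. It is not the training code: it assumes that $A(i)$ contains every index except the anchor, that $MN$ equals the batch size, and that the image–text term is averaged over all image–text pairs. The helper names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sscl_loss(z, styles, tau=0.07):
    """Style-supervised contrastive loss over unit-norm image embeddings z."""
    n = len(z)
    total = 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and styles[p] == styles[i]]
        if not positives:
            continue  # an anchor without a same-style partner contributes nothing
        # Assumption: A(i) is every index except the anchor itself.
        log_denom = math.log(sum(math.exp(dot(z[i], z[a]) / tau)
                                 for a in range(n) if a != i))
        log_probs = [dot(z[i], z[p]) / tau - log_denom for p in positives]
        total += -sum(log_probs) / len(positives)
    return total / n  # assumption: MN is the batch size


def itc_loss(z, t, y):
    """Sigmoid image-text loss; y[i][j] is +1 for a true pair and -1 otherwise."""
    total = 0.0
    for i in range(len(z)):
        for j in range(len(t)):
            total += math.log1p(math.exp(-y[i][j] * dot(z[i], t[j])))
    return total / (len(z) * len(t))  # assumption: mean over all pairs


z = [normalize(v) for v in ([1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0])]
print(sscl_loss(z, styles=[0, 0, 1, 1]))  # small: same-style embeddings are close
print(sscl_loss(z, styles=[0, 1, 0, 1]))  # large: same-style embeddings are far apart
```

The first call groups nearby embeddings under the same style and yields a loss near zero; the second assigns the same style to distant embeddings and is penalized heavily, which is the behavior the objective is meant to enforce.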

Training Protocol

  • Encoders: SigLIP image encoder (SoViT-400M or ViT-L); the text encoder and projector are frozen, and only $E_\theta$ is updated.
  • Hyperparameters: Batch size of 8,192, 30 epochs, learning rate […], weight decay 0.01, […].

MegaStyle-Encoder achieves robust style discrimination and enables scalable style-based retrieval and alignment.

3. MegaStyle-FLUX: Paired-Supervision Style Transfer with FLUX

MegaStyle-FLUX adapts the FLUX text-to-image diffusion-transformer (MM-DiT) architecture for style transfer under strict paired supervision.

Model Architecture

  • Inputs (conditioning):
    • Noisy image tokens (VAE-encoded target).
    • Reference style image tokens (VAE-encoded, prepended).
    • Content prompt tokens.
  • Positional Encoding: A “shifted RoPE” encoding applied to style tokens prevents position collisions and content leakage across input types.
  • Backbone: FLUX’s multi-modal DiT.
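One way to picture the collision problem that the shifted positional encoding addresses is to look at the position indices alone. The sketch below is purely illustrative: the text does not specify how the shift is computed, so the horizontal offset of one grid width and the function names are hypothetical.

```python
def grid_positions(height, width, row_offset=0, col_offset=0):
    """Return (row, col) position indices for a height x width token grid."""
    return [(r + row_offset, c + col_offset)
            for r in range(height) for c in range(width)]


def build_position_ids(height, width):
    """Position indices for prepended style tokens followed by target tokens."""
    target = grid_positions(height, width)
    # Hypothetical shift: move the style grid one full width to the right
    # so that no style token shares a position with a target token.
    style = grid_positions(height, width, col_offset=width)
    return style + target  # style tokens are prepended to the sequence


positions = build_position_ids(height=2, width=3)
style, target = positions[:6], positions[6:]
print(set(style) & set(target))  # empty set: no shared positions
```

Without the offset, both grids would use identical indices, so a style token and the target token at the same location would receive the same rotary phase.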

Training Dynamics

  • Training data: MegaStyle-1.4M style image pairs.
  • Styling pairs: For each batch, two images from the same style group are sampled—one as the reference for conditioning, the other as the stylization target.
  • Loss: Standard denoising diffusion objective, with the conditioning comprising both style and content tokens.

  • Parameter updates: Only transformer parameters are fine-tuned; all others frozen.
  • Training hyperparameters: 30,000 steps, batch size 8, learning rate […], LoRA rank 128, image size […].
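The pair sampling described above can be sketched as follows. The data layout (a dictionary from style group to a list of image path and content prompt tuples), the file names, and the function name are assumptions made for illustration only.

```python
import random


def sample_style_pair(groups, rng):
    """Draw a (reference, target) pair of distinct images from one style group.

    groups maps a style id to a list of (image_path, content_prompt) tuples;
    this layout is hypothetical.
    """
    style_id = rng.choice(sorted(groups))
    reference, target = rng.sample(groups[style_id], 2)  # two distinct images
    return {
        "style_id": style_id,
        "reference_image": reference[0],  # conditions the model on the style
        "target_image": target[0],        # image the model learns to denoise
        "content_prompt": target[1],      # text describing the target content
    }


groups = {
    "style_0": [("s0_a.png", "a cat on a sofa"), ("s0_b.png", "a harbor at dusk")],
    "style_1": [("s1_a.png", "a bowl of fruit"), ("s1_b.png", "a mountain road")],
}
example = sample_style_pair(groups, random.Random(0))
print(example)
```

Because the reference and the target share a style but differ in content, the model cannot reduce the loss by copying the content of the reference image; only the style is useful.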

MegaStyle-FLUX facilitates style transfer that reliably respects both input style and semantic content.

4. Empirical Analysis and Benchmarks

MegaStyle-1.4M, alongside MegaStyle-Encoder and MegaStyle-FLUX, establishes new performance standards across intra-style consistency, diversity, retrieval, and transfer tasks.

Intra-style Consistency and Inter-style Diversity

Qualitative analyses demonstrate that all images in a given style group maintain near-identical characteristics in color, texture, and brushwork across varied content, substantiating intra-style consistency. The dataset spans 8,355 overall style descriptors and 170,000 fine-grained classes, achieving long-tail diversity.

Style Similarity Retrieval

Style retrieval benchmarks (e.g., StyleRetrieval, with 2,400 styles × 32 contents) are scored with mAP@1, mAP@10, and Recall@1/Recall@10. MegaStyle-Encoder exhibits substantial gains over the CLIP, CSD, and SigLIP baselines:

| Encoder | Backbone | mAP@1 | mAP@10 | R@1 / R@10 |
|---|---|---|---|---|
| CLIP | ViT-L | 9.29 | 6.46 | 9.29 / 31.56 |
| CSD | ViT-L | 45.60 | 37.78 | 45.60 / 79.18 |
| MegaStyle-Encoder | ViT-L | 87.26 | 85.98 | 87.26 / 97.61 |
| SigLIP | SoViT | 10.43 | 7.83 | 10.43 / 36.32 |
| MegaStyle-Encoder | SoViT | 88.46 | 86.77 | 88.46 / 97.66 |

MegaStyle-Encoder extends this lead by 10–20 points on StyleBench, FLUX-Retrieval, and OmniStyle-150K.
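For readers unfamiliar with the retrieval metrics, the sketch below shows one common way to compute them for a single query. The exact definitions used by the benchmarks are not given in the text, so the AP@k normalization here (dividing by the smaller of k and the number of relevant items) is an assumption; under it, AP@1 and Recall@1 coincide, which agrees with the identical mAP@1 and R@1 columns in the table.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending similarity to the query."""
    scores = [dot(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def ap_at_k(relevant, k):
    """Average precision over the top-k results; `relevant` is a ranked list of bools."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(relevant[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(k, sum(relevant))
    return precision_sum / denom if denom else 0.0


def recall_at_k(relevant, k):
    """1.0 if at least one same-style image appears in the top-k, else 0.0."""
    return 1.0 if any(relevant[:k]) else 0.0


# Unit-norm embeddings: the query and gallery item 1 share a style.
query, query_style = [1.0, 0.0], "A"
gallery = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
gallery_styles = ["B", "A", "A"]

order = rank_gallery(query, gallery)
relevant = [gallery_styles[i] == query_style for i in order]
print(order, ap_at_k(relevant, 1), recall_at_k(relevant, 1))
```

Averaging `ap_at_k` and `recall_at_k` over all queries gives the mAP@k and R@k figures; the reported values appear to be scaled by 100.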

Style Transfer Evaluation

Multiple baselines (StyleCrafter, DEADiff, Attn-Distill, InstantStyle, CSGO, StyleAligned, StyleShot) are evaluated on both style and text alignment (cosine similarity in MegaStyle-Encoder/CLIP space) as well as human preferences:

| Method | Style ↑ | Text ↑ | HumanStyle ↑ | HumanText ↑ |
|---|---|---|---|---|
| StyleCrafter | 48.59 | 21.39 | 3.41 | 8.87 |
| DEADiff | 51.34 | 23.13 | 3.05 | 11.13 |
| Attn-Distill | 85.59 | 20.29 | 13.97 | 6.31 |
| InstantStyle | 71.41 | 20.77 | 18.19 | 10.98 |
| CSGO | 55.02 | 23.05 | 7.34 | 16.18 |
| StyleAligned | 59.80 | 21.31 | 7.46 | 4.12 |
| StyleShot | 63.42 | 21.79 | 15.21 | 13.69 |
| MegaStyle-FLUX | 76.16 | 23.20 | 31.37 | 28.72 |

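MegaStyle-FLUX attains the highest text alignment (23.20) and the highest human preference for both style (31.37) and text (28.72). Its automatic style score (76.16) is second to Attn-Distill (85.59), which in turn has the lowest text alignment (20.29) of all methods, so MegaStyle-FLUX offers the best balance between style fidelity and semantic preservation.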

5. Ablation Studies and Robustness Assessments

Training Set Effects

| Training Set | Style ↑ | Text ↑ |
|---|---|---|
| JourneyDB | 34.56 | 21.12 |
| OmniStyle-150K | 51.49 | 23.02 |
| MegaStyle-1.4M | 76.16 | 23.20 |

Training on JourneyDB leads to poor intra-style consistency (often missing even color alignment), while OmniStyle-150K achieves color transfer but fails at higher-level stylistics. MegaStyle-1.4M enables both strong style and semantic alignment.

Encoder Generalization

On StyleBench (275 real paintings, 40 styles), FLUX-Retrieval, and OmniStyle-150K retrieval, MegaStyle-Encoder outperforms CLIP/CSD by 10–30 points in mAP/Recall, indicating robust generalization beyond the distribution of the generator (Qwen-Image).

Model Fine-tuning

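Re-training a baseline model on MegaStyle-1.4M yields clear improvements (StyleShot-FLUX gains 10.67 points in style score and 1.41 points in text score), although it remains below MegaStyle-FLUX in style score: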

  • StyleShot-FLUX (trained on StyleGallery): Style 57.06, Text 21.86.
  • StyleShot-FLUX-Mega (re-trained on MegaStyle-1.4M): Style 67.73, Text 23.27.
  • MegaStyle-FLUX: Style 76.16, Text 23.20.

A plausible implication is that style-paired supervision from MegaStyle-1.4M is critical for style transfer models to encode both low-level and structural stylistic features.

Qualitative Outcomes

Visualizations indicate that MegaStyle-FLUX captures and preserves intricate style elements (color, lighting, surface texture, and brushstroke) while maintaining content alignment with prompts.

6. Significance and Implications for Style Transfer Research

MegaStyle-1.4M, through its systematic pairing of style and content and its breadth of stylization, transforms both quantitative and qualitative evaluations in style-centric computer vision. The dataset provides a rigorous foundation for style similarity retrieval and robust, generalizable style transfer. The results imply a paradigm shift away from ad hoc style curation towards large-scale, automatically balanced style datasets, and suggest that modern diffusion models and contrastive objectives are synergistic when combined with high-coverage, high-consistency data. MegaStyle-1.4M and its associated benchmarks are poised to serve as definitive standards in style transfer and retrieval research (Gao et al., 9 Apr 2026).
