MegaStyle: Large-Scale Style Dataset Pipeline
- MegaStyle is a fully automated, scalable data curation pipeline that builds a 1.4M image dataset by pairing fine-grained style and content prompts.
- It employs systematic prompt deduplication and hierarchical clustering to ensure both intra-style consistency and inter-style diversity across 170,000 unique style prompts.
- The framework integrates style-supervised contrastive learning and a FLUX-based diffusion model, enhancing the performance of style retrieval and transfer tasks.
MegaStyle is a fully automated, scalable data curation pipeline and framework for constructing large-scale style datasets and advancing style-specific representation learning and style transfer. It leverages the consistent text-to-image (T2I) style mapping capabilities of current generative models, particularly Qwen-Image, to generate high-quality, intra-style consistent, and inter-style diverse stylized images from systematically crafted style and content prompts. The pipeline results in the publicly available MegaStyle-1.4M dataset, a comprehensive resource facilitating both the training of style-specialized encoders and the development of advanced style transfer models (Gao et al., 9 Apr 2026).
1. Pipeline Architecture and Data Curation
MegaStyle is structured around an end-to-end automated pipeline designed to ensure both intra-style consistency (images with a shared style prompt) and inter-style diversity (coverage of a broad range of artistic styles). The pipeline operates on large pools of source imagery:
- Style pool: roughly 2 million images, comprising 1 million from JourneyDB, 80,000 from WikiArt, and 1 million stylized images from LAION-Aesthetics.
- Content pool: 2 million non-stylized images from LAION-Aesthetics.
Image captioning is conducted by Qwen3-VL with two specialized prompts:
- Style captions (32 words) focus on artistic style, color and light distribution, medium, texture, and brushwork, explicitly capturing characteristics such as "diffuse warm light, oil painting, impasto texture, short directional brushwork."
- Content captions provide object and relational descriptions in 64 words, intentionally omitting all style cues.
A large-scale prompt deduplication phase (NeMo Curator, including exact, fuzzy, and semantic checks) reduces approximately 4 million raw captions to 1 million unique prompts. Hierarchical k-means clustering is then applied (using mpnet embeddings, with four clustering levels: 50,000 → 10,000 → 5,000 → 1,000 clusters), combined with a top-down "cap" sampling scheme in which the cap is chosen so that the capped cluster sizes sum to the global target: $\sum_i \min(n_i, c) = N$, where $N$ is the global target, $n_i$ the cluster sizes, and $c$ the cap per cluster.
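The cap-based sampling step can be illustrated with a minimal sketch. It assumes the cap is the smallest integer at which the capped cluster sizes reach the global target, found by binary search; the function names `capped_total` and `find_cap` are hypothetical and the source does not state how the cap is actually computed.

```python
def capped_total(sizes, cap):
    """Number of items kept when every cluster is truncated to `cap`."""
    return sum(min(n, cap) for n in sizes)


def find_cap(sizes, target):
    """Smallest integer cap whose capped total reaches `target`.

    Assumption: binary search over the cap value; this is an illustrative
    choice, not the documented MegaStyle procedure.
    """
    if target > sum(sizes):
        raise ValueError("target exceeds the number of available items")
    lo, hi = 0, max(sizes, default=0)
    while lo < hi:
        mid = (lo + hi) // 2
        if capped_total(sizes, mid) >= target:
            hi = mid          # mid is large enough, try a smaller cap
        else:
            lo = mid + 1      # mid keeps too few items
    return lo


sizes = [1000, 400, 50, 10]   # toy cluster sizes
cap = find_cap(sizes, target=500)
kept = [min(n, cap) for n in sizes]
```

In this toy case the two large clusters are truncated while the two small ones are kept whole, which is how a cap flattens a long-tailed cluster distribution. With integer caps the kept total can slightly exceed the target, so a real pipeline would need a tie-breaking rule.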
The result is a balanced gallery of 170,000 style prompts (spanning 8,355 distinct "overall style" descriptors) and 400,000 content prompts.
2. MegaStyle-1.4M Dataset Generation and Properties
Stylized images are generated by pairing each of the 170,000 style prompts with 8 random content prompts. Qwen-Image is used to synthesize, for each pair, an image conditioned on the phrase "{content prompt} in the style of {style prompt}," resulting in the MegaStyle-1.4M dataset of 1.4 million stylized images.
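A minimal sketch of the pairing step follows. The function name `build_generation_prompts`, the fixed seed, and sampling without replacement within a style are assumptions for illustration; only the "{content prompt} in the style of {style prompt}" template and the factor of 8 come from the description above.

```python
import random


def build_generation_prompts(style_prompts, content_prompts, per_style=8, seed=0):
    """Pair each style prompt with `per_style` randomly chosen content prompts."""
    rng = random.Random(seed)
    prompts = []
    for style in style_prompts:
        # Assumption: content prompts are drawn without replacement per style.
        for content in rng.sample(content_prompts, per_style):
            prompts.append(f"{content} in the style of {style}")
    return prompts


styles = ["oil painting, impasto texture", "flat vector illustration"]
contents = [f"content prompt {i}" for i in range(20)]
prompts = build_generation_prompts(styles, contents)
```

At full scale, 170,000 style prompts × 8 content prompts gives 1,360,000 generation prompts, which matches the reported 1.4 million images after rounding.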
Table: Summary of MegaStyle-1.4M Construction
| Attribute | Value |
|---|---|
| Total images | 1,400,000 |
| Style prompts (fine-grained) | 170,000 |
| Images per style (avg) | 8 |
| High-level style descriptors | 8,355 |
Quality control is enforced through comprehensive use of Qwen3-VL in prompt generation, multi-stage deduplication, and balanced sampling procedures. Visual checks confirm the intra-style consistency of generated images, with strong qualitative separation among high-level styles.
3. Style-Supervised Contrastive Learning and Encoder Design
MegaStyle introduces style-supervised contrastive learning (SSCL) to optimize a style-specialized encoder, termed MegaStyle-Encoder, based on a SigLIP vision transformer. Two contrastive objectives are combined:
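- Intra-style Supervised Contrastive Loss (SCL), written here in the standard supervised contrastive form: $\mathcal{L}_{\mathrm{SCL}} = -\sum_i \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i^\top z_p / \tau)}{\sum_{a \in P(i) \cup N(i)} \exp(z_i^\top z_a / \tau)}$, where $z_i$ is the normalized feature for image $i$, $P(i)$ indexes images with the same style, and $N(i)$ refers to negatives.
- Image–Text Contrastive Loss (ITC), $\mathcal{L}_{\mathrm{ITC}}$, computed between image features and style-prompt embeddings, where $t_j$ is the SigLIP text embedding of prompt $j$ and $y_{ij} = 1$ if image $i$ matches prompt $j$.
The SSCL objective is the sum: $\mathcal{L}_{\mathrm{SSCL}} = \mathcal{L}_{\mathrm{SCL}} + \mathcal{L}_{\mathrm{ITC}}$. Training is conducted with batch size 8192 and temperature $\tau$, and spans 30 epochs, updating only image encoder parameters.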
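The intra-style term can be sketched in plain Python. This assumes the standard supervised contrastive formulation, in which each anchor is pulled toward same-style images and pushed away from all other images in the batch. The function name and the `temperature=0.1` default are placeholders, not values from the MegaStyle training setup, and the image–text term is omitted.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def supervised_contrastive_loss(features, style_ids, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have a positive.

    Assumption: standard SupCon form; temperature=0.1 is a placeholder.
    """
    z = [normalize(f) for f in features]
    losses = []
    for i, zi in enumerate(z):
        others = [a for a in range(len(z)) if a != i]
        positives = [a for a in others if style_ids[a] == style_ids[i]]
        if not positives:
            continue  # anchor has no same-style partner in the batch
        logits = {a: dot(zi, z[a]) / temperature for a in others}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        log_probs = [logits[p] - log_denominator for p in positives]
        losses.append(-sum(log_probs) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supervised_contrastive_loss(features, [0, 0, 1, 1])
shuffled = supervised_contrastive_loss(features, [0, 1, 0, 1])
```

When the style labels agree with the feature geometry (`aligned`), the loss is close to zero; when labels are shuffled, same-style images sit far apart and the loss is much larger.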
4. MegaStyle-FLUX: FLUX-Based Style Transfer
MegaStyle-FLUX adapts the FLUX (MM-DiT) text-to-image diffusion backbone for reference-based style transfer:
- Input pipeline: Reference style image is encoded to patch tokens via a frozen VAE; target (content) image tokens are then noised and combined with style tokens (using shifted RoPE for positional encoding) and text tokens (the content caption) before passing through MM-DiT.
- Only the cross-attention layers of the DiT backbone are fine-tuned (using LoRA rank 128); VAE and text encoder remain frozen.
- Training employs paired images from MegaStyle-1.4M sharing the same style prompt, with one image providing the style reference and the other serving as the supervised target for the diffusion prediction objective in latent space (classifier-free guidance, cfg=4.0).
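The pairing of reference and target images can be sketched as follows. The `(image_id, style_id)` record layout and the function name `sample_style_pairs` are assumptions made for illustration; the source only states that both images in a pair share a style prompt.

```python
import random
from collections import defaultdict


def sample_style_pairs(records, seed=0):
    """Return (reference, target, style_id) triples, one per style.

    `records` is an iterable of (image_id, style_id) tuples; this layout is
    hypothetical and not taken from the MegaStyle release.
    """
    rng = random.Random(seed)
    by_style = defaultdict(list)
    for image_id, style_id in records:
        by_style[style_id].append(image_id)

    pairs = []
    for style_id, image_ids in by_style.items():
        if len(image_ids) < 2:
            continue  # a style needs two images to form a pair
        reference, target = rng.sample(image_ids, 2)
        pairs.append((reference, target, style_id))
    return pairs


records = [(f"s0_{i}", "style-0") for i in range(8)]
records += [(f"s1_{i}", "style-1") for i in range(8)]
records += [("lonely", "style-2")]
pairs = sample_style_pairs(records)
```

Because the two images differ in content but share a style prompt, the target cannot be reproduced by copying the reference, which pushes the model to take only style from the reference image.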
5. Empirical Evaluation and Ablation Studies
MegaStyle-Encoder's performance is evaluated on the StyleRetrieval benchmark (2,400 unseen styles, 32 images per style), achieving:
- mAP@1 ≈ 88.5%, Recall@1 ≈ 88.5%, substantially higher than CSD (mAP@1 ≈ 45.6%) and CLIP/SigLIP baselines (<11%).
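A minimal sketch of how a top-1 retrieval score can be computed from encoder features is shown below. The query/gallery split and the function names are assumptions; the exact StyleRetrieval protocol is not described here.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(query_feats, query_styles, gallery_feats, gallery_styles):
    """Fraction of queries whose nearest gallery image has the same style."""
    hits = 0
    for q, style in zip(query_feats, query_styles):
        best = max(range(len(gallery_feats)),
                   key=lambda g: cosine(q, gallery_feats[g]))
        hits += gallery_styles[best] == style
    return hits / len(query_feats)


gallery_feats = [[0.9, 0.1], [0.1, 0.9]]
gallery_styles = ["a", "b"]
query_feats = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]
query_styles = ["a", "b", "b"]   # the third query is deliberately mismatched
score = recall_at_1(query_feats, query_styles, gallery_feats, gallery_styles)
```

Under the usual definitions, average precision truncated at rank 1 is 1 for a hit and 0 for a miss, so mAP@1 and Recall@1 coincide, which is consistent with the identical values reported above.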
MegaStyle-FLUX is assessed on the StyleBench benchmark (50 artworks & 20 prompts):
- Style alignment (cosine similarity in encoder space): 76.16 (StyleShot: 63.42)
- Text alignment (CLIP-text score): 23.20 (best baseline: 23.13)
- Human preference (style/text): 31.4/28.7 (second best: ≈18.2/16.2)
Training data ablations indicate that training on MegaStyle-1.4M yields substantial improvements (style: 76.2; text: 23.2) over training on JourneyDB (style: 34.6; text: 21.1) or OmniStyle-150K (style: 51.5; text: 23.0). The benefit carries over to other methods: fine-tuning StyleShot on MegaStyle-1.4M raises its style alignment from 57.1 to 67.7.
6. Contributions, Significance, and Known Limitations
MegaStyle is presented as the first fully automated pipeline to build a style dataset at this scale while preserving intra-style consistency, relying on large modern T2I models to provide 1.4M style-consistent images across 170,000 fine-grained style prompts. SSCL yields a robust, expressive style encoder that outperforms previous models by a substantial margin on the retrieval benchmark, while the FLUX-based style transfer model generalizes effectively to both unseen styles and real artworks.
Notable limitations include the tendency of current vision-language models (VLMs) to produce vague style captions and the presence of association biases within Qwen-Image outputs (e.g., overrepresentation of historical Japanese figures in particular style contexts). Planned future work includes refinement of VLM instruction prompts and further scaling of the pipeline to tens of millions of styles (Gao et al., 9 Apr 2026).