MultiAspect-4K-1M: High-Fidelity 4K Corpus
- MultiAspect-4K-1M is a large-scale, native 4K image corpus designed to advance ultra-high resolution text-to-image generation through balanced multi-aspect data.
- Its dual-channel curation pipeline employs AR-aware filtering and human-centric augmentation to ensure high visual, semantic, and technical quality.
- Rich bilingual captions and multimodal metadata support benchmarking, curriculum learning, and comprehensive image quality assessment.
MultiAspect-4K-1M is a large-scale, high-fidelity, native 4K image corpus specifically constructed to advance text-to-image generation at ultra-high resolutions across diverse aspect ratios. Designed and released as the foundation for the UltraFlux text-to-image model, MultiAspect-4K-1M comprises over one million curated 4K images with broad, controlled coverage of aspect ratios, dense bilingual (English/Chinese) captions, and richly structured vision-language model (VLM) and image quality assessment (IQA) metadata. Its design and curation pipeline are intended to provide balanced multi-aspect data for benchmarking and training state-of-the-art generative models, while supporting scalable, metadata-driven curricula and fine-grained quality evaluation (Ye et al., 22 Nov 2025).
1. Corpus Composition and Collection
MultiAspect-4K-1M consists of 1,007,230 images, each with a pixel count of at least 3840×2160, ensuring near-true native 4K resolution (average dimensions: 4521×4703). The dataset achieves broad, balanced coverage across landscape (e.g., 16:9, 3:2, 4:3), portrait/inverted (e.g., 9:16, 2:3), and square (1:1) aspect ratios, with every bucket containing thousands of images and no single aspect ratio dominating the distribution. For training, each image undergoes AR bucketing: it is snapped to the nearest of approximately 12 landscape, 4 portrait, and 1 square AR buckets (for example, 5440×3072, 3072×5440, 4096×4096), then center-cropped and resized (Ye et al., 22 Nov 2025).
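A minimal sketch of this bucketing step is shown below, assuming a hypothetical (abbreviated) bucket list and standard center-crop-then-resize preprocessing with Pillow; the exact bucket set and snapping rule used for MultiAspect-4K-1M may differ.

```python
from PIL import Image

# Illustrative subset of AR buckets (W, H); the paper describes ~12 landscape,
# 4 portrait, and 1 square bucket, e.g. 5440x3072, 3072x5440, 4096x4096.
AR_BUCKETS = [(5440, 3072), (4096, 4096), (3072, 5440)]

def nearest_bucket(width: int, height: int) -> tuple[int, int]:
    """Snap an image to the bucket whose aspect ratio is closest to its own."""
    ar = width / height
    return min(AR_BUCKETS, key=lambda b: abs(b[0] / b[1] - ar))

def bucket_and_resize(img: Image.Image) -> Image.Image:
    """Center-crop to the bucket's aspect ratio, then resize to the bucket size."""
    bw, bh = nearest_bucket(*img.size)
    target_ar = bw / bh
    w, h = img.size
    if w / h > target_ar:              # too wide: crop width
        new_w = int(h * target_ar)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:                              # too tall: crop height
        new_h = int(w / target_ar)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    return img.resize((bw, bh), Image.LANCZOS)
```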
The collection and curation pipeline comprises two primary channels:
- General AR-Aware Path: Applies a safety filter (NSFW screening, pixel count ≥3840×2160, native AR preserved), visual-semantic quality screening (Q-Align score ≥4.0), aesthetic ranking (top 30% by ArtiMuse), and classical image guards (Sobel flatness ≥800; Shannon entropy H ≥7.0).
- Human-Centric Augmentation: Identifies images with detected persons using YOLOE, applies the same safety and quality checks, and guarantees at least one person detection.
Final selection deduplicates and merges both channels, yielding one million images with unified, structured metadata.
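The per-image gates above can be summarized as a simple filtering function. The sketch below is illustrative only: it assumes precomputed Q-Align, ArtiMuse-percentile, flatness, and entropy values plus a YOLOE person flag, and it omits the deduplication and merging logic, which the source does not specify at this level of detail.

```python
from dataclasses import dataclass

@dataclass
class ImageRecord:
    nsfw: bool
    width: int
    height: int
    q_align: float              # semantic quality score, 1.0-5.0
    artimuse_percentile: float  # 0.0 (worst) .. 1.0 (best)
    sobel_flatness: float       # patch-level Sobel variance
    entropy: float              # Shannon entropy (bits)
    has_person: bool            # YOLOE person detection

def passes_general_path(r: ImageRecord) -> bool:
    """General AR-aware channel: safety + semantic + aesthetic + classical guards."""
    return (not r.nsfw
            and r.width * r.height >= 3840 * 2160
            and r.q_align >= 4.0
            and r.artimuse_percentile >= 0.70   # top 30% by ArtiMuse
            and r.sobel_flatness >= 800
            and r.entropy >= 7.0)

def passes_human_centric_path(r: ImageRecord) -> bool:
    """Human-centric channel: same gates, plus at least one detected person."""
    return r.has_person and passes_general_path(r)
```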
2. Captions, Language Metadata, and Semantic Labels
Each image in MultiAspect-4K-1M is annotated with highly detailed English captions (average length: 125.1 tokens, generated by Gemini-2.5-Flash LMM) and corresponding Chinese translations via Hunyuan-MT-7B. Post-translation, LLM-based consistency checks reject low-confidence or semantically mismatched translations, maintaining low-noise and high-detail captions. Caption generation is confined to the already curated set, excluding noisy or off-topic samples (Ye et al., 22 Nov 2025).
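The paper's LLM-based consistency check is not described in implementation detail; as a rough stand-in, the sketch below gates caption pairs by multilingual sentence-embedding similarity (the embedding model choice and the 0.80 threshold are assumptions, not taken from the source).

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual embedding model used here as a stand-in for the LLM-based check.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def translation_is_consistent(en_caption: str, zh_caption: str,
                              threshold: float = 0.80) -> bool:
    """Accept an EN/ZH caption pair only if their embeddings are close enough.

    The 0.80 cosine-similarity threshold is illustrative, not from the paper.
    """
    emb = model.encode([en_caption, zh_caption], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```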
Subject tags extend beyond image-level labels to include binary “character” flags (open-vocabulary human presence from YOLOE) and additional open-vocabulary object or scene descriptors. The complete per-image data schema is:
| Field | Description |
|---|---|
| resolution | Image pixel dimensions |
| aspect_ratio | AR bucket |
| Q-Align_score | Semantic quality (1.0–5.0), filtered at ≥4.0 |
| ArtiMuse_score | Fine-grained aesthetic rating, top 30% retained |
| flatness | Patch-level Sobel variance, thresholded at 800 |
| entropy | Shannon entropy, H ≥7.0 |
| English_caption | Detailed English caption (Gemini-2.5-Flash) |
| Chinese_caption | Chinese translation (Hunyuan-MT-7B) |
| character_flag | Binary human-presence flag (YOLOE detection) |
| subject_tags | Open-vocabulary object/scene categories |
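An illustrative, entirely invented record conforming to this schema might look as follows (the ArtiMuse score scale is assumed):

```python
# Hypothetical example record; all values are invented for illustration only.
example_record = {
    "resolution": [5440, 3072],
    "aspect_ratio": "16:9",
    "Q-Align_score": 4.3,          # filtered at >= 4.0
    "ArtiMuse_score": 78.5,        # top 30% retained (score scale assumed)
    "flatness": 1250.0,            # Sobel variance, thresholded at 800
    "entropy": 7.4,                # Shannon entropy, H >= 7.0
    "English_caption": "A wide-angle view of a coastal city at dusk ...",
    "Chinese_caption": "黄昏时分沿海城市的广角景观……",
    "character_flag": False,       # YOLOE human detection
    "subject_tags": ["city", "coastline", "sunset", "architecture"],
}
```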
3. VLM and IQA Metadata
MultiAspect-4K-1M is distinguished by its deep integration of VLM and IQA supervision:
- Q-Align: An LMM-based visual-semantic quality scorer (1.0–5.0), used both as per-image quality supervision (via discrete, text-defined rating levels) and for dataset filtering (≥4.0).
- ArtiMuse: A multimodal LMM estimating aesthetic quality (numeric score plus explanations), driving both curation (top 30%) and curriculum fine-tuning (top 5%).
- Classical Signals: Patch-level flatness (Sobel variance) and Shannon entropy ensure technical quality and visual diversity.
Metadata enables stratified sampling, AR- and resolution-aware subdivisions, and informed curriculum learning strategies (Ye et al., 22 Nov 2025).
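As one way such metadata could drive stratified, curriculum-oriented subset construction, the sketch below filters a hypothetical pandas DataFrame with the schema above, taking the top-5% ArtiMuse slice per AR bucket; the column names, per-bucket quantile choice, and grouping are assumptions rather than the authors' documented procedure.

```python
import pandas as pd

def curriculum_subsets(df: pd.DataFrame):
    """Split the corpus into a full-training set and a high-aesthetic subset.

    The top-5% ArtiMuse cut is taken per aspect-ratio bucket so the fine-tuning
    subset keeps the corpus's multi-AR balance (illustrative choice).
    """
    full = df[df["Q-Align_score"] >= 4.0]
    top5 = (full.groupby("aspect_ratio", group_keys=False)
                .apply(lambda g: g[g["ArtiMuse_score"]
                                   >= g["ArtiMuse_score"].quantile(0.95)]))
    return full, top5
```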
4. Resolution- and Aspect-Ratio-Aware Sampling
Training on MultiAspect-4K-1M enforces strict AR and resolution balance through bucketed sampling. Raw resolutions are matched to the nearest predefined bucket, followed by center-cropping and resizing. Uniform sampling across buckets ensures mini-batches span the entire AR spectrum, preventing model overfitting to common (e.g., square or landscape) ARs. While no explicit probability formula is provided, sampling is “AR-aware” by construction via this bucketing and uniform draw strategy (Ye et al., 22 Nov 2025).
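A minimal sketch of such an AR-aware, uniform-over-buckets draw is given below; it assumes records carrying an `aspect_ratio` field and that every bucket holds at least one full batch (which the corpus statistics support), and it simplifies away shard- and epoch-level details.

```python
import random
from collections import defaultdict

def ar_uniform_batches(records, batch_size):
    """Yield mini-batches drawn from one AR bucket at a time, choosing the
    bucket uniformly at random so no aspect ratio dominates training."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["aspect_ratio"]].append(r)
    bucket_keys = list(buckets)
    while True:
        key = random.choice(bucket_keys)          # uniform over buckets
        yield random.sample(buckets[key], batch_size)  # same-shape batch
```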
5. Role in UltraFlux Data-Model Co-Design
MultiAspect-4K-1M underpins UltraFlux’s “data–model co-design” principle. The dataset’s controlled multi-AR coverage allows positional encoding schemes (e.g., Resonance 2D RoPE with YaRN) to be trained and extrapolated robustly over wide, tall, and square configurations. Bilingual captions enable broader language-conditioned generation. Rich VLM/IQA metadata supports dynamic, curriculum-based optimization—specifically, Stage-wise Aesthetic Curriculum Learning (SACL):
- Stage 1: Train on the full dataset across all diffusion timesteps.
- Stage 2: Fine-tune on the top 5% of images by ArtiMuse score, focusing on high-noise timesteps with weighting controlled by the model prior.
This enables UltraFlux to simultaneously maintain high detail preservation, resolution generalization, and strong language-image alignment. In benchmark evaluations (Aesthetic-Eval at 4096, multi-AR 4K), UltraFlux trained on MultiAspect-4K-1M consistently outperforms open-source baselines and certain proprietary ones (Ye et al., 22 Nov 2025).
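A schematic of the two-stage SACL schedule is sketched below, with a hypothetical high-noise timestep sampler standing in for the paper's prior-controlled weighting; the 0.7 cutoff and 1000-step schedule are assumptions, not values from the source.

```python
import torch

def sample_timesteps(batch, stage: int, num_steps: int = 1000):
    """Stage 1: uniform over all diffusion timesteps.
    Stage 2: bias toward high-noise timesteps (hypothetical scheme; the
    paper's prior-controlled weighting is not reproduced here)."""
    if stage == 1:
        return torch.randint(0, num_steps, (len(batch),))
    return torch.randint(int(0.7 * num_steps), num_steps, (len(batch),))

def sacl_training(model, full_loader, top5_loader, train_step):
    for batch in full_loader:        # Stage 1: full corpus
        train_step(model, batch, sample_timesteps(batch, stage=1))
    for batch in top5_loader:        # Stage 2: top 5% by ArtiMuse
        train_step(model, batch, sample_timesteps(batch, stage=2))
```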
6. Related Methodologies: Image Quality Assessment and Computational Efficiency
Processing and evaluating 4K-scale image datasets at million-image scale entails substantial computational cost. Leading IQA architectures, such as the multi-branch DNN in "Assessing UHD Image Quality from Aesthetics, Distortions, and Saliency" (Sun et al., 1 Sep 2024), address this by decomposing quality assessment into separate branches for global aesthetics, local technical distortions (mini-patch sampling), and salient content. At 43.5 GMACs and 0.068 s per image on commodity hardware (an RTX 3090), such a model can score all of MultiAspect-4K-1M in roughly 19 GPU-hours, demonstrating the feasibility of near-real-time, corpus-scale analysis.
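The 19 GPU-hour figure follows directly from the corpus size and the reported per-image latency:

```python
num_images = 1_007_230           # MultiAspect-4K-1M corpus size
secs_per_image = 0.068           # reported RTX 3090 throughput
gpu_hours = num_images * secs_per_image / 3600
print(f"{gpu_hours:.1f} GPU-hours")   # ~19.0
```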
The metadata collection and post-hoc filtering mechanisms in MultiAspect-4K-1M (Q-Align, ArtiMuse, flatness, entropy) align with these strategies—leveraging both neural and classical IQA to guarantee consistently high technical and semantic quality (Sun et al., 1 Sep 2024, Ye et al., 22 Nov 2025).
7. Broader Implications and Extensions
MultiAspect-4K-1M introduces a scalable AR- and resolution-diverse image corpus with descriptive, bilingual captions and structured multimodal metadata, facilitating:
- Multi-aspect benchmarking for generative models at ultra-high resolutions.
- Curriculum learning and granular quality control through VLM/IQA scores.
- Fair, language-inclusive evaluation across global contexts.
Potential extensions of the MultiAspect-4K-1M paradigm include semantic-importance branches (e.g., explicit face/body detectors), temporal-consistency tracking for video, and further reductions in computational cost through more efficient backbones or content-adaptive sampling strategies (Sun et al., 1 Sep 2024). A plausible implication is that future 4K-scale datasets will further intertwine semantic, aesthetic, and technical criteria at million-image scale, accelerating progress in both foundation models and scalable image assessment.