DanQing: Chinese Vision–Language Dataset
- DanQing is a large-scale, up-to-date Chinese vision–language pre-training resource comprising 100M high-quality image–text pairs drawn from 2024–2025 web crawls.
- Its rigorous multi-stage filtering pipeline, combining coarse and fine-grained text and image screening with cross-modal alignment checks, retains only the top ~9.5% of the initially scraped data, with a 100% download success rate for the released pairs.
- Evaluations demonstrate that models pre-trained on DanQing achieve significant gains in zero-shot classification, cross-modal retrieval, and large multimodal benchmarks over previous Chinese datasets.
DanQing is a large-scale, up-to-date Chinese vision–language pre-training dataset comprising 100 million high-quality image–text pairs collected from Common Crawl web crawls conducted between 2024 and 2025. Its primary purpose is to address the longstanding scarcity of high-quality Chinese image–text data, which has impeded the progress of Chinese vision–language pre-training models relative to their English counterparts. By systematically curating contemporary, semantically rich, and rigorously filtered samples, DanQing enables state-of-the-art performance in cross-modal retrieval, zero-shot classification, and large multimodal model (LMM) evaluations, while also reflecting the latest linguistic and cultural trends such as emergent video games and new consumer products. Released under the permissive Creative Commons CC-BY 4.0 license, DanQing is intended as a foundational resource for both academic and commercial research in Chinese multimodal AI (Shen et al., 15 Jan 2026).
1. Motivation and Distinguishing Features
DanQing was created to overcome the limitations of preceding Chinese vision–language datasets such as Wukong, Zero, and TaiSu. While datasets like COYO-700M and LAION-400M catalyzed advances in English-based VLP models (e.g., CLIP, SigLIP), Chinese models suffered from outdated corpora and inconsistent quality controls. DanQing’s contribution lies in its rigorous pipeline, its strict emphasis on post-2024 data for contemporary relevance, and its substantially higher post-download retention and quality rates. Notably, DanQing achieves a 100% post-download success rate from an initially scraped pool of 1.05 billion image–text pairs, ultimately retaining only the top ∼9.5% after multi-stage filtration, thus ensuring both temporal recency and semantic density. Its focus on up-to-date web content ensures that downstream models can immediately recognize novel cultural, entertainment, and technological entities rather than falling back on stale knowledge.
2. Data Collection and Processing Pipeline
The construction of DanQing proceeds in seven parallel batches and encompasses multi-stage filtering at both the text and image levels, as summarized below:
| Stage | Input Size | Output Size |
|---|---|---|
| Raw scrape (“zho” tag) | 1.05B pairs | 1.05B pairs |
| Coarse text/image filtering | 1.05B | 706M |
| Successful image downloads | 706M | 475M |
| Fine-grained text filtering | 475M | 397M |
| Image-level filtering | 397M | 178M |
| Cross-modal filtering | 178M | ≈100M |
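The stage-wise retention implied by the table can be verified with a few lines of arithmetic (pair counts taken from the table above):

```python
# Pair counts per pipeline stage, as reported in the table above.
stages = [
    ("raw scrape", 1_050_000_000),
    ("coarse text/image filtering", 706_000_000),
    ("successful image downloads", 475_000_000),
    ("fine-grained text filtering", 397_000_000),
    ("image-level filtering", 178_000_000),
    ("cross-modal filtering", 100_000_000),
]

for (_, prev), (name, kept) in zip(stages, stages[1:]):
    print(f"{name}: {kept / prev:.1%} of the previous stage retained")

overall = stages[-1][1] / stages[0][1]
print(f"overall retention: {overall:.1%}")  # ~9.5% of the raw scrape
```

The 67.3% survival rate at the download stage matches the reported URL success rate, and the overall retention reproduces the ∼9.5% figure.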
Key filtration mechanisms are:
- Coarse-grained filter: Content safety (1M-param classifier), text length (5–60 words), and domain blacklisting narrow candidates to 706M.
- Image downloads: 67% of the remaining URLs download successfully (706M → 475M).
- Text filtration (four stages):
- FastText language detection and Simplified Chinese conversion (OpenCC),
- Lexical integrity (removal if lacking nouns or exceeding 5 [UNK] tokens after SigLIP tokenization),
  - A minimum character-level entropy threshold, removing repetitive or low-information captions,
- NSFW and political content exclusion (20M-param NSFW detector, Baidu’s DataBuilder).
- Image filtration: Retention based on aspect ratio, minimum pixel dimensions, visual variance, blurriness (measured by Laplacian variance), image entropy, duplicate clustering (Chinese-CLIP-L14 embeddings, similarity threshold 0.1), and NSFW/hazardous content removal.
- Cross-modal filtration: Only image–text pairs whose CLIP similarity scores fall within a bounded range are retained, eliminating both weakly aligned and OCR-dominated samples. Cross-batch deduplication ensures dataset uniqueness.
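Several of the scalar statistics used by these filters are straightforward to compute. The block below is a minimal sketch, assuming 8-bit grayscale images as NumPy arrays; the specific cut-off values are not reproduced here and would follow the paper's thresholds:

```python
import math
from collections import Counter

import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float64)

def laplacian_variance(gray: np.ndarray) -> float:
    """Blurriness score: variance of the Laplacian response (higher = sharper)."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for di in range(3):          # valid 2-D convolution with the 3x3 kernel
        for dj in range(3):
            out += LAPLACIAN[di, dj] * gray[di:di + h - 2, dj:dj + w - 2]
    return float(out.var())

def image_entropy(gray: np.ndarray) -> float:
    """Shannon entropy (bits) of the 256-bin grayscale histogram."""
    hist = np.bincount(gray.astype(np.uint8).ravel(), minlength=256)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def text_entropy(caption: str) -> float:
    """Character-level Shannon entropy (bits) of a caption."""
    counts = Counter(caption)
    n = len(caption)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A flat image is maximally blurry and carries no information;
# a checkerboard is sharp and uses two intensity levels (1 bit of entropy).
flat = np.zeros((32, 32))
checker = np.indices((32, 32)).sum(axis=0) % 2 * 255.0
print(laplacian_variance(flat), image_entropy(flat))            # 0.0 0.0
print(laplacian_variance(checker) > 0, image_entropy(checker))  # True 1.0
```

In practice the blurriness check would be applied with a library routine (e.g., an OpenCV Laplacian) rather than an explicit convolution loop; the loop above only makes the computation self-contained.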
3. Dataset Composition and Statistics
DanQing comprises approximately 2.2 billion Chinese words, with captions uniformly distributed in length from 5 to 60 words (average: 22 tokens), supporting both concise and descriptive labeling. The image resolution spans from 300×300 to 500×500 px for the majority, with a significant long tail exceeding 1024 px, aiding scale-invariant feature learning. Topic modeling conducted via BERTopic, UMAP, and HDBSCAN on a 10M-pair subset uncovers coherent clusters across domains such as fashion, cuisine, travel, education, and agriculture. The top 40 source domains (e.g., alicdn.com, baidu.com, wp.com) reflect a heterogeneous provenance, incorporating e-commerce, media, search, and user-generated content. The dataset’s rigorous filtering maintains high signal-to-noise ratios and captures emerging concepts, such as "Black Myth: Wukong" and "Xiaomi SU7," ensuring models trained on DanQing recognize contemporary entities.
4. Evaluation Protocols and Empirical Performance
DanQing’s efficacy is validated through continual pre-training of the SigLIP2 dual-encoder model (two epochs on 16×A800 GPUs; AdamW with weight decay 0.1, a 1000-step warmup, a global batch of 768×16, and text capped at 64 tokens), compared against equivalent 100M-sample subsets from Zero and TaiSu. Evaluations encompass:
- Zero-shot image classification (Caltech101, CIFAR10, etc.): DanQing pre-training provides average top-1 accuracy improvements of +7.6% (B/32), +7.8% (B/16), and +7.7% (L/16) over SigLIP2 baseline, outperforming Wukong by ∼1.9% and Zero by up to 1.0%. For instance, SigLIP2-B/32 achieves 65.4% (DanQing) vs 57.8% (baseline).
- Short-caption cross-modal retrieval (Flickr30K-CN, MSCOCO-CN, MUGE): Measured by Recall@k, DanQing surpasses Wukong and Zero by ∼2.1–2.8% in R@1–R@10. SigLIP2-B/32 reaches R@1=54.2% (DanQing) vs 49.8% (Wukong).
- Long-caption retrieval (DCI-CN, DOCCI-CN): DanQing’s SigLIP2-L/16 sees R@1=57.3%, a 12.8% absolute gain over Wukong.
- LMM performance (MMBench, MME-RW, CMMMU, OCRBench V2): SigLIP2-L/16, as a frozen vision encoder in Qwen2-7B, achieves a 50.1% average score, exceeding previous bests (49.5% Wukong).
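Recall@k, the retrieval metric reported above, can be sketched as follows. This is a minimal illustration assuming a square image–text similarity matrix whose correct matches lie on the diagonal:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth match (the diagonal entry)
    appears among the top-k most similar candidates."""
    n = sim.shape[0]
    topk = np.argsort(-sim, axis=1)[:, :k]            # indices of the k best candidates
    hits = (topk == np.arange(n)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy 3x3 similarity matrix: queries 0 and 2 rank their match first,
# query 1 ranks its match only second.
sim = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.6, 0.1],
                [0.3, 0.2, 0.7]])
print(recall_at_k(sim, 1))  # 2/3 of queries hit at rank 1
print(recall_at_k(sim, 2))  # 1.0
```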
DanQing further exhibits higher text semantic density (proportion of nouns/verbs/adjectives, optimal perplexity 50–200) and more uniformly distributed visual clusters, attenuating the long-tail effect.
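The perplexity band mentioned above can be applied as a simple keep/drop rule. The sketch below assumes per-token log-probabilities supplied by some external language model (not specified here); only the 50–200 band follows the text:

```python
import math

def perplexity(logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def keep_caption(logprobs: list[float], lo: float = 50.0, hi: float = 200.0) -> bool:
    """Keep captions in the 'optimal' perplexity band: very low perplexity
    suggests boilerplate, very high perplexity suggests noise."""
    return lo <= perplexity(logprobs) <= hi

# A caption whose tokens each have probability 1/100 has perplexity ~100 -> kept.
lp = [math.log(1 / 100)] * 12
print(perplexity(lp), keep_caption(lp))
```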
5. Comparative Analysis with Existing Datasets
Compared to Wukong (100M pairs, ~85% URL success), Zero (250M, ~60% success), and TaiSu (166M, synthetic-augmented), DanQing offers:
- Superior curation: 100% post-download success rate vs partial for others.
- Strict up-to-dateness: Focused exclusively on web data from 2024–2025.
- Data efficiency: Retains only top ∼9.5% after filtration, maximizing signal-to-noise.
- Greater coverage: Enhanced representation of emergent topics and semantic trends.

Scaling analyses indicate that Wukong’s pre-training utility plateaus beyond 30M samples, whereas DanQing maintains steady performance gains up to the full 100M. Model scaling with DanQing produces steeper benefits across parameter ranges from 86M to 1B, indicating that the utility of high-quality data persists as model size increases.
6. Broader Implications and Prospective Extensions
DanQing’s design enables vision–language models to rapidly assimilate and reason about newly emergent cultural and commercial concepts, a capability demonstrated by higher model confidence on contemporary terms. Its open license (CC-BY 4.0) facilitates use and extension in both academic and commercial domains. Plausible avenues for future enhancement include integrating video–text modalities, developing bilingual or domain-targeted variants, and continually updating the corpus to capture linguistic drift. A plausible implication is that regular dataset refreshment will be essential for keeping pace with the evolving Chinese web landscape and maintaining the relevance of downstream models. DanQing is positioned to serve as a pivotal foundation for the next generation of Chinese multimodal AI, supporting advanced visual understanding, enhanced retrieval, and rapid real-world adaptation (Shen et al., 15 Jan 2026).