
Chinese Vision-Language Pre-training

Updated 28 February 2026
  • Chinese vision-language pre-training is a framework that jointly learns visual and textual representations in Chinese using large-scale, high-fidelity multimodal datasets.
  • It leverages both dual-encoder and unified generative transformer architectures, employing advanced tokenization, multi-stage filtering, and locked-image tuning methodologies.
  • Innovative techniques like multi-view contrastive learning and feature-map distillation drive superior cross-modal retrieval and zero-shot performance while highlighting challenges in token reduction and generative stability.

Chinese vision-language pre-training (VLP) encompasses computational frameworks, large-scale datasets, and architectural principles for joint learning of visual and textual representations directly in the Chinese language. Unlike the predominantly English-centric VLP landscape exemplified by CLIP and ALIGN, Chinese VLP targets applications ranging from cross-modal retrieval and zero-shot classification to generative captioning and bidirectional text-to-image synthesis, driven by the explosion in scale and diversity of Chinese web and social media data. Advances build upon both contrastive dual-encoder paradigms and unified generative models, with pre-training datasets scaling from tens of millions to billions of image–text pairs acquired via web crawling, machine translation, and high-precision filtering (Gu et al., 2022, Yang et al., 2022, Shen et al., 15 Jan 2026, Shan et al., 2022, Zhang et al., 2021). Methodological innovations include contrastive and multi-view alignment objectives, robust tokenization strategies for Chinese, pipeline optimizations for filtering noise, and scalable evaluation on both curated and open-domain Chinese visual-language benchmarks.

1. Large-Scale Chinese Vision-Language Datasets

The evolution of Chinese VLP has been fundamentally enabled by the construction of massive, high-fidelity image–text, video–text, and multimodal datasets:

  • Wukong dataset: Contains ~100 million web-crawled Chinese image–text pairs, curated from Baidu search results. The pipeline incorporates linguistic/visual filtering, frequent-string suppression, and explicit sensitive term exclusion. Final splits include a human-verified test set (Wukong-Test, 33,365 pairs). Average caption length is 14 tokens, with long-tail topical coverage exceeding 50,000 concepts (Gu et al., 2022).
  • DanQing: Derived from 2024–25 Common Crawl (CC), DanQing features 100 million pairs after rigorous automated filtering. Steps include coarse and fine text/image filtering, semantic alignment via CLIP-like scores, and large-scale deduplication, with coverage of contemporary Chinese domains (e-commerce, news, social, technical, etc.) (Shen et al., 15 Jan 2026).
  • Chinese CLIP corpus: Assembles roughly 200M pairs from LAION-5B (Chinese-flagged), Wukong, machine-translated Visual Genome/MSCOCO, and internal high-quality sets. Filters eliminate noise via cross-modal CLIP scores, blacklist rules, and length normalization (Yang et al., 2022).
  • Youku-mPLUG: Focused on video-language modeling, this set collects 10 million video–text pairs from Youku, with strict safety/diversity controls, detailed frame preprocessing, and extensive manual annotation spanning category classification, captioning, and retrieval (Xu et al., 2023).

The prevalence of multi-stage cleansing, language normalization, and explicit image–text alignment filtering distinguishes Chinese datasets from earlier single-pass filtering protocols in English (e.g., LAION-400M, COYO-700M). As a result, Chinese VLP now supports state-of-the-art pre-training recipes and robust evaluation across zero-shot and retrieval benchmarks. Table 1 summarizes dataset properties:

| Dataset      | Size (pairs) | Modality | Years   | Notable Filtering               |
|--------------|--------------|----------|---------|---------------------------------|
| Wukong       | ~100M        | Img–Txt  | 2020s   | Multi-level, human-verified test set |
| DanQing      | ~100M        | Img–Txt  | 2024–25 | Full pipeline, alignment filtering   |
| Chinese CLIP | ~200M        | Img–Txt  | 2020s   | CLIP score, blacklist                |
| Youku-mPLUG  | ~10M         | Vid–Txt  | 2020s   | Manual review, CLIP alignment        |

2. Model Architectures and Pre-training Objectives

Chinese VLP advances utilize both dual-encoder (two-tower) and unified generative transformer frameworks:

Contrastive Dual-Encoder Models:

  • CLIP-style frameworks feature a vision encoder (ViT, ResNet, Swin) and a text encoder (BERT, RoBERTa, ERNIE), with embeddings projected and normalized before alignment. The primary objective is the symmetric InfoNCE loss on batches of image–text pairs:

L = -\frac{1}{N}\sum_{i=1}^{N} \log\frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(v_i, t_j)/\tau)}

where $\mathrm{sim}$ denotes dot-product or cosine similarity; the symmetric loss averages this image→text term with its text→image counterpart, in which the denominator instead sums over images $v_j$.

  • Wukong adopts "token-wise" similarity as in FILIP and a learned token-reduction layer for cost-effective alignment (Gu et al., 2022).
  • ERNIE-ViL 2.0 introduces multi-view contrastive learning, constructing multiple textual, visual, and object-tag views for enriched intra-modal and inter-modal invariance (Shan et al., 2022).
  • Chinese CLIP applies a two-stage regime: first, "locked-image" tuning (LiT) with the vision tower frozen, then full vision–language contrastive tuning (Yang et al., 2022).
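The symmetric InfoNCE objective above can be sketched numerically. The following is a minimal NumPy illustration (function name and the temperature default are illustrative, not any paper's implementation):

```python
import numpy as np

def symmetric_infonce(img_emb: np.ndarray, txt_emb: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of N aligned image-text embedding pairs."""
    # L2-normalize so the dot product equals cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau                    # (N, N); matched pairs on the diagonal

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(logp)))     # negative log-prob of the true match

    # Image->text uses rows as queries; text->image uses the transpose.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Well-aligned batches yield a lower loss than mismatched ones, which is exactly the signal the contrastive pre-training exploits.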

Unified Generative Models:

  • ERNIE-ViLG unifies text-to-image and image-to-text modeling in a single 48-layer, 10B-param transformer, casting both as conditional autoregressive generation problems over joint sequence spaces of visual (VQGAN-tokenized) and text tokens. No auxiliary contrastive or matching classification loss is applied. Cross-modal alignment emerges via joint modeling over massive, Chinese-centric data (Zhang et al., 2021).
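As a rough sketch of how a unified generative model serializes both directions into one autoregressive stream, the toy function below builds the joint token sequence. All token ids, offsets, and the function name are hypothetical; a real system uses a trained text tokenizer and VQGAN codebook ids:

```python
# Hypothetical special-token ids and vocabulary offsets (illustrative only).
BOS, BOI, EOS = 0, 1, 2        # begin-of-sequence, boundary marker, end-of-sequence
TEXT_OFFSET = 10               # text vocab starts here
IMAGE_OFFSET = 50_000          # image (VQ) codebook ids shifted into a disjoint range

def to_joint_sequence(text_ids, image_codes, text_to_image=True):
    """Serialize a pair into one token stream for conditional autoregressive training.

    text->image: [BOS, text..., BOI, image..., EOS]  (loss computed on the image span)
    image->text: [BOS, image..., BOI, text..., EOS]  (loss computed on the text span)
    """
    text = [TEXT_OFFSET + t for t in text_ids]
    image = [IMAGE_OFFSET + c for c in image_codes]
    cond, target = (text, image) if text_to_image else (image, text)
    return [BOS] + cond + [BOI] + target + [EOS]
```

Because both tasks share one sequence space, a single transformer learns cross-modal alignment purely from next-token prediction, with no auxiliary contrastive head.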

Knowledge Distillation and Lightweight Models:

  • Recent lightweight designs (e.g., DC-CLIP) adopt a ResNet-50 + MiniLMv2 student distilled from teacher vision/text encoders (typically AltCLIP) and align via a two-stage regime of feature map distillation and locked-image contrastive learning (Zhang et al., 2024). These methods substantially reduce model size (~301M params) and inference footprint while maintaining acceptable performance in Chinese benchmarks.
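At its core, feature-map distillation of this kind reduces to a regression loss between projected student features and frozen teacher features. A minimal sketch, where the shapes and the learnable linear projection bridging the dimensionality gap are illustrative assumptions:

```python
import numpy as np

def feature_distill_loss(student_feat: np.ndarray,
                         teacher_feat: np.ndarray,
                         proj: np.ndarray) -> float:
    """MSE between linearly projected student features and frozen teacher features.

    student_feat: (N, d_s) activations from the small student encoder
    teacher_feat: (N, d_t) activations from the frozen teacher encoder
    proj:         (d_s, d_t) learnable projection matching the teacher's width
    """
    projected = student_feat @ proj
    return float(np.mean((projected - teacher_feat) ** 2))
```

In training, gradients flow only into the student and the projection; the teacher stays fixed, which is what keeps the regime cheap.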

Video-Language Architectures:

  • Modular designs (e.g., mPLUG-2, mPLUG-video) use TimeSformer video encoders, learnable visual-abstractor tokens, and decoder-only LLMs (e.g., frozen Bloomz) for compositional captioning/classification, with the majority of parameters frozen in instruction-tuned settings (Xu et al., 2023).

3. Pre-training Strategies, Tokenization, and Data Processing

Optimal pre-training for Chinese VLP requires tailored pipeline designs and tokenization choices:

  • Tokenization: Character-level spaced WordPiece outperforms word-level (e.g., jieba-based) tokenization in cross-modal Chinese transformers, as it enables finer visual–textual alignment (Gu et al., 2022). Modern vocabularies span >20,000 tokens to capture character, subword, and phrase granularity.
  • Pre-training regimes:
    • Locked-image tuning (LiT): Freeze vision encoder from mature English CLIP or FILIP; align a fresh text tower via contrastive loss using Chinese captions, then unfreeze for full joint tuning (Yang et al., 2022, Gu et al., 2022).
    • Progressive multi-view contrastive: ERNIE-ViL 2.0’s regime enriches supervision via augmentations (SimCLR), textual dropout (SimCSE), and object-tag extraction, further mitigating noise in Chinese web data (Shan et al., 2022).
    • Feature-map distillation: Lightweight students trained on intermediate/terminal feature representation loss against teachers, enabling edge deployment (Zhang et al., 2024).
  • Data curation: Multi-pass filtering is now the norm: HTML/text normalization, language ID, safety/adult-content filtration, redundancy elimination (union-find, CLIP-embedding thresholding), and explicit cross-modal similarity range selection (Shen et al., 15 Jan 2026). Rigorous alignment filtering and deduplication are empirically linked to improved generalization and robustness, including for up-to-date concept learning (e.g., emerging Chinese memes/products post-2024).
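The similarity-range selection and union-find deduplication steps above can be sketched as follows. The thresholds, the `curate` helper, and the assumption of precomputed duplicate edges are illustrative, not the actual DanQing pipeline:

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def curate(scores, dup_pairs, n, lo=0.20, hi=0.95):
    """Keep pairs whose cross-modal alignment score falls inside [lo, hi]
    (range selection), then collapse near-duplicate clusters built from
    precomputed duplicate edges, keeping one representative per cluster."""
    parent = list(range(n))
    for a, b in dup_pairs:                 # union duplicate pairs into clusters
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    kept, seen_roots = [], set()
    for i, s in enumerate(scores):
        root = find(parent, i)
        if lo <= s <= hi and root not in seen_roots:
            seen_roots.add(root)
            kept.append(i)
    return kept
```

Range selection discards both clearly misaligned pairs (too low) and out-of-range outliers (too high), while the cluster pass removes redundancy without re-embedding the corpus.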

4. Empirical Benchmarks and Evaluation

Chinese VLP is evaluated on both cross-modal retrieval and classification via zero-shot and fine-tuned protocols:

Zero-Shot Image Classification:

  • On a diverse suite (e.g., Caltech101, CIFAR-10, DTD, Food101, Flowers102, Stanford Cars), average top-1 accuracy reaches 73.0% with Wukong_ViT-L (Gu et al., 2022); SigLIP2-B/32 trained on DanQing reaches 65.4% (vs. 63.5% for Wukong and 57.8% for English LAION) (Shen et al., 15 Jan 2026); and the ViT-H/14 variant of Chinese CLIP reaches 62.3% (Yang et al., 2022).

Cross-Modal Image–Text Retrieval:

  • Recall@1 for Text→Image on MUGE is 54.8% for DanQing (SigLIP2-B/32) vs. 38.3% for Wukong (Shen et al., 15 Jan 2026).
  • AIC-ICC mean recall reaches 71.6% for Wukong_ViT-L (zero-shot) and 40.6% for ERNIE-ViL 2.0, both outperforming prior SOTA by 5–12 percentage points (Gu et al., 2022, Shan et al., 2022).
  • Fine-tuned retrieval metrics show further gains, with Chinese CLIP ViT-H/14 achieving Text→Image R@1 = 81.5% on COCO-CN (Yang et al., 2022).

Video–Language Tasks:

  • On human-annotated Youku-mPLUG benchmarks, mPLUG-video reaches 80.5% Top-1 category accuracy and sets a new CIDEr SOTA (68.9) in video captioning; pre-training improves classification accuracy by up to +23.1% (Xu et al., 2023).

5. Innovations, Insights, and Current Limitations

Key Chinese VLP innovations include:

  • Dataset freshness (DanQing): Temporal recency (2024–25 data) yields superior generalization to emerging Chinese concepts and events (Shen et al., 15 Jan 2026).
  • Multi-view supervision (ERNIE-ViL 2.0): Combining multiple textual/visual object-tag views outperforms single-view dual-encoders by up to +2.3 Recall@K (Shan et al., 2022).
  • Locked-image tuning: Efficiently adapts high-resource English vision towers to Chinese tasks with minimal training disruption (Yang et al., 2022, Gu et al., 2022).
  • Feature-map distillation: Enables deployment of competitive Chinese/English VLP models on memory-constrained hardware (Zhang et al., 2024).
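A simplified reading of the multi-view supervision idea averages a contrastive loss over every (visual view, textual view) pairing, so augmented images, dropout-perturbed captions, and object-tag sequences all supervise one shared embedding space. This sketch omits ERNIE-ViL 2.0's intra-modal terms, and all names are illustrative:

```python
import numpy as np

def _info_nce(a: np.ndarray, b: np.ndarray, tau: float = 0.07) -> float:
    """One-directional InfoNCE between two batches of aligned embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))

def multi_view_loss(visual_views, text_views, tau: float = 0.07) -> float:
    """Average contrastive loss over all cross-modal view pairings."""
    losses = [_info_nce(v, t, tau) for v in visual_views for t in text_views]
    return float(np.mean(losses))
```

Averaging over view pairs means every sample contributes several noisy-but-correlated supervision signals, which is what helps against noisy web captions.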

Persisting limitations include:

  • Token-reduction granularity trade-offs: Fine-grained FILIP-style alignment boosts patch–word matching but incurs O(n₁·n₂) cost; reduction layers mitigate this but may limit ultimate alignment quality (Gu et al., 2022).
  • Dual-encoder ceiling: Pooling undermines complex reasoning and fine-grained token–region alignment compared to cross-encoder alternatives (Shan et al., 2022).
  • Generative instability: ERNIE-ViLG’s unified generative transformer, while advancing bidirectional capacity, suffers instability at extreme scale for end-to-end text→image (Zhang et al., 2021).
  • Data scale plateau: While expansion beyond 100M pairs continues to help, returns may slow beyond certain thresholds absent further innovation in domain coverage or multimodal alignment (Yang et al., 2022, Shan et al., 2022).
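The token-wise similarity whose O(n₁·n₂) cost motivates reduction layers can be sketched as a late-interaction score: each word attends to its best-matching patch and vice versa. A minimal NumPy version, illustrative rather than FILIP's exact formulation:

```python
import numpy as np

def tokenwise_similarity(patch_emb: np.ndarray, word_emb: np.ndarray) -> float:
    """Late-interaction score between one image (patch embeddings) and one
    caption (word embeddings). Cost is O(n_patches * n_words) per pair."""
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    sim = p @ w.T                        # (n_patches, n_words) cosine similarities
    img2txt = sim.max(axis=1).mean()     # each patch's best-matching word
    txt2img = sim.max(axis=0).mean()     # each word's best-matching patch
    return float(0.5 * (img2txt + txt2img))
```

Because the full matrix must be materialized for every candidate pair, retrieval over large corpora quickly becomes expensive, which is exactly where learned token-reduction layers trade granularity for throughput.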

6. Future Directions

Ongoing and anticipated areas of development include:

  • Extending multi-view/object-tag approaches to cross-encoder architectures for richer token–patch reasoning (Shan et al., 2022).
  • Scalable, fully end-to-end bidirectional generative pre-training harnessing stabilized VQ frameworks and adaptive masking (Zhang et al., 2021).
  • Incorporation of additional modalities: Audio, video, and scene text recognition to build general-purpose Chinese multimodal LLMs (Xu et al., 2023, Shen et al., 15 Jan 2026).
  • Dynamic view weighing and sample-specific supervision composition for improved alignment in noisy or long-tail Chinese data (Shan et al., 2022).
  • Efficient, lightweight VLP models for edge deployment with feature map distillation and mixed-modality compression (Zhang et al., 2024).
  • Integration with advanced LMM pipelines for zero-shot instruction-following and open-domain question answering in Chinese (Xu et al., 2023, Shen et al., 15 Jan 2026).
  • Temporal and dialectal robustness—actively curating datasets with high coverage of contemporary events, regional expressions, and emerging Chinese subcultures (Shen et al., 15 Jan 2026).

Chinese vision-language pre-training, driven by continual advances in dataset construction, alignment strategies, and scalable contrastive/generative objectives, anchors a new wave of language-centric multimodal AI for the Sinosphere and provides templates for generalizing VLP beyond English.
