Chinese Short Video Dataset Overview

Updated 27 September 2025
  • Chinese short video datasets are comprehensive collections of multimodal, richly annotated videos from diverse platforms, supporting advanced research in AI and social computing.
  • They integrate both automated and manual annotation protocols to ensure high semantic granularity and robust quality control across various applications such as titling and retrieval.
  • These datasets enable effective pre-training and benchmarking for multimodal models, driving innovations in video captioning, hate speech detection, propagation modeling, and content moderation.

Chinese short video datasets form the empirical backbone for cutting-edge research in multimodal learning, video understanding, information retrieval, content moderation, social computing, and artificial intelligence. These datasets vary in focus—from video–language alignment and subjective titling to social propagation, fake news detection, emotion analysis, cover-based search, cross-cultural hate speech discrimination, and user–content behavioral modeling. Their construction, annotation, and deployment strategies reveal the complexity and diversity of real-world Chinese video ecosystems and inform the development of new models and evaluation protocols.

1. Major Publicly Documented Chinese Short Video Datasets

A diverse set of large-scale datasets has emerged in recent years to address the unique characteristics of Chinese short video content. These datasets are distinguished by source, multimodal annotation, scale, target application, and often by the technical innovations required for robust model pre-training.

| Dataset | Source/Modality | Size/Scope | Annotations/Labels |
| --- | --- | --- | --- |
| Alivol-10M | E-commerce platform | 10M videos | Titles, abstracts; plot (13), coarse (153), fine (11,529) categories (Lei et al., 2021) |
| CREATE | User-generated; manual/weakly labeled | 210K/3M/10M videos | Titles, captions, 51 categories, 50K+ tags (Zhang et al., 2022) |
| FakeSV | Douyin, Kuaishou, fact-checking sites | ~5.5K videos | Fake/real/debunked labels, user comments, publisher profiles (Qi et al., 2022) |
| ChinaOpen | Bilibili; webly annotated | 50K/1K videos | Titles, captions, tags (object/action/scene/user), bilingual (Chen et al., 2023) |
| CBVS | Bilibili, Tencent Video | 5M/10M/20K | Cover images/OCR text, user queries, 32 categories (Qiao et al., 19 Jan 2024) |
| MultiHateClip | Bilibili/YouTube | 2,000 videos | Fine-grained hateful/offensive/normal labels; gender group; modality (Wang et al., 28 Jul 2024) |
| ShortVideo_dataset | Real mobile platform | 153K videos | User–video interactions, explicit/implicit feedback, 3-layer categories (Shang et al., 9 Feb 2025) |
| XS-Video | 5 Chinese platforms | 117K videos | Multi-platform propagation levels (0–9), interactions (Xue et al., 31 Mar 2025) |
| TikTalkCoref | Douyin (TikTok-China) | 1,012 dialogs | Multimodal coreference clusters across text/visual (Li et al., 19 Apr 2025) |
| ARC-Hunyuan-Video | WeChat/TikTok/academic corpora | 4.5M videos | Descriptions, summaries, captions, OCR, audio–text pairs, temporally grounded (Ge et al., 28 Jul 2025) |
| CLV-HD | YouTube (water-ink animations) | 1,300 clips | Text descriptions tailored to landscape-painting style (Liu et al., 19 Apr 2024) |
| SSRA | Douyin search logs | 4-level annotations | Fine-grained relevance annotation for query–item pairs (Li et al., 20 Sep 2025) |

The above table synthesizes dataset scope. Not all datasets are publicly released, though many offer code or data access to researchers.

2. Annotation Protocols, Semantic Coverage, and Quality Control

Chinese short video datasets show a trend toward rich, multi-level annotation schemes, often involving both automated and manual strategies:

  • Alivol-10M implements multi-tier labels (plot/product) and pairs videos with manually written titles/abstracts to maximize semantic granularity and e-commerce relevance (Lei et al., 2021).
  • CREATE annotates videos with both factual captions and creative titles, while systematically tagging content with both coarse and fine-grained hierarchical labels for retrieval and titling (Zhang et al., 2022).
  • FakeSV annotates news authenticity across content and social context, covering user comments and publisher profiles, and introduces weighted aggregation of social signals (e.g., via attention to like counts; a pooling sketch appears at the end of this section) (Qi et al., 2022).
  • ChinaOpen employs automated cleaning (face-detection, OCR, syntactic parsing, object/action/scene recognition) followed by manual multi-modal annotation and bilingual translation, supporting open-set and subjective retrieval (Chen et al., 2023).
  • MultiHateClip deploys a four-step manual annotation protocol for hate speech segmentation, victim group classification, and modality-specific labeling, accounting for cross-cultural and multilingual differences (Wang et al., 28 Jul 2024).
  • ShortVideo_dataset records all major explicit feedback actions (like, hate, collect, comment) and implicit feedback (watch time), with comprehensive demographic and device metadata (Shang et al., 9 Feb 2025).
  • TikTalkCoref links text-based mention clusters with bounding boxes of faces in videos to support multimodal coreference (Li et al., 19 Apr 2025).
  • CBVS annotates query–cover pairs with graded relevance, and corrects OCR output via expert review (Qiao et al., 19 Jan 2024).
  • XS-Video standardizes cross-platform propagation metrics using mean squared percentage error alignment and heuristic thresholding, supporting propagation-level rating (Xue et al., 31 Mar 2025); a minimal thresholding sketch follows this list.
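
A plausible reading of the XS-Video protocol referenced above is: scale each platform's interaction counts toward a reference platform by minimizing mean squared percentage error, then bucket the aligned counts into levels 0–9 with heuristic thresholds. The threshold values and function names below are illustrative assumptions, not figures from Xue et al.

```python
import bisect

# Hypothetical level boundaries over a cross-platform-aligned engagement
# count; XS-Video's actual calibration is not reproduced here.
LEVEL_THRESHOLDS = [10, 100, 1_000, 5_000, 20_000, 100_000,
                    500_000, 2_000_000, 10_000_000]

def mspe(scaled: list[float], reference: list[float]) -> float:
    """Mean squared percentage error between platform-scaled counts and a
    reference platform's counts (reference values assumed nonzero)."""
    return sum(((s - r) / r) ** 2 for s, r in zip(scaled, reference)) / len(reference)

def propagation_level(engagement: float) -> int:
    """Map an aligned engagement count to a discrete propagation level 0-9."""
    return bisect.bisect_right(LEVEL_THRESHOLDS, engagement)
```

A per-platform scaling factor would be fit by minimizing mspe before thresholding, so that the same level denotes roughly the same reach on every platform.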

Annotation protocols directly shape suitability for downstream tasks such as retrieval, titling, recommendation, classification, hate speech detection, and multimodal reasoning.
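
The like-count-weighted aggregation described for FakeSV can be pictured as attention pooling over comment embeddings, where a comment's weight grows with its like count. This is a minimal sketch of the idea in PyTorch; the log-scaling and the function name are assumptions, not SV-FEND's exact formulation.

```python
import torch
import torch.nn.functional as F

def aggregate_comments(comment_emb: torch.Tensor,
                       like_counts: torch.Tensor) -> torch.Tensor:
    """Pool comment embeddings with weights derived from like counts.

    comment_emb: (num_comments, dim) embeddings of user comments.
    like_counts: (num_comments,) raw like counts per comment.
    """
    # Log-scale likes so a single viral comment does not dominate the pool.
    weights = F.softmax(torch.log1p(like_counts.float()), dim=0)
    return (weights.unsqueeze(-1) * comment_emb).sum(dim=0)
```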

3. Model Pre-training Strategies Enabled by Datasets

Large-scale and high-quality Chinese short video datasets are instrumental for the development of pre-trained multimodal models:

  • VICTOR leverages Alivol-10M for contrastive multimodal Transformer pre-training across eight proxy supervision tasks (masking, order modeling, dual alignment, intra-/inter-frame contrast), involving both reconstructive and contrastive losses (Lei et al., 2021).
  • ALWIG bases its retrieval/relevance alignment and GPT-augmented titling generation on CREATE, achieving two-stream alignment and generation via InfoNCE and auto-regressive cross-entropy (Zhang et al., 2022); see the InfoNCE sketch after this list.
  • SV-FEND fuses multiple modalities using cross-modal attention transformers and weighted self-attention over social context, forming the basis for improved fake news detection (Qi et al., 2022).
  • ChinaOpen validates several video captioning and retrieval backbones (GIT, BLIP-2, OFA, GVT); GVT introduces Visual Token Reduction for scaling temporal alignment (Chen et al., 2023).
  • UniCLIP employs auxiliary presence-guided and semantic-guided encoders to integrate cover text knowledge during training on CBVS, resolving the modality-missing problem for large-scale deployment (Qiao et al., 19 Jan 2024).
  • ARC-Hunyuan-Video fuses audio, visual, and textual signals at frame-level granularity, employs explicit timestamp overlays, and applies RL post-training (GRPO with a KL penalty and an IoU reward), yielding notable advances in temporal grounding, video summarization, and user engagement (Ge et al., 28 Jul 2025); a temporal-IoU sketch appears at the end of this section.
  • The SSRA pipeline synthesizes domain-adaptive queries with controllable four-level relevance for embedding training, using score-model filtering and LLM pairwise consistency checks (Li et al., 20 Sep 2025).
  • NetGPT in XS-Video aligns RGCN-encoded propagation graphs with LLM token spaces, employing three-stage training (graph pretraining, LLM alignment, prediction fine-tuning), and incorporates regression/classification of propagation influence (Xue et al., 31 Mar 2025).
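
The two-stream alignment cited for ALWIG rests on the standard symmetric InfoNCE objective over paired video/text embeddings. The sketch below is the textbook PyTorch formulation, not the authors' released code; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched video/text pairs.

    video_emb, text_emb: (batch, dim); row i of each is a matched pair.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```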

These frameworks showcase how dataset design impacts the breadth of multimodal learning (cross-modal retrieval, titling, grounding, coreference, hate speech detection, propagation modeling).
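
The IoU reward in ARC-Hunyuan-Video's GRPO post-training can be illustrated with temporal interval IoU between a predicted and a reference time span. Only the IoU term is sketched here; the function name and the example spans are assumptions rather than details from Ge et al.

```python
def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """IoU between a predicted and a reference time span, in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical grounding example: a predicted span of 12.0-18.5 s against a
# reference span of 11.0-19.0 s yields an IoU-based reward of 0.8125.
reward = temporal_iou((12.0, 18.5), (11.0, 19.0))
```

GRPO would combine such a reward with a KL penalty toward the reference policy, per the description above.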

4. Evaluation, Benchmarking, and Real-World Impact

Most datasets include robust, task-specific benchmarks:

  • VICTOR outperforms VideoBERT and UniVL across video retrieval, multi-level classification, recommendation, and captioning (using metrics such as R@10, AUC, BLEU, METEOR, ROUGE-L, CIDEr) (Lei et al., 2021).
  • CREATE is the foundation for extensive video titling and retrieval benchmarks, enabling evaluation against VATEX-zh and T-VTD (Zhang et al., 2022).
  • FakeSV’s SV-FEND achieves ~79% accuracy and F1, with ablations confirming the necessity of social context (Qi et al., 2022).
  • ChinaOpen supports open-set tagging and retrieval/subjectivity evaluation (SumR, BLEU4, METEOR, CIDEr) for both Chinese and English queries (Chen et al., 2023).
  • MultiHateClip demonstrates that multimodal fusion (text, audio, vision) substantially increases macro F1 and accuracy compared to unimodal models, but also highlights ambiguity in hate/offense distinction (Wang et al., 28 Jul 2024).
  • ShortVideo_dataset enables benchmarking of traditional and multimodal recommendation strategies (Recall@10/20, NDCG@10/20; see the metric sketch after this list) and supports quantitative analysis of filter-bubble phenomena (Shang et al., 9 Feb 2025).
  • NetGPT on XS-Video reports superior accuracy, MSE, and MAE for propagation influence rating (Xue et al., 31 Mar 2025).
  • ARC-Hunyuan-Video shows clear improvements in temporal localization (mIoU, ShortVid-Bench accuracy) and directly impacts click-through rates (CTR) and user engagement in production (Ge et al., 28 Jul 2025).
  • SSRA-supported embeddings on Douyin yield statistically significant gains in CTR (+1.45%), SRR (+4.9%), and IUPR (+0.1054%), validating fine-grained relevance control (Li et al., 20 Sep 2025).
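
Recall@K and NDCG@K, as used in the ShortVideo_dataset benchmarks above, have standard binary-relevance formulations; the sketch below is that textbook version, not any dataset's official evaluation script.

```python
import math

def recall_at_k(ranked_ids: list[int], relevant: set[int], k: int) -> float:
    """Fraction of a user's relevant items retrieved in the top-k ranking."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked_ids: list[int], relevant: set[int], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_ids[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```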

Such evaluations demonstrate both the practical value and the empirical rigor of these datasets for Chinese short video research.

5. Challenges, Cultural-Specificity, and Research Directions

Key challenges addressed in dataset and benchmark construction include:

  • Handling linguistic, stylistic, and subjective variations (e.g., titling in CREATE and ChinaOpen).
  • Ensuring domain-specific and rich label coverage (Alivol-10M hierarchical taxonomy; CBVS OCR text overlays).
  • Balancing label distribution and mitigating long-tail phenomena (SSRA relevance sampling; XS-Video cross-platform normalization).
  • Integrating social context and cross-modal signals (FakeSV; TikTalkCoref; MultiHateClip).
  • Bridging cultural gaps—e.g., gendered hate lexicons (Wang et al., 28 Jul 2024), or annotation consistency for subjective content.
  • Supporting cross-linguistic research (ChinaOpen, CBVS dual-language annotation).

These technical, linguistic, and social complexities continue to drive dataset and benchmark design. A plausible implication is that new datasets will incorporate even more granular behavioral, affective, and cultural attributes, and that audio–visual–textual fusion architectures will further benefit from training on such detailed corpora.

6. Access, Licensing, and Infrastructure

Many datasets and pre-trained model checkpoints are made openly available to the research community (or promise forthcoming releases) to promote reproducibility and model development, though researchers must verify licensing and data use agreements at their respective portals (e.g., WanJuan, ShortVideo_dataset, CBVS-UniCLIP).

Dataset size, detailed metadata, source diversity (e.g., commercial, user-generated, verified channels), and annotation standards are now sufficiently advanced to support a broad array of novel multimodal, social, and computational tasks specific to the Chinese short video paradigm.

7. Conclusion

Chinese short video datasets have transitioned from ad-hoc, limited collections to highly curated, richly annotated, multimodal corpora spanning millions of examples. These resources enable robust pre-training, domain adaptation, and evaluation for retrieval, classification, generation, propagation modeling, multimodal reasoning, and social understanding in Chinese contexts. Their technical properties—annotation protocols, quality control, semantic coverage, benchmark support, and real-world deployment—define the empirical standards for progress in the field and catalyze further interdisciplinary research across computer vision, natural language processing, social computing, and recommendation systems.
