Chinese Short Video Datasets Overview
- Chinese short video datasets are curated collections capturing multimodal content from platforms like Douyin and Bilibili, offering rich behavioral and social attributes.
- They integrate diverse modalities—video, audio, text, and user metadata—to support tasks such as retrieval, captioning, fake news detection, and temporal reasoning.
- These datasets empower practical research in computer vision, social analytics, and propagation studies, underpinned by detailed benchmarks and cross-platform insights.
Chinese short video datasets are curated corpora that systematically capture the multimodal, behavioral, and social attributes of short-form video content circulating on Chinese platforms such as Douyin, Kuaishou, Bilibili, WeChat Channel, and others. These datasets enable rigorous computational research in computer vision, multimodal machine learning, social science, behavioral analytics, and content propagation. With the rapid evolution of Chinese short video ecosystems, these resources now encompass large-scale annotated videos, rich user and content metadata, multi-platform propagation graphs, and specialized benchmarks supporting downstream tasks from retrieval and captioning to fake news detection, hate speech analysis, audio-visual speech recognition (AVSR), and temporally structured video reasoning.
1. Dataset Classes and Collection Paradigms
Chinese short video datasets fall into several distinct classes according to their target application, collection scope, and annotation methodology:
- Behavioral-Content-Attribute Datasets: Large-scale corpora with granular user feedback (e.g., likes, watches, follows), comprehensive user demographic/device attributes, and hierarchical video metadata. For example, the dataset from (Shang et al., 9 Feb 2025) comprises 153,561 videos, 10,000 users, and multimodal features including ASR transcripts and 3-level semantic categories.
- Multimodal Video-Language Datasets: Massive databases such as Alivol-10M (Lei et al., 2021), containing >10 million professionally produced micro-videos from e-commerce platforms, annotated with high-quality human-written descriptions, multi-level categorization, and auxiliary product images.
- Retrieval and Captioning Benchmarks: Datasets like CREATE (Zhang et al., 2022) and CBVS (Qiao et al., 19 Jan 2024) center on short-video search and captioning, providing millions of video-title/caption-query pairs, fine-grained tags, and expert-tier manually annotated benchmarks with multimodal annotations.
- Fake News, Hate Speech, and Sentiment Benchmarks: Corpora such as FakeSV (Qi et al., 2022, Bu et al., 23 Jul 2024) and MultiHateClip (Wang et al., 28 Jul 2024) offer fine-grained labels, including veracity (fake/real/other) and hate speech (hateful/offensive/normal) categories, and include video, transcript, comments, publisher profiles, and segment- and modality-level contextual labels.
- Propagation and Social Influence Networks: XS-Video (Xue et al., 31 Mar 2025) is unique in its cross-platform network scope (117,720 videos from 5 major platforms), integrating full engagement indicators (views, likes, shares, collects, comments, comment content) and long-term propagation influence rating (0–9 scale) aligned across platforms.
- Structured Temporal Reasoning Benchmarks: ShortVid-Bench (Ge et al., 28 Jul 2025) introduces a deeply annotated evaluation suite for structured comprehension, covering temporal, affective, narrative, creative, and reasoning axes, built on millions of real-world Chinese short videos with timestamped multimodal labels.
These classes can be further subdivided by annotation depth (automatic/manual), modality coverage (video, audio, text, user profile, speech transcript), and granularity (per-frame, per-clip, event-level).
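To make these axes concrete, the sketch below shows one way a per-video record might combine modality coverage, annotation granularity, and behavioral attributes. All field names here are hypothetical illustrations and do not follow any specific dataset's documented schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ShortVideoRecord:
    """Hypothetical per-video record combining the modality and
    annotation axes discussed above; field names are illustrative."""
    video_id: str
    platform: str                                   # e.g. "douyin", "bilibili"
    duration_s: float
    categories: list[str] = field(default_factory=list)  # hierarchical, coarse-to-fine
    title: Optional[str] = None                     # user-facing title
    asr_transcript: Optional[str] = None            # automatic speech transcript
    ocr_text: Optional[str] = None                  # overlaid subtitles/captions
    clip_boundaries: list[tuple[float, float]] = field(default_factory=list)  # per-clip granularity
    # Engagement counters (behavioral attributes)
    views: int = 0
    likes: int = 0
    shares: int = 0
    comments: int = 0
```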
2. Annotation Methodologies and Data Attributes
Annotation methods in Chinese short video datasets integrate both automated and manual protocols, spanning multiple modalities and attribute types.
- Visual Content: Frame-level and video-level annotation using models such as ViT, RepVGG, and InternVL-2.5 for visual feature extraction, scene segmentation, and OCR of overlaid text.
- Audio: ASR systems (Whisper-v3, SenseVoice, etc.) provide transcript alignment; audio emotion classification via wav2vec2 and log-mel spectrograms; multi-speaker/multi-language separation.
- Textual Metadata: Titles (catchy, user-facing), captions (informative, factual), tags (spanning multiple parts of speech and a domain taxonomy), OCR-extracted subtitles, and user-generated queries.
- Behavioral/User-Side Attributes: Gender, age, device model, city level, community type; engagement time series (views, comments, likes).
- Interaction Indices for Propagation Analysis: Views, likes, shares, collects, fans, comments, and textual content from interactions, enabling network graph representation.
- Temporal and Multi-dimensional Labels: Timestamped captions, multi-granular event segmentation, and chain-of-thought (CoT) rationales.
- Expert Benchmarking: MCQ and open-ended QA for comprehension, distractor design for discriminative evaluation, and inter-annotator agreement measurement (Cohen’s/Fleiss’ kappa; a minimal kappa computation is sketched below).
Specialized annotation is applied for domain tasks (fake news: event-based query mining and dual-annotator verification; hate speech: a segment/victim/modality label lattice; coreference: synchronized bounding-box-to-mention mappings; emotion: PANAS-trained crowdsourced annotation).
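As noted under expert benchmarking, agreement between two annotators is commonly reported with Cohen’s kappa; a minimal self-contained computation, assuming both annotators label the same items:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the chance agreement from each annotator's label marginals.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Example: two annotators labeling videos as fake/real/other
a = ["fake", "real", "real", "other", "fake"]
b = ["fake", "real", "other", "other", "fake"]
print(f"kappa = {cohens_kappa(a, b):.3f}")  # 0.706
```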
3. Representative Datasets and Their Technical Features
Table summarizing key datasets:
| Dataset | #Videos | Task Focus | Modalities | Unique Features |
|---|---|---|---|---|
| ShortVideo (Shang et al., 9 Feb 2025) | 153,561 | Behavior, recommendation | Video, ASR, user, tags | Hierarchical category, raw video files, explicit/implicit feedback |
| Alivol-10M (Lei et al., 2021) | 10.3M | Video-language pretrain | Video, text, categories | E-commerce provenance, professional annotation |
| CREATE (Zhang et al., 2022) | 210K/3M/10M | Retrieval/captioning | Video, title, caption, tags | Title vs caption separation, multi-worker validation |
| CBVS (Qiao et al., 19 Jan 2024) | 5M/10M/20K | Cover-based search | Cover-image, OCR, query/title | User-originated covers, manual annotation for benchmark |
| FakeSV (Qi et al., 2022) | 11.6K | Fake news detection | Video, title, transcript, comments, publisher | Multimodal, event-based structure |
| XS-Video (Xue et al., 31 Mar 2025) | 117,720 | Propagation graph | Video, engagement, comments | Cross-platform, influence ratings, graph structure |
| ShortVid-Bench (Ge et al., 28 Jul 2025) | 4.5M (+ human-annotated eval) | Temporal reasoning | Video, audio, text, MCQ | Timestamped grounding, CoT, structured QA |
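For orientation across these corpora, a minimal sketch of loading and grouping such a dataset, assuming a hypothetical JSONL manifest with a `platform` field (actual file layouts and field names differ per dataset; consult each repository's schema):

```python
import json
from collections import defaultdict

def load_manifest(path):
    """Load a hypothetical JSONL manifest (one video record per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def group_by(records, key):
    """Group records by an arbitrary metadata field."""
    groups = defaultdict(list)
    for r in records:
        groups[r.get(key, "unknown")].append(r)
    return groups

records = load_manifest("train.jsonl")        # hypothetical path
by_platform = group_by(records, "platform")   # hypothetical field name
for platform, vids in sorted(by_platform.items()):
    print(platform, len(vids))
```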
4. Benchmarking, Evaluation Metrics, and Task Coverage
Chinese short video datasets are linked to a range of downstream and benchmarking tasks. Examples include:
- Recommendation: Recall@K, NDCG@K, MAP, evaluated with mainstream collaborative-filtering and multimodal recommendation models (LightGCN, VBPR, BM3) (Shang et al., 9 Feb 2025).
- Retrieval/Cross-modal Matching: SumR (R@1+R@5+R@10), mean recall, positive-to-negative ratio (PNR), NDCG, with datasets such as CBVS and ChinaOpen (Qiao et al., 19 Jan 2024, Chen et al., 2023); a minimal sketch of these ranked-list metrics follows this list.
- Captioning/Title Generation: CIDEr, BLEU-4, ROUGE-L, content/subjective splits (Zhang et al., 2022).
- Fake News Detection/Analysis: Macro-F1, accuracy, event-based split, temporal propagation metrics, ablation by textual/visual/metadata features (Qi et al., 2022, Bu et al., 23 Jul 2024).
- Emotion Analysis: Accuracy, F1, kappa agreement, facial/video/audio/text fusion; tested on MSEVA/bili-news (Wei et al., 2023).
- Keyword Spotting (KWS)/ASR: F1, ATWV, and CER (for AVSR), with analyses of the contribution of lip-reading and visual slide context (Yuan et al., 2021, Zhao et al., 21 Apr 2025); a CER sketch closes this section.
- Coreference Resolution: MUC, B³, CEAF_φ4, Recall@K for multimodal clusters (Li et al., 19 Apr 2025).
- Propagation Influence: Regression/classification of influence levels (0–9); graph-based modeling evaluated with NetGPT (Xue et al., 31 Mar 2025).
- Structured Comprehension/Reasoning: Accuracy, mIoU for temporal grounding, MCQ discrimination rate, and narrative/affective/intent axis evaluation (Ge et al., 28 Jul 2025).
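Several of the retrieval and recommendation metrics above share a simple ranked-list form. The sketch below implements Recall@K, binary-relevance NDCG@K, and SumR (R@1+R@5+R@10) under the common single-ranking, set-of-relevant-items simplification; per-dataset definitions (averaging over queries, graded relevance) may differ:

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items appearing in the top-k of the ranking."""
    hits = sum(1 for i in ranked_ids[:k] if i in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: DCG over the top-k, normalized by ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, i in enumerate(ranked_ids[:k]) if i in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

def sum_r(ranked_ids, relevant_ids):
    """SumR = R@1 + R@5 + R@10, as reported in retrieval benchmarks."""
    return sum(recall_at_k(ranked_ids, relevant_ids, k) for k in (1, 5, 10))

# Toy example: one query whose ground-truth video is "v3"
ranking = [f"v{i}" for i in range(10)]
print(recall_at_k(ranking, {"v3"}, 5))          # 1.0
print(round(ndcg_at_k(ranking, {"v3"}, 5), 3))  # 0.431
print(sum_r(ranking, {"v3"}))                   # 0.0 + 1.0 + 1.0 = 2.0
```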
Pretraining and evaluation leverage both automated and expert benchmarks, with tasks tailored for real-world Chinese short video ecosystems.
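The character error rate used for ASR/AVSR above is the Levenshtein distance between the character sequences divided by the reference length; a minimal sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters, divided by
    the reference length (CER = (S + D + I) / N)."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming edit distance (substitutions, deletions, insertions)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(cer("今天天气很好", "今天天气不错"))  # 2 substitutions / 6 chars ≈ 0.333
```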
5. Socio-Technical Significance and Domain-Specific Properties
Chinese short video datasets uniquely reflect the sociocultural, behavioral, and technical landscape of the region's mobile video platforms:
- Platform Diversity: Unlike prior YouTube-centric benchmarks, these datasets are collected from Douyin, Kuaishou, WeChat Channel, Bilibili, Toutiao, and Xigua, ensuring coverage of the Chinese mobile internet's distinct user populations and content styles (Xue et al., 31 Mar 2025, Ge et al., 28 Jul 2025).
- User and Content Attribution: Fine-grained attributes (city level, device price, creator statistics, etc.) facilitate studies in social science, behavioral analytics, and network propagation (Shang et al., 9 Feb 2025, Xue et al., 31 Mar 2025).
- Cultural/Language Specificity: Annotation of idiomatic and meme-driven content (MultiHateClip, TikTalkCoref) and bilingual English/Chinese corpora (ChinaOpen, WanJuan) support cross-cultural and cross-lingual AI research (He et al., 2023, Chen et al., 2023).
- Event-Driven and Narrative Analysis: Benchmarks for process-centric tasks (FakingRecipe, ShortVid-Bench) explore narrative construction, temporal event segmentation, creative innovation, and narrative/intent comprehension (Bu et al., 23 Jul 2024, Ge et al., 28 Jul 2025).
- Real-time and Temporal Graphs: XS-Video enables propagation-graph modeling, short-video virality analysis, and SPIR prediction, applying temporal sampling and cross-platform alignment formulas (Xue et al., 31 Mar 2025); a schematic sketch follows this list.
- Multimodal Depth: Combining advanced video features (ResNet/ViT), ASR, OCR, lip/speech video, slide context, and user comments, these datasets facilitate fusion research in vision-language, audio-visual, and cross-modal modeling (CBVS, Chinese-LiPS, MSEVA).
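As a schematic of the propagation-graph idea above, the sketch below assembles video nodes with log-normalized engagement counters and directed interaction edges. Both the normalization and the graph layout are illustrative assumptions, not the XS-Video construction or its cross-platform alignment formula:

```python
import math

def normalize_engagement(counts: dict[str, int]) -> dict[str, float]:
    """Log-scale heavy-tailed engagement counters (views, likes, shares,
    collects, comments) so platforms of different scale are comparable.
    Illustrative choice only; XS-Video defines its own alignment."""
    return {k: math.log1p(v) for k, v in counts.items()}

class PropagationGraph:
    """Minimal sketch: video nodes carry engagement features, and
    directed edges record interactions (comments, repost chains)."""
    def __init__(self):
        self.nodes: dict[str, dict[str, float]] = {}
        self.edges: list[tuple[str, str, str]] = []  # (src, dst, relation)

    def add_video(self, video_id: str, engagement: dict[str, int]):
        self.nodes[video_id] = normalize_engagement(engagement)

    def add_interaction(self, src: str, dst: str, relation: str):
        self.edges.append((src, dst, relation))

g = PropagationGraph()
g.add_video("douyin:123", {"views": 50_000, "likes": 3_200, "shares": 410})
g.add_video("bilibili:abc", {"views": 8_000, "likes": 900, "shares": 35})
g.add_interaction("bilibili:abc", "douyin:123", "reposted_from")
print(g.nodes["douyin:123"])
```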
6. Access, Licensing, and Benchmarking Protocols
Most Chinese short video datasets are publicly available for academic research, subject to privacy-related usage restrictions and non-commercial licensing. Leading datasets provide URLs and documentation:
- ShortVideo/Behavioral: https://github.com/tsinghua-fib-lab/ShortVideo_dataset
- CBVS: https://github.com/QQBrowserVideoSearch/CBVS-UniCLIP
- FakeSV: https://github.com/ICTMCG/FakeSV
- XS-Video: https://github.com/LivXue/short-video-influence
- Chinese-LiPS: https://kiri0824.github.io/Chinese-LiPS/
- ChinaOpen: https://ruc-aimc-lab.github.io/ChinaOpen/
- WanJuan: https://opendatalab.org.cn/WanJuan1.0
Each corpus includes detailed schemas, data splits, annotation protocols, and benchmarking code for reproducible research; cross-references to statistical tables, evaluation formulas, and benchmark methodology are embedded in their papers.
7. Impact, Limitations, and Future Directions
Chinese short video datasets underpin much recent progress in recommender systems, multimodal models, social science, fake news/hate speech detection, graph learning, and advanced video reasoning for real-world deployment. Their strengths include scale (10M+ videos), attribute and interaction richness, multimodal annotation, and domain specificity.
Current limitations include:
- Limited scale for some tasks (emotion and fake news corpora remain small relative to platform volume)
- Expensive/complex manual annotation for multimodality (e.g., hate speech segment/victim/modality marking)
- Difficulties in cross-platform dynamic alignment and in privacy-preserving data release
- Underrepresentation of non-mainstream genres or platform nuances
Future work is expected to expand coverage, enhance semantic richness (e.g., finer temporal event segmentation, deeper narrative embeddings), and strengthen cross-lingual and multimodal pretraining for robust industrial and societal impact.
Chinese short video datasets now constitute a foundational resource for computational research, enabling granular modeling of behavioral, creative, affective, and propagation phenomena unique to China's digital video ecosystem, and providing representative, high-quality benchmarks for state-of-the-art and emerging AI techniques.