Chinese Short Video Datasets Overview
- Chinese short video datasets are curated collections capturing multimodal content from platforms like Douyin and Bilibili, offering rich behavioral and social attributes.
- They integrate diverse modalities—video, audio, text, and user metadata—to support tasks such as retrieval, captioning, fake news detection, and temporal reasoning.
- These datasets empower practical research in computer vision, social analytics, and propagation studies, underpinned by detailed benchmarks and cross-platform insights.
Chinese short video datasets are curated corpora that systematically capture the multimodal, behavioral, and social attributes of short-form video content circulating on Chinese platforms such as Douyin, Kuaishou, Bilibili, WeChat Channel, and others. These datasets enable rigorous computational research in computer vision, multimodal machine learning, social science, behavioral analytics, and content propagation. With the rapid evolution of Chinese short video ecosystems, these resources now encompass large-scale annotated videos, rich user and content metadata, multi-platform propagation graphs, and specialized benchmarks supporting downstream tasks from retrieval and captioning to fake news detection, hate speech analysis, audio-visual speech recognition (AVSR), and temporally structured video reasoning.
1. Dataset Classes and Collection Paradigms
Chinese short video datasets fall into several distinct classes according to their target application, collection scope, and annotation methodology:
- Behavioral-Content-Attribute Datasets: Large-scale corpora with granular user feedback (e.g., likes, watches, follows), comprehensive user demographic/device attributes, and hierarchical video metadata. For example, the dataset from (Shang et al., 9 Feb 2025) comprises 153,561 videos, 10,000 users, and multimodal features including ASR transcripts and 3-level semantic categories.
- Multimodal Video-Language Datasets: Massive databases such as Alivol-10M (Lei et al., 2021), containing >10 million professionally produced micro-videos from e-commerce platforms, annotated with high-quality human-written descriptions, multi-level categorization, and auxiliary product images.
- Retrieval and Captioning Benchmarks: Datasets like CREATE (Zhang et al., 2022) and CBVS (Qiao et al., 19 Jan 2024) center on short-video search and captioning, providing millions of video-title/caption-query pairs, fine-grained tags, and expert-tier manually annotated benchmarks with multimodal annotations.
- Fake News, Hate Speech, and Sentiment Benchmarks: Corpora such as FakeSV (Qi et al., 2022, Bu et al., 23 Jul 2024) and MultiHateClip (Wang et al., 28 Jul 2024) offer fine-grained labels, including veracity (fake/real/other) and hate speech (hateful/offensive/normal) categories, and include video, transcript, comments, publisher profiles, and segment- and modality-level contextual labels.
- Propagation and Social Influence Networks: XS-Video (Xue et al., 31 Mar 2025) is unique in its cross-platform network scope (117,720 videos from 5 major platforms), integrating full engagement indicators (views, likes, shares, collects, comments, comment content) and long-term propagation influence rating (0–9 scale) aligned across platforms.
- Structured Temporal Reasoning Benchmarks: ShortVid-Bench (Ge et al., 28 Jul 2025) introduces a deeply annotated evaluation suite for structured comprehension, covering temporal, affective, narrative, creative, and reasoning axes, built on millions of real-world Chinese short videos with timestamped multimodal labels.
These classes can be further subdivided by annotation depth (automatic/manual), modality coverage (video, audio, text, user profile, speech transcript), and granularity (per-frame, per-clip, event-level).
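To make these axes concrete, the sketch below shows one way a per-video record might combine modality coverage, annotation granularity, and behavioral attributes. All field names here are hypothetical illustrations and do not follow any specific dataset's documented schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ShortVideoRecord:
    """Hypothetical per-video record combining the modality and
    annotation axes discussed above; field names are illustrative."""
    video_id: str
    platform: str                                   # e.g. "douyin", "bilibili"
    duration_s: float
    categories: list[str] = field(default_factory=list)  # hierarchical, coarse-to-fine
    title: Optional[str] = None                     # user-facing title
    asr_transcript: Optional[str] = None            # automatic speech transcript
    ocr_text: Optional[str] = None                  # overlaid subtitles/captions
    clip_boundaries: list[tuple[float, float]] = field(default_factory=list)  # per-clip granularity
    # Engagement counters (behavioral attributes)
    views: int = 0
    likes: int = 0
    shares: int = 0
    comments: int = 0
```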
2. Annotation Methodologies and Data Attributes
Annotation methods in Chinese short video datasets integrate both automated and manual protocols, spanning multiple modalities and attribute types.
- Visual Content: Frame-level and video-level annotation using models such as ViT, RepVGG, and InternVL-2.5 for visual feature extraction, scene segmentation, and OCR of overlaid text.
- Audio: ASR systems (Whisper-v3, SenseVoice, etc.) provide transcript alignment; audio emotion classification via wav2vec2 and log-mel spectrograms; multi-speaker/multi-language separation.
- Textual Metadata: Titles (catchy, user-facing), captions (informative, factual), tags (spanning multiple parts of speech and a domain taxonomy), OCR-extracted subtitles, and user-generated queries.
- Behavioral/User-Side Attributes: Gender, age, device model, city level, community type; engagement time series (views, comments, likes).
- Interaction Indices for Propagation Analysis: Views, likes, shares, collects, fans, comments, and textual content from interactions, enabling network graph representation.
- Temporal and Multi-dimensional Labels: Timestamped captions, multi-granular event segmentation, and chain-of-thought (CoT) rationales.
- Expert Benchmarking: MCQ and open-ended QA for comprehension, distractor design for discriminative evaluation, and inter-annotator agreement measurement (Cohen’s/Fleiss’ kappa; a minimal kappa computation is sketched below).
Specialized annotation is applied for domain tasks (fake news: event-based query mining and dual-annotator verification; hate speech: a segment/victim/modality label lattice; coreference: synchronized bounding-box-to-mention mappings; emotion: PANAS-trained crowdsourced annotation).
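As noted under expert benchmarking, agreement between two annotators is commonly reported with Cohen’s kappa; a minimal self-contained computation, assuming both annotators label the same items:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the chance agreement from each annotator's label marginals.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Example: two annotators labeling videos as fake/real/other
a = ["fake", "real", "real", "other", "fake"]
b = ["fake", "real", "other", "other", "fake"]
print(f"kappa = {cohens_kappa(a, b):.3f}")  # 0.706
```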
3. Representative Datasets and Their Technical Features
Table summarizing key datasets:
| Dataset | #Videos | Task Focus | Modalities | Unique Features |
|---|---|---|---|---|
| ShortVideo (Shang et al., 9 Feb 2025) | 153,561 | Behavior, recommendation | Video, ASR, user, tags | Hierarchical category, raw video files, explicit/implicit feedback |
| Alivol-10M (Lei et al., 2021) | 10.3M | Video-language pretrain | Video, text, categories | E-commerce provenance, professional annotation |
| CREATE (Zhang et al., 2022) | 210K/3M/10M | Retrieval/captioning | Video, title, caption, tags | Title vs caption separation, multi-worker validation |
| CBVS (Qiao et al., 19 Jan 2024) | 5M/10M/20K | Cover-based search | Cover-image, OCR, query/title | User-originated covers, manual annotation for benchmark |
| FakeSV (Qi et al., 2022) | 11.6K | Fake news detection | Video, title, transcript, comments, publisher | Multimodal, event-based structure |
| XS-Video (Xue et al., 31 Mar 2025) | 117,720 | Propagation graph | Video, engagement, comments | Cross-platform, influence ratings, graph structure |
| ShortVid-Bench (Ge et al., 28 Jul 2025) | 4.5M (+ human-annotated eval) | Temporal reasoning | Video, audio, text, MCQ | Timestamped grounding, CoT, structured QA |
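For orientation across these corpora, a minimal sketch of loading and grouping such a dataset, assuming a hypothetical JSONL manifest with a `platform` field (actual file layouts and field names differ per dataset; consult each repository's schema):

```python
import json
from collections import defaultdict

def load_manifest(path):
    """Load a hypothetical JSONL manifest (one video record per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def group_by(records, key):
    """Group records by an arbitrary metadata field."""
    groups = defaultdict(list)
    for r in records:
        groups[r.get(key, "unknown")].append(r)
    return groups

records = load_manifest("train.jsonl")        # hypothetical path
by_platform = group_by(records, "platform")   # hypothetical field name
for platform, vids in sorted(by_platform.items()):
    print(platform, len(vids))
```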
4. Benchmarking, Evaluation Metrics, and Task Coverage
Chinese short video datasets are linked to a range of downstream and benchmarking tasks. Examples include:
- Recommendation: Recall@K, NDCG@K, MAP, evaluated with mainstream collaborative-filtering and multimodal recommendation models (LightGCN, VBPR, BM3) (Shang et al., 9 Feb 2025).
- Retrieval/Cross-modal Matching: SumR (R@1+R@5+R@10), mean recall, positive-to-negative ratio (PNR), NDCG, with datasets such as CBVS and ChinaOpen (Qiao et al., 19 Jan 2024, Chen et al., 2023); a minimal sketch of these ranked-list metrics follows this list.
- Captioning/Title Generation: CIDEr, BLEU-4, ROUGE-L, content/subjective splits (Zhang et al., 2022).
- Fake News Detection/Analysis: Macro-F1, accuracy, event-based split, temporal propagation metrics, ablation by textual/visual/metadata features (Qi et al., 2022, Bu et al., 23 Jul 2024).
- Emotion Analysis: Accuracy, F1, kappa agreement, facial/video/audio/text fusion; tested on MSEVA/bili-news (Wei et al., 2023).
- Keyword Spotting (KWS)/ASR: F1, ATWV, and CER (for AVSR), with analyses of the contribution of lip-reading and visual slide context (Yuan et al., 2021, Zhao et al., 21 Apr 2025); a CER sketch closes this section.
- Coreference Resolution: MUC, B³, CEAF_φ4, Recall@K for multimodal clusters (Li et al., 19 Apr 2025).
- Propagation Influence: Regression/classification of influence levels (0–9); graph-based modeling evaluated with NetGPT (Xue et al., 31 Mar 2025).
- Structured Comprehension/Reasoning: Accuracy, mIoU for temporal grounding, MCQ discrimination rate, and narrative/affective/intent axis evaluation (Ge et al., 28 Jul 2025).
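Several of the retrieval and recommendation metrics above share a simple ranked-list form. The sketch below implements Recall@K, binary-relevance NDCG@K, and SumR (R@1+R@5+R@10) under the common single-ranking, set-of-relevant-items simplification; per-dataset definitions (averaging over queries, graded relevance) may differ:

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items appearing in the top-k of the ranking."""
    hits = sum(1 for i in ranked_ids[:k] if i in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: DCG over the top-k, normalized by ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, i in enumerate(ranked_ids[:k]) if i in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

def sum_r(ranked_ids, relevant_ids):
    """SumR = R@1 + R@5 + R@10, as reported in retrieval benchmarks."""
    return sum(recall_at_k(ranked_ids, relevant_ids, k) for k in (1, 5, 10))

# Toy example: one query whose ground-truth video is "v3"
ranking = [f"v{i}" for i in range(10)]
print(recall_at_k(ranking, {"v3"}, 5))          # 1.0
print(round(ndcg_at_k(ranking, {"v3"}, 5), 3))  # 0.431
print(sum_r(ranking, {"v3"}))                   # 0.0 + 1.0 + 1.0 = 2.0
```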
Pretraining and evaluation leverage both automated and expert benchmarks, with tasks tailored for real-world Chinese short video ecosystems.
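The character error rate used for ASR/AVSR above is the Levenshtein distance between the character sequences divided by the reference length; a minimal sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters, divided by
    the reference length (CER = (S + D + I) / N)."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming edit distance (substitutions, deletions, insertions)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(cer("今天天气很好", "今天天气不错"))  # 2 substitutions / 6 chars ≈ 0.333
```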
5. Socio-Technical Significance and Domain-Specific Properties
Chinese short video datasets uniquely reflect the sociocultural, behavioral, and technical landscape of the region's mobile video platforms:
- Platform Diversity: Unlike prior YouTube-centric benchmarks, these datasets are collected from Douyin, Kuaishou, WeChat Channel, Bilibili, Toutiao, and Xigua, ensuring coverage of the Chinese mobile internet's distinct user populations and content styles (Xue et al., 31 Mar 2025, Ge et al., 28 Jul 2025).
- User and Content Attribution: Fine-grained attributes (city level, device price, creator statistics, etc.) facilitate studies in social science, behavioral analytics, and network propagation (Shang et al., 9 Feb 2025, Xue et al., 31 Mar 2025).
- Cultural/Language Specificity: Annotation of idiomatic and meme-driven content (MultiHateClip, TikTalkCoref) and bilingual English/Chinese corpora (ChinaOpen, WanJuan) support cross-cultural and cross-lingual AI research (He et al., 2023, Chen et al., 2023).
- Event-Driven and Narrative Analysis: Benchmarks for process-centric tasks (FakingRecipe, ShortVid-Bench) explore narrative construction, temporal event segmentation, creative innovation, and narrative/intent comprehension (Bu et al., 23 Jul 2024, Ge et al., 28 Jul 2025).
- Real-time and Temporal Graphs: XS-Video enables propagation-graph modeling, short-video virality analysis, and SPIR prediction, applying temporal sampling and cross-platform alignment formulas (Xue et al., 31 Mar 2025); a schematic sketch follows this list.
- Multimodal Depth: Combining advanced video features (ResNet/ViT), ASR, OCR, lip/speech video, slide context, and user comments, these datasets facilitate fusion research in vision-language, audio-visual, and cross-modal modeling (CBVS, Chinese-LiPS, MSEVA).
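As a schematic of the propagation-graph idea above, the sketch below assembles video nodes with log-normalized engagement counters and directed interaction edges. Both the normalization and the graph layout are illustrative assumptions, not the XS-Video construction or its cross-platform alignment formula:

```python
import math

def normalize_engagement(counts: dict[str, int]) -> dict[str, float]:
    """Log-scale heavy-tailed engagement counters (views, likes, shares,
    collects, comments) so platforms of different scale are comparable.
    Illustrative choice only; XS-Video defines its own alignment."""
    return {k: math.log1p(v) for k, v in counts.items()}

class PropagationGraph:
    """Minimal sketch: video nodes carry engagement features, and
    directed edges record interactions (comments, repost chains)."""
    def __init__(self):
        self.nodes: dict[str, dict[str, float]] = {}
        self.edges: list[tuple[str, str, str]] = []  # (src, dst, relation)

    def add_video(self, video_id: str, engagement: dict[str, int]):
        self.nodes[video_id] = normalize_engagement(engagement)

    def add_interaction(self, src: str, dst: str, relation: str):
        self.edges.append((src, dst, relation))

g = PropagationGraph()
g.add_video("douyin:123", {"views": 50_000, "likes": 3_200, "shares": 410})
g.add_video("bilibili:abc", {"views": 8_000, "likes": 900, "shares": 35})
g.add_interaction("bilibili:abc", "douyin:123", "reposted_from")
print(g.nodes["douyin:123"])
```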
6. Access, Licensing, and Benchmarking Protocols
Most Chinese short video datasets are publicly available for academic research, subject to privacy-related usage restrictions and non-commercial licensing. Leading datasets provide URLs and documentation:
- ShortVideo/Behavioral: https://github.com/tsinghua-fib-lab/ShortVideo_dataset
- CBVS: https://github.com/QQBrowserVideoSearch/CBVS-UniCLIP
- FakeSV: https://github.com/ICTMCG/FakeSV
- XS-Video: https://github.com/LivXue/short-video-influence
- Chinese-LiPS: https://kiri0824.github.io/Chinese-LiPS/
- ChinaOpen: https://ruc-aimc-lab.github.io/ChinaOpen/
- WanJuan: https://opendatalab.org.cn/WanJuan1.0
Each corpus includes detailed schemas, data splits, annotation protocols, and benchmarking code for reproducible research; cross-references to statistical tables, evaluation formulas, and benchmark methodology are embedded in their papers.
7. Impact, Limitations, and Future Directions
Chinese short video datasets underpin much recent progress in recommender systems, multimodal models, social science, fake news/hate speech detection, graph learning, and advanced video reasoning for real-world deployment. Their strengths include scale (10M+ videos), attribute and interaction richness, multimodal annotation, and domain specificity.
Current limitations include:
- Limited scale for some tasks (emotion and fake news corpora remain small relative to platform volume)
- Expensive/complex manual annotation for multimodality (e.g., hate speech segment/victim/modality marking)
- Difficulties in cross-platform dynamic alignment and in privacy-preserving data release
- Underrepresentation of non-mainstream genres or platform nuances
Future work is expected to expand coverage, enhance semantic richness (e.g., finer temporal event segmentation, deeper narrative embeddings), and strengthen cross-lingual and multimodal pretraining for robust industrial and societal impact.
Chinese short video datasets now constitute a foundational resource for computational research, enabling granular modeling of behavioral, creative, affective, and propagation phenomena unique to China's digital video ecosystem, and providing representative, high-quality benchmarks for state-of-the-art and emerging AI techniques.