Large-scale Mobile Short-video Dataset
- Large-scale mobile short-video datasets are purpose-built collections integrating user behavior, detailed video metadata, and raw content for a multifaceted view of digital engagement.
- They enable multimodal analysis by incorporating visual embeddings, ASR-generated transcripts, hierarchical content taxonomy, and extensive interaction logs.
- Rigorous technical validation and benchmarking confirm their utility in enhancing recommendation systems, user modeling, and computational social science studies.
Large-scale datasets for mobile short-video platforms are purpose-built collections that capture the multifaceted interaction, content, and user-attribute space of contemporary mobile-first short-video environments. These datasets serve the needs of computational social science, personalized recommendation, user modeling, and multimodal content analysis. Distinguished by their focus on behavioral logs, user demographics, hierarchical and multimodal content attributes, and fine-grained interaction types, such resources address limitations of existing public datasets, which typically lack comprehensive behavioral data, granular user descriptors, or raw video content.
1. Dataset Composition and Granularity
Large-scale mobile short-video datasets typically integrate three central data classes: user behavior, user/video attributes, and video content. A representative instantiation covers:
- Users: 10,000 voluntary, anonymized users (all aged 20 or older), characterized by gender, geography (city and city tier), and device information (model and price).
- Videos: 153,561 unique video assets (most shorter than one minute, with maximum durations exceeding five minutes), mapped to a three-level content ontology: Category I (37 superclasses, e.g., Sports), Category II (281 topics, e.g., Ball Game), Category III (382 subtypes, e.g., Hockey).
- Interactions: 1,019,568 user–video interaction records collected over six months, each including the action type (explicit: like, comment, follow, forward, collect, hate/dislike; implicit: watch time), user and video IDs, and a timestamp.
- Content Corpus: Raw video files totaling approximately 3.2 TB (3,998 hours); to respect copyright, access is mediated through the original video URLs, with full files retained for permitted research use.
| Entity | Count / Scope | Annotation Dimensions |
|---|---|---|
| Users | 10,000 | 6 attributes (demographics, geo, device) |
| Videos | 153,561 | 9 attributes (multi-level category, author, duration, title, tags) |
| Interactions | 1,019,568 | 7 behavior types (explicit/implicit), timestamps |
| Authors | 81,870 | Linked to video content |
The dataset's breadth is distinctive: it spans detailed behavioral logs, extended attribute dictionaries (373 cities, 5 city tiers, community type), and fine-grained video metadata and content. Its scale, in both asset volume and attribute coverage, is comparable to industry-scale resources such as the WeChat Channels and TikTok competition datasets.
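As an illustration of how these records are typically consumed, the following is a minimal sketch that loads a hypothetical CSV export of the interaction log and inspects per-user and per-video activity; the file name and column names are assumptions rather than the dataset's published schema.

```python
# Minimal sketch (assumed file name and column names; the released schema may differ).
import pandas as pd

# Each interaction record carries user/video IDs, an action type, watch time, and a timestamp.
interactions = pd.read_csv(
    "interactions.csv",
    usecols=["user_id", "video_id", "action_type", "watch_time", "timestamp"],
)

# Per-user and per-video interaction counts expose the long-tailed activity
# distribution reported in the technical validation below.
user_activity = interactions.groupby("user_id").size().sort_values(ascending=False)
video_popularity = interactions.groupby("video_id").size().sort_values(ascending=False)

print(user_activity.describe())    # a few heavy users, a long tail of light users
print(video_popularity.head(10))   # the most-interacted videos
```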
2. Multimodal Content Representation
Rich video content is a hallmark. For each video:
- Visual: Each video is divided into 8 temporal clips, and a 256-dimensional visual embedding is extracted from each clip via pre-trained ResNet and Vision Transformer (ViT) models, enabling semantic video retrieval and t-SNE-based cluster validation (a minimal extraction sketch appears at the end of this section).
- Audio/Text: Automatic Speech Recognition (ASR) generates per-video transcripts, with Chinese→English translation via LLaMA3-8B, supporting NLP and multilingual analysis.
- Metadata: Titles (140,341 unique), tags (79,705 manually added), and author references are included.
- Content Access: To mitigate copyright, original URLs are supplied; full videos are stored for permitted research use.
The result is a dataset suited not only to interaction or recommendation research, but also to content-based retrieval, cross-modal analysis, and multimodal representation learning.
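To make the clip-level visual representation concrete, the sketch below approximates the pipeline with an off-the-shelf ResNet-50 backbone; the frame-sampling strategy, the linear projection to 256 dimensions, and the file path are illustrative assumptions rather than the authors' exact feature extractor (a ViT backbone would follow the same pattern).

```python
# Illustrative clip-level embedding extraction (assumed sampling and projection;
# the dataset's released 256-d features may be produced differently).
import torch
from torchvision.io import read_video
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
backbone = resnet50(weights=weights)
backbone.fc = torch.nn.Identity()       # expose the 2048-d pooled features
backbone.eval()
project = torch.nn.Linear(2048, 256)    # assumed projection down to 256 dimensions
preprocess = weights.transforms()       # resize, crop, normalize

# Decode frames as (T, C, H, W) uint8 and split into 8 temporal clips.
frames, _, _ = read_video("example_video.mp4", output_format="TCHW")
clips = torch.chunk(frames, 8, dim=0)

with torch.no_grad():
    clip_embeddings = []
    for clip in clips:
        # Use one representative (middle) frame per clip; mean-pooling frames is an alternative.
        frame = preprocess(clip[len(clip) // 2]).unsqueeze(0)
        clip_embeddings.append(project(backbone(frame)))
    clip_embeddings = torch.cat(clip_embeddings)   # shape: (8, 256)
```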
3. Technical Validation and Benchmarking
A rigorous four-pronged technical validation substantiates data integrity and utility:
- Behavior / Attribute Richness: Usage follows a long-tailed distribution (most users interact lightly and most videos receive few interactions, while a small minority account for heavy activity), confirming behavioral diversity; demographic and device distributions mirror platform-wide statistics, attesting to representativeness.
- Content Feature Representation: t-SNE projections of visual embeddings yield pronounced category-based clusters at both coarse and fine taxonomic levels, evidencing that learned content features are semantically meaningful.
- Recommendation Algorithms: Comprehensive benchmarking of 8 state-of-the-art general and multimodal recommender models (BPR, LightGCN, LayerGCN, VBPR, MMGCN, GRCN, LGMRec, BM3) using Recall@10/20 and NDCG@10/20 (a minimal sketch of these metrics follows this list). The multimodal method BM3 ranks highest (Recall@10 = 0.0238 vs. 0.0113 for BPR), confirming that multimodal content improves downstream performance.
- Filter Bubble Analysis: A principled metric quantifies content-exposure narrowing over time as a ratio based on diversity across the three-level category taxonomy. For active users (≥3 videos/day) this ratio remains stable over the observation period; for inactive users it increases, challenging the assumption that heavier engagement intensifies filter-bubble effects.
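As a rough illustration (an assumption for exposition, not the paper's exact definition), such a narrowing ratio can be formalized by comparing the Category-III diversity a user consumes in a later window against an initial window:

$$
\mathrm{Narrowing}_u(t) = 1 - \frac{\lvert \mathcal{C}_u(t) \rvert}{\lvert \mathcal{C}_u(t_0) \rvert}
$$

where $\mathcal{C}_u(t)$ is the set of Category-III labels among videos watched by user $u$ in time window $t$ and $t_0$ is the initial window; higher values indicate narrower exposure.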
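For the benchmarking above, the ranking metrics can be computed per user as in the following minimal sketch; the data structures are illustrative assumptions, and the released benchmark code defines the authoritative evaluation protocol.

```python
# Minimal per-user Recall@K / NDCG@K (assumed inputs: a model's ranked item list
# and the user's held-out relevant items; the official split may differ).
import numpy as np

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the held-out items that appear in the top-k ranking."""
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items) if relevant_items else 0.0

def ndcg_at_k(ranked_items, relevant_items, k):
    """Discounted gain of hits in the top-k, normalized by the ideal ranking."""
    dcg = sum(1.0 / np.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant_items)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_items), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: one user's ranking against two held-out positives.
print(recall_at_k([5, 9, 2, 7], [9, 3], k=2))  # 0.5
print(ndcg_at_k([5, 9, 2, 7], [9, 3], k=2))    # ~0.387
```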
4. Research Applications and Scientific Utility
These datasets have broad applicability across several research fields:
- User Modeling: Construction and validation of user-profile, sequential behavior, and preference models by leveraging explicit and implicit feedback enriched with demographic, location, and device context.
- Computational Social Science: Enables studies of demographic fairness, regional variation, algorithmic bias, filter bubbles, and questions of information addiction or wellbeing.
- Human Behavioral Understanding: Joint logs of behavior, attributes, and content provide the empirical substrate for micro and macro-level studies of digital engagement, attention, information diet, and digital socialization.
- Recommendation and Retrieval: Comprehensive multimodal content and granular user feedback allow rigorous development and evaluation of advanced recommendation engines, including those leveraging text, vision, and audio signals.
5. Unique Attributes Compared to Prior Art
This dataset offers several advances over prior resources:
- Joint presence of detailed behavior logs, user attribute tables, and full raw video content; prior datasets in this space typically lack one or more of these dimensions.
- The hierarchical content taxonomy (3-layer, >300 subtypes) enables analytic granularity for genre/topic diversity and filter bubble research.
- Multimodal content (vision, audio, text) is consistently parsed, partitioned, and represented via contemporary neural architectures (ResNet, ViT, modern ASR).
- Technical validation covers not only input data richness and model benchmarks but also sociological phenomena (filter bubbles), establishing both resource breadth and scientific readiness.
- Privacy and Ethics: Anonymization, opt-in consent, exclusion of users under 20, and device-privacy safeguards. No personally identifiable information is distributed.
- Domain-aligned scale: Approximates industrial datasets in both volume and diversity.
6. Limitations and Access Considerations
While comprehensive, the dataset excludes users under 20 and depends on voluntary participation, which may bias it toward more privacy-conscious or digitally literate demographics. All data is anonymized, with device identifiers stripped except for the device model. Access is governed by informed consent and strict data-sharing protocols. Researchers can obtain the data and code for experimentation, evaluation, and further development through the official repository (https://github.com/tsinghua-fib-lab/ShortVideo_dataset).
7. Implications and Future Directions
Large-scale, multimodal datasets of this nature catalyze a spectrum of research areas, from recommender system design and user engagement modeling to computational social science and studies of algorithmic influence. The inclusion of high-dimensional user and content attributes, especially in a privacy-centered design, positions these resources as cornerstones for not only AI model development but also for empirical investigation of the societal and behavioral impact of mobile short-video platforms. The technical infrastructure and validation strategies outlined may serve as templates for future dataset creation and evaluation across global social video ecosystems (Shang et al., 9 Feb 2025).