XS-Video Dataset: Cross-Platform Analysis
- XS-Video is a large-scale, heterogeneous graph dataset aggregating short-video data from five major Chinese platforms.
- It integrates multimodal features such as video, text, scalar interaction metrics, and comment content for comprehensive influence analysis.
- The dataset supports the SPIR task with advanced graph modeling via NetGPT, achieving superior predictive performance and robust propagation estimation.
The XS-Video dataset is a large-scale, real-world resource for analyzing short-video propagation influence across multiple platforms. Developed for the Short-video Propagation Influence Rating (SPIR) task, XS-Video is the first dataset of its kind to combine comprehensive cross-platform aggregation with multimodal node features, including video, text, scalar interaction metrics, and structured comment content. It enables rigorous graph-based modeling of how short videos propagate commercial value, public opinion, and behaviors at population scale.
1. Dataset Scope and Construction
XS-Video aggregates short-video propagation data from five major Chinese platforms: Douyin, Kuaishou, Xigua, Toutiao, and Bilibili, spanning the period from 2024-11-25 to 2025-01-02. The dataset contains:
- 117,720 videos
- 381,926 samples (“states”, each corresponding to a video observed at a distinct crawl time, with intervals of at least two days between consecutive crawls of the same video)
- 535 trending topics (serving as topic seeds)
- ~419,374 anonymized users (posters and commenters)
- 923,045 comment nodes (full text and timestamp)
Video durations range from 1 second to 5 minutes, consistent with typical short-video formats.
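For intuition, a single propagation “state” can be pictured as a record like the one below; the field names are illustrative and do not reflect the released schema.

```python
# Hypothetical example of one propagation "state": a single video observed at one
# crawl time. Field names are illustrative, not the dataset's actual schema.
state = {
    "video_id": "dy_000123",               # anonymized video identifier
    "platform": "Douyin",                  # one of the five source platforms
    "topic": "winter travel",              # one of the 535 trending-topic seeds
    "post_time": "2024-11-26T10:15:00",
    "crawl_time": "2024-11-30T08:00:00",   # states of the same video are >= 2 days apart
    "views": 184_230,
    "likes": 9_412,
    "shares": 1_038,
    "collects": 655,
    "comments": 312,
    "fans": 25_700,                        # poster's follower count at crawl time
    "comment_texts": ["..."],              # full comment text and timestamps become nodes
}
```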
Heterogeneous Graph Structure
XS-Video encodes a heterogeneous propagation graph comprising static nodes (platform, topic, title, description, post timestamp), dynamic nodes (views, likes, shares, collects, comments, fans, sampled at crawl times), and comment nodes (text and time), connected via a diverse set of edges. The graph contains:
| Node Type | Count |
|---|---|
| video (MP4) | 381,926 |
| platform | 5 |
| topic | 535 |
| title / description | 381,926 each |
| comment | 923,045 |
| interaction metrics (views, likes, etc.) | 381,926 each |

| Edge Type | Count |
|---|---|
| standard (“is_*_of”) | N/A |
| has_same_author_as | 5,372,152 |
| has_same_topic_as | 1,655,971,954 |
| is_history_of | 484,364 |
The total graph size comprises 5,506,697 nodes and 1,667,716,553 edges.
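The following is a minimal sketch of how such a heterogeneous graph could be assembled with PyTorch Geometric's `HeteroData`; node/edge type names, feature dimensions, and counts are illustrative placeholders rather than the released format.

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Node features (tiny placeholder counts; the full graph has e.g. 381,926 video-state nodes).
data["video"].x = torch.randn(100, 768)     # ViT-based video embeddings
data["topic"].x = torch.randn(10, 768)      # topic text embeddings
data["comment"].x = torch.randn(200, 768)   # comment text + timestamp embeddings
data["views"].x = torch.randn(100, 768)     # scalar interaction nodes

# Edges are stored per (source_type, relation, target_type) triple.
data["topic", "is_topic_of", "video"].edge_index = torch.tensor([[0, 1], [0, 1]])
data["views", "is_views_of", "video"].edge_index = torch.tensor([[0, 1], [0, 1]])
data["video", "has_same_topic_as", "video"].edge_index = torch.tensor([[0], [1]])
data["video", "is_history_of", "video"].edge_index = torch.tensor([[0], [1]])

print(data)
```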
Indicator Alignment and Annotations
Cross-platform indicator scaling mitigates platform-specific biases via least-squares minimization, using Kuaishou as the reference platform. For each interaction type $k$ on platform $p$, a scaling coefficient $\alpha_{p,k}$ is fitted:

$$\alpha_{p,k} = \arg\min_{\alpha} \sum_{j} \big(\alpha \, x^{(j)}_{p,k} - x^{(j)}_{\text{Kuaishou},k}\big)^2,$$

where $x^{(j)}_{p,k}$ denotes matched indicator statistics for platform $p$. The resulting scaled statistics are aligned across platforms prior to influence annotation.
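Under the assumption that the alignment fits a single multiplicative coefficient per platform and indicator against matched Kuaishou statistics, the closed-form least-squares solution looks like the sketch below (the paper's exact pairing of statistics may differ):

```python
import numpy as np

def align_indicator(platform_vals: np.ndarray, kuaishou_vals: np.ndarray) -> float:
    """Fit one multiplicative scaling coefficient by least squares.

    Minimizes sum_j (alpha * platform_vals[j] - kuaishou_vals[j])**2, whose
    closed-form solution is alpha = <x, y> / <x, x>.
    """
    return float(platform_vals @ kuaishou_vals / (platform_vals @ platform_vals))

# Illustrative values: per-topic average view counts on one platform vs. Kuaishou.
douyin_views = np.array([1.2e6, 4.0e5, 9.5e4])
kuaishou_views = np.array([8.0e5, 2.6e5, 7.0e4])
alpha = align_indicator(douyin_views, kuaishou_views)
aligned_views = alpha * douyin_views
```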
After two weeks, each video receives a rating $y \in \{1, \dots, 10\}$ via an indicator-thresholding rule:

$$y = \max\big\{\, l \;\big|\; \exists\, i :\; x_i \ge \theta_{i,l} \,\big\},$$

where $x_i$ is the $i$-th aligned indicator and $\theta_{i,l}$ is its $l$-th-level threshold.
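A minimal sketch of this thresholding rule, assuming per-indicator, non-decreasing level thresholds and rating a video at the highest level reached by any indicator (the paper may combine indicators differently):

```python
import numpy as np

def spir_level(indicators: np.ndarray, thresholds: np.ndarray) -> int:
    """Assign a 1-10 influence level by indicator thresholding.

    indicators: aligned indicator values, shape (num_indicators,)
    thresholds: per-indicator level thresholds, shape (num_indicators, 10),
                non-decreasing along the level axis.
    """
    reached = indicators[:, None] >= thresholds        # (num_indicators, 10)
    levels_hit = np.where(reached.any(axis=0))[0]      # level indices reached by any indicator
    return int(levels_hit.max()) + 1 if len(levels_hit) > 0 else 1
```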
2. SPIR Task Formalization
SPIR reframes popularity prediction as estimation of long-term propagation influence on a 10-level ordinal scale. Given a video $v$, with its content features, partial interaction history, and local graph structure, the goal is to predict its influence level $y_v \in \{1, \dots, 10\}$ at a two-week horizon.
The underlying graph is $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{S}, \mathcal{Y})$, where:
- $\mathcal{V}$: all nodes (videos, attributes, interactions, comments, etc.)
- $\mathcal{E}$: directed edges (see the tables above)
- $\mathcal{S} \subseteq \mathcal{V}$: short-video nodes to be rated
- $\mathcal{Y}$: ground-truth influence labels for the videos in $\mathcal{S}$
The labeling function aggregates the platform-aligned indicators and applies the threshold comparison above to assign each video its level.
3. Feature Extraction and Graph Representation
Raw feature encoding includes:
- Video nodes: AvgPool(ViT(video))
- Text nodes: RoBERTa(text)
- Timestamp nodes: sinusoidal positional encoding
- Scalar nodes: encodings of the aligned interaction counts (likes, views, shares, etc.)
- Comment nodes: concatenated text + timestamp
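A minimal sketch of this raw feature extraction, assuming off-the-shelf Hugging Face checkpoints (`google/vit-base-patch16-224-in21k`, `roberta-base`); the paper's actual backbones, in particular a Chinese-language text encoder, may differ:

```python
import torch
from transformers import AutoImageProcessor, ViTModel, AutoTokenizer, RobertaModel

# Illustrative checkpoints; swap in the backbones actually used for XS-Video.
vit_proc = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()
tok = AutoTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base").eval()

@torch.no_grad()
def encode_video(frames):
    """AvgPool(ViT(video)): embed sampled frames, then average over frames and patches."""
    inputs = vit_proc(images=frames, return_tensors="pt")   # frames: list of PIL images
    hidden = vit(**inputs).last_hidden_state                # (num_frames, num_patches + 1, 768)
    return hidden.mean(dim=(0, 1))                          # (768,)

@torch.no_grad()
def encode_text(text: str):
    """RoBERTa(text): use the first-token embedding as the node feature."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    return roberta(**inputs).last_hidden_state[0, 0]        # (768,)
```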
The graph encoder employs a two-layer relational GCN (RGCN) to integrate the heterogeneous relations:

$$h_v^{(l+1)} = \sigma\Big( W_0^{(l)} h_v^{(l)} + \sum_{r \in \mathcal{R}} \sum_{u \in \mathcal{N}_v^{r}} \tfrac{1}{|\mathcal{N}_v^{r}|} W_r^{(l)} h_u^{(l)} \Big),$$

with initializations $h_v^{(0)}$ set to the raw node features above, producing node embeddings $h_v^{(2)} \in \mathbb{R}^{d}$.
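A two-layer RGCN of this form can be sketched with PyTorch Geometric's `RGCNConv`; hidden sizes and the number of relation types are placeholders:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGCNConv

class TwoLayerRGCN(torch.nn.Module):
    """Two relational graph-convolution layers over the propagation graph."""

    def __init__(self, in_dim=768, hid_dim=512, out_dim=512, num_relations=12):
        super().__init__()
        self.conv1 = RGCNConv(in_dim, hid_dim, num_relations)
        self.conv2 = RGCNConv(hid_dim, out_dim, num_relations)

    def forward(self, x, edge_index, edge_type):
        # x: raw node features h^(0); edge_type holds one relation id per edge.
        h = F.relu(self.conv1(x, edge_index, edge_type))
        return self.conv2(h, edge_index, edge_type)
```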
4. Model Training: NetGPT Framework
NetGPT is a Large Graph Model uniting heterogeneous GNN encoding with a vision-language LLM (Qwen2-VL), optimized via three sequential stages:
Stage 1: Heterogeneous Graph Pretraining
- Feature extraction/propagation with RGCN.
- Continuous SPIR prediction head $\hat{y}_v = f_\theta(h_v)$, a small regression head over the RGCN embedding of each rated video node.
- $\ell_2$ (MSE) loss over the training set.
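A minimal sketch of the Stage-1 objective, assuming a small MLP regression head over the RGCN embedding of each rated video node (layer sizes are placeholders):

```python
import torch

# Hypothetical Stage-1 head: regress a continuous SPIR score from the RGCN embedding.
head = torch.nn.Sequential(
    torch.nn.Linear(512, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

def stage1_loss(video_embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L2 (MSE) loss between predicted and ground-truth influence levels."""
    preds = head(video_embeddings).squeeze(-1)
    return torch.nn.functional.mse_loss(preds, labels.float())
```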
Stage 2: Supervised Language Fine-Tuning
- Project graph features into the LLM token-embedding space via a learned projector (e.g., a linear map $t_v = W_p h_v + b_p$).
- The prompt embeds the projected graph feature $t_v$ as a placeholder token; an accompanying JSON payload includes the video's essential metadata.
- Only projector weights are updated to maximize generation likelihood; both GNN and base LLM are frozen.
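A minimal sketch of the Stage-2 projector, assuming a single linear map into the LLM's token-embedding space (the 512 and 3584 dimensions are placeholders, the latter matching Qwen2-VL-7B's hidden size); only this projector would be trainable at this stage:

```python
import torch

# Hypothetical projector from graph-embedding space into the LLM token-embedding space.
projector = torch.nn.Linear(512, 3584)

def build_inputs_embeds(graph_embedding, prompt_embeds, placeholder_pos):
    """Splice the projected graph token into the prompt at its placeholder position.

    graph_embedding: (512,) RGCN output for one video.
    prompt_embeds:   (seq_len, 3584) token embeddings of the text prompt.
    placeholder_pos: index of the placeholder token to replace.
    """
    graph_token = projector(graph_embedding).unsqueeze(0)   # (1, 3584)
    return torch.cat(
        [prompt_embeds[:placeholder_pos], graph_token, prompt_embeds[placeholder_pos + 1:]],
        dim=0,
    )
```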
Stage 3: Task-Oriented Predictor Fine-Tuning
- The last-token LLM hidden state feeds a regression head: $\hat{y}_v = \mathrm{MLP}(h_{\text{last}})$.
- $\ell_2$ (MSE) loss; joint tuning of the projector, regression head, and last four LLM layers.
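A sketch of the Stage-3 head, assuming the regression operates on the final token's hidden state (hidden size again a placeholder):

```python
import torch

# Hypothetical Stage-3 head: regress the influence level from the LLM's last-token state.
regression_head = torch.nn.Linear(3584, 1)

def stage3_loss(last_hidden_state: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """last_hidden_state: (batch, seq_len, hidden); labels: (batch,) influence levels."""
    last_token = last_hidden_state[:, -1, :]            # (batch, hidden)
    preds = regression_head(last_token).squeeze(-1)     # (batch,)
    return torch.nn.functional.mse_loss(preds, labels.float())
```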
5. Performance Evaluation and Comparative Analysis
Evaluation employs both ordinal classification and regression metrics:
- Accuracy (ACC): fraction of samples whose rounded prediction matches the ground-truth level
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
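These metrics are straightforward to compute from continuous predictions; the sketch below assumes predictions are rounded and clipped to the 1-10 scale for ACC:

```python
import numpy as np

def spir_metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    """ACC (correctly rounded level), MSE, and MAE for continuous SPIR predictions."""
    rounded = np.clip(np.rint(preds), 1, 10)
    return {
        "ACC": float((rounded == labels).mean()),
        "MSE": float(((preds - labels) ** 2).mean()),
        "MAE": float(np.abs(preds - labels).mean()),
    }
```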
On the XS-Video test set, NetGPT outperforms all baselines:
| Model | Input | ACC | MSE | MAE |
|---|---|---|---|---|
| GCN | Graph | 0.4474 | 1.0623 | 0.7599 |
| HAN | Graph | 0.2619 | 2.6666 | 1.2724 |
| HetSANN | Graph | 0.5078 | 0.8917 | 0.6803 |
| RGCN | Graph | 0.6313 | 0.7801 | 0.5844 |
| Qwen2-VL | Text+Video | 0.5884 | 1.6820 | 0.6629 |
| NetGPT | Graph+Text | 0.6777 | 0.7169 | 0.5457 |
Ablation experiments reveal sensitivity to video node features, video–video and interaction edges, LLM alignment, and LLM model size. NetGPT achieves a 7.3% relative ACC gain over RGCN, and reduces MSE/MAE by 8.1%/6.6%. Compared to Qwen2-VL (state-of-the-art multimodal LLM baseline), NetGPT improves ACC by 15.2%, cuts MSE by 57.4%, and MAE by 17.7%.
Temporal subgroup analysis for short-term (~3 days), medium-term (~7 days), and long-term (>7 days) predictions further demonstrates consistent error reduction and variance control with NetGPT relative to RGCN, particularly for longer observation windows.
6. Strengths, Limitations, and Applications
XS-Video’s strengths include:
- Heterogeneous multimodal data fusion: combines video, text, temporal, scalar, and comment features within a large propagation graph spanning 5 platforms.
- Three-stage NetGPT training: enables integration of GNN-based graph reasoning with LLM world knowledge; preserves pretrained weights and facilitates robust multimodal alignment.
- Dataset realism and scale: 5.5 million nodes, 1.7 billion edges; full representation of all interaction types and comment content, eclipsing prior datasets.
- Superior predictive performance: NetGPT bridges performance gaps between GNNs and LLMs on SPIR.
Limitations include:
- Computational demands: Requires 8×80 GB GPUs and DeepSpeed pipeline parallelization for efficient training.
- Discrete label scale: 10-level SPIR ratings do not resolve fine-grained propagation dynamics.
- Input sequence constraints: Very long graph-encoded sequences stress LLM context windows.
- Threshold-based artifact risks: Label boundaries may induce discontinuities in predictions.
- Platform drift: Continuous re-annotation may be necessary as platform conventions and content types evolve.
Potential applications facilitated by XS-Video and NetGPT encompass:
- Early ROI estimation in advertising and marketing for short-video feeds
- Recommendation engine enhancement via long-term influence predictions
- Rapid identification of viral carriers in misinformation/public-interest monitoring
- Influencer ranking and network analysis
- User behavior and community trend modeling
- Cross-platform propagation forecasting for viral bridge detection
7. Significance for Short-Video Propagation Research
XS-Video sets a new benchmark in short-video propagation analysis, advancing from conventional single-metric, single-platform forecasting to multi-dimensional, cross-platform influence rating. By providing an extensive, real-world, heterogeneous graph dataset and robust modeling frameworks—most notably NetGPT—XS-Video catalyzes empirical investigation of commercial value estimation, public opinion tracking, recommendation systems, ROI prediction, and networked trend analysis at unprecedented scale and fidelity. This resource addresses prevailing limitations in previous datasets and models, and its graph/LLM fusion methodology is broadly extensible to related propagation and influence estimation domains.