XS-Video Dataset: Cross-Platform Analysis

Updated 12 November 2025
  • XS-Video is a large-scale, heterogeneous graph dataset aggregating short-video data from five major Chinese platforms.
  • It integrates multimodal features such as video, text, scalar interaction metrics, and comment content for comprehensive influence analysis.
  • The dataset supports the SPIR task with advanced graph modeling via NetGPT, achieving superior predictive performance and robust propagation estimation.

The XS-Video dataset is a large-scale, real-world resource for analyzing short-video propagation influence across multiple platforms. Developed for the Short-video Propagation Influence Rating (SPIR) task, XS-Video is the first dataset of its kind to combine comprehensive cross-platform aggregation with multimodal node features, including video, text, scalar interaction metrics, and structured comment content. It enables rigorous graph-based modeling of how short videos propagate commercial value, public opinion, and behaviors at population scale.

1. Dataset Scope and Construction

XS-Video aggregates short-video propagation data from five major Chinese platforms: Douyin, Kuaishou, Xigua, Toutiao, and Bilibili, spanning the period from 2024-11-25 to 2025-01-02. The dataset contains:

  • 117,720 videos
  • 381,926 samples (“states,” each corresponding to a video observed at a distinct crawl time, with intervals of at least two days between crawls)
  • 535 trending topics (serving as topic seeds)
  • ~419,374 anonymized users (posters and commenters)
  • 923,045 comment nodes (full text and timestamp)

Video durations range from 1 second to 5 minutes, consistent with typical short-video content.

Heterogeneous Graph Structure

XS-Video encodes a heterogeneous propagation graph comprising static nodes (platform, topic, title, description, post timestamp), dynamic nodes (views, likes, shares, collects, comments, fans, sampled at crawl times), and comment nodes (text and time), connected via a diverse set of edges. The graph contains:

Node Type                                  Count
video (MP4)                                381,926
platform                                   5
topic                                      535
title/desc                                 381,926 each
comment                                    923,045
interaction metrics (views, likes, etc.)   381,926 each

Edge Type                Count
standard (“is_*_of”)     N/A
has_same_author_as       5,372,152
has_same_topic_as        1,655,971,954
is_history_of            484,364

In total, the graph comprises 5,506,697 nodes and 1,667,716,553 edges.

Indicator Alignment and Annotations

Cross-platform indicator scaling mitigates platform-specific biases via least-squares minimization, using Kuaishou as the reference platform. For each interaction type Ind_*(s) on platform *:

\alpha_* = \arg\min_{\alpha} \frac{1}{|S_*|} \sum_{s\in S_*} \left[ \frac{Ind_K(s) - \alpha\, Ind_*(s)}{Ind_K(s) + 1} \right]^2

Here S_* denotes the samples from platform * and Ind_K the Kuaishou indicator. Resulting statistics are aligned across platforms prior to influence annotation.

After two weeks, each video receives a rating y_v \in \{0,1,\dots,9\} via an indicator-thresholding rule:

y_v = \max \{ \ell \in \{0,\dots,9\} \mid \exists k: I^k_v > \tau_\ell^k \}

where I^k_v is the k-th indicator of video v and \tau_\ell^k is the level-\ell threshold for indicator k.
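
A minimal sketch of both steps, assuming synthetic indicator arrays and hypothetical thresholds (the paper's actual threshold values are not reproduced here). Because the objective is weighted least squares with per-sample weights w = 1/(Ind_K + 1), the scale admits a closed form:

    import numpy as np

    def fit_alignment_scale(ind_k, ind_p):
        """Fit alpha_* aligning a platform's indicator to Kuaishou's.

        Setting the derivative of the mean relative squared error to zero
        gives the weighted least-squares solution below.
        """
        w2 = 1.0 / (ind_k + 1.0) ** 2
        return np.sum(w2 * ind_k * ind_p) / np.sum(w2 * ind_p ** 2)

    def rate_video(indicators, thresholds):
        """y_v = max level l such that some indicator exceeds tau_l^k.

        `thresholds[l][k]` plays the role of tau_l^k; values would be
        hypothetical here. Defaults to level 0 if no threshold is exceeded.
        """
        level = 0
        for l in range(len(thresholds)):
            if any(ind > thresholds[l][k] for k, ind in enumerate(indicators)):
                level = l
        return level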

2. SPIR Task Formalization

SPIR reframes popularity prediction as estimation of long-term propagation influence on a 10-level ordinal scale. Given a video v, with content features, partial interaction history, and local graph structure, the goal is to predict y_v at a two-week horizon.

The underlying graph is G = (V, E, S, Y), where:

  • V: all nodes (videos, attributes, interactions, comments, etc.)
  • E: directed edges (see the tables above)
  • S \subset V: short-video nodes to be rated
  • Y = \{y_v : v \in S\}: ground-truth influence labels

The labeling function \mathcal{L}: G \rightarrow Y integrates platform-aligned indicators, followed by threshold comparison for level assignment.
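
For concreteness, a minimal container for this tuple might look as follows; the field names and types are illustrative assumptions, not the dataset's published schema:

    from dataclasses import dataclass

    @dataclass
    class SPIRGraph:
        nodes: dict       # V: node_id -> (node_type, raw_feature)
        edges: list       # E: directed (src_id, relation, dst_id) triples
        video_ids: set    # S: the short-video nodes to be rated
        labels: dict      # Y: video_id -> influence level in {0, ..., 9}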

3. Feature Extraction and Graph Representation

Raw feature encoding includes (see the sketch after this list):

  • Video nodes: AvgPool(ViT(video)) \in \mathbb{R}^{3584}
  • Text nodes: RoBERTa(text) \in \mathbb{R}^{1024}
  • Timestamp nodes: sinusoidal positional encoding \in \mathbb{R}^{512}
  • Scalar nodes: \log(v+1) for likes/views/etc., \in \mathbb{R}^{1}
  • Comment nodes: concatenated text + timestamp encodings \in \mathbb{R}^{1536}
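
The lighter-weight transforms are easy to make concrete. A sketch, assuming the standard sinusoidal frequency base of 10000 (the paper's exact choice is not specified here) and a precomputed 1024-d RoBERTa text embedding:

    import numpy as np

    def sinusoidal_encoding(t, dim=512):
        """Sinusoidal positional encoding of a (Unix) timestamp."""
        i = np.arange(dim // 2)
        freqs = 1.0 / (10000.0 ** (2 * i / dim))
        angles = t * freqs
        return np.concatenate([np.sin(angles), np.cos(angles)])

    def scalar_feature(count):
        """Log-compress a raw interaction count (views, likes, ...)."""
        return np.log1p(count)  # log(v + 1), a single R^1 feature

    def comment_feature(text_emb_1024, timestamp):
        """Concatenate text embedding with the time encoding -> R^1536."""
        return np.concatenate([text_emb_1024, sinusoidal_encoding(timestamp)])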

The graph structure employs a two-layer RGCN to integrate heterogeneous relations:

\mathbf{h}_v^{(l+1)} = \sigma\left( \sum_{r \in R} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{c_{v,r}} W_r^{(l)} \mathbf{h}_u^{(l)} + W_0^{(l)} \mathbf{h}_v^{(l)} \right)

with initialization \mathbf{h}_v^{(0)} = f_v, producing \mathbf{h}_v^{(2)} \in \mathbb{R}^{d_g}.
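
A plain-PyTorch sketch of one such relational layer; the authors' implementation details (basis decomposition, exact normalization constants) may differ:

    import torch
    import torch.nn as nn

    class RGCNLayer(nn.Module):
        """One layer of the update rule above: per-relation message passing
        with mean normalization (1/c_{v,r}) plus a self-loop term (W_0)."""
        def __init__(self, in_dim, out_dim, num_relations):
            super().__init__()
            self.rel = nn.ModuleList(
                nn.Linear(in_dim, out_dim, bias=False)
                for _ in range(num_relations))
            self.self_loop = nn.Linear(in_dim, out_dim, bias=False)  # W_0

        def forward(self, h, edges_by_relation):
            # edges_by_relation[r] = (src, dst) index tensors for relation r
            out = self.self_loop(h)
            for r, (src, dst) in enumerate(edges_by_relation):
                msg = self.rel[r](h[src])                        # W_r h_u
                agg = torch.zeros_like(out).index_add_(0, dst, msg)
                deg = torch.zeros(h.size(0), 1).index_add_(
                    0, dst, torch.ones(dst.size(0), 1))
                out = out + agg / deg.clamp(min=1.0)             # 1/c_{v,r}
            return torch.relu(out)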

4. Model Training: NetGPT Framework

NetGPT is a Large Graph Model uniting heterogeneous GNN encoding with a vision-language LLM (Qwen2-VL), optimized via three sequential stages:

Stage 1: Heterogeneous Graph Pretraining

  • Feature extraction/propagation with RGCN.
  • Continuous SPIR prediction head: \hat{y}_v = 9 \cdot \sigma(W_1 \mathbf{h}_v^{(2)} + b_1).
  • \mathrm{SmoothL1} loss over the training set (sketched below).
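
A sketch of the Stage 1 head and loss under an assumed embedding width d_g = 256 (the actual value is not specified here):

    import torch
    import torch.nn as nn

    class Stage1Head(nn.Module):
        """Squash the RGCN video embedding into the continuous range [0, 9]."""
        def __init__(self, d_g=256):
            super().__init__()
            self.linear = nn.Linear(d_g, 1)            # W_1, b_1

        def forward(self, h_v):
            return 9.0 * torch.sigmoid(self.linear(h_v)).squeeze(-1)

    head, loss_fn = Stage1Head(), nn.SmoothL1Loss()
    h = torch.randn(32, 256)                           # a batch of h_v^{(2)}
    y = torch.randint(0, 10, (32,)).float()            # ground-truth levels
    loss = loss_fn(head(h), y)                         # SmoothL1 training loss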

Stage 2: Supervised Language Fine-Tuning

  • Project graph features into the LLM token space: e_v = W_2 \mathbf{h}_v^{(2)} + b_2 \in \mathbb{R}^{d_{lm}}.
  • The prompt embeds e_v as a placeholder token; a JSON payload supplies essential metadata.
  • Only the projector weights are updated to maximize generation likelihood; both the GNN and the base LLM stay frozen (see the sketch after this list).
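
A sketch of the Stage 2 wiring with stand-in modules (the real encoder is the pretrained RGCN and the real backbone is Qwen2-VL; d_lm = 3584, matching Qwen2-VL-7B's hidden size, is an assumption):

    import torch.nn as nn

    d_g, d_lm = 256, 3584                 # assumed dimensions

    gnn = nn.Linear(8, d_g)               # stand-in for the pretrained RGCN
    llm = nn.Linear(d_lm, d_lm)           # stand-in for the Qwen2-VL backbone
    projector = nn.Linear(d_g, d_lm)      # W_2, b_2: the only trainable module

    for module in (gnn, llm):             # freeze GNN and LLM weights
        for p in module.parameters():
            p.requires_grad = False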

Stage 3: Task-Oriented Predictor Fine-Tuning

  • The last-token LLM hidden state feeds a regression head: \tilde{y}_v = 9 \cdot \sigma(W_3 \mathbf{h}_\mathrm{last} + b_3).
  • \mathrm{SmoothL1} loss; joint tuning of the projector, the regression head, and the last four LLM layers (sketched below).
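
A sketch of the Stage 3 head; the commented unfreezing loop assumes a HuggingFace-style layer list, which may not match the actual NetGPT code:

    import torch
    import torch.nn as nn

    class Stage3Head(nn.Module):
        """Regress the 10-level rating from the last-token hidden state."""
        def __init__(self, d_lm=3584):
            super().__init__()
            self.linear = nn.Linear(d_lm, 1)           # W_3, b_3

        def forward(self, h_last):
            return 9.0 * torch.sigmoid(self.linear(h_last)).squeeze(-1)

    # Unfreeze only the last four transformer blocks (hypothetical path):
    # for block in llm.model.layers[-4:]:
    #     for p in block.parameters():
    #         p.requires_grad = True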

5. Performance Evaluation and Comparative Analysis

Evaluation employs both ordinal classification and regression metrics (computed as in the sketch after this list):

  • Accuracy (ACC): fraction of samples whose rounded prediction matches the label
  • Mean Squared Error (MSE)
  • Mean Absolute Error (MAE)
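
These reduce to a few lines over NumPy arrays:

    import numpy as np

    def spir_metrics(y_true, y_pred):
        """ACC (rounded prediction matches label), MSE, and MAE."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        acc = float(np.mean(np.round(y_pred) == y_true))
        mse = float(np.mean((y_pred - y_true) ** 2))
        mae = float(np.mean(np.abs(y_pred - y_true)))
        return acc, mse, mae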

On the XS-Video test set, NetGPT outperforms all baselines:

Model      Input        ACC      MSE      MAE
GCN        Graph        0.4474   1.0623   0.7599
HAN        Graph        0.2619   2.6666   1.2724
HetSANN    Graph        0.5078   0.8917   0.6803
RGCN       Graph        0.6313   0.7801   0.5844
Qwen2-VL   Text+Video   0.5884   1.6820   0.6629
NetGPT     Graph+Text   0.6777   0.7169   0.5457

Ablation experiments reveal sensitivity to video node features, video–video and interaction edges, LLM alignment, and LLM model size. NetGPT achieves a 7.3% relative ACC gain over RGCN, and reduces MSE/MAE by 8.1%/6.6%. Compared to Qwen2-VL (state-of-the-art multimodal LLM baseline), NetGPT improves ACC by 15.2%, cuts MSE by 57.4%, and MAE by 17.7%.

Temporal subgroup analysis for short-term (~3 days), medium-term (~7 days), and long-term (>7 days) predictions further demonstrates consistent error reduction and variance control with NetGPT relative to RGCN, particularly for longer observation windows.

6. Strengths, Limitations, and Applications

XS-Video’s strengths include:

  • Heterogeneous multimodal data fusion: combines video, text, temporal, scalar, and comment features within a large propagation graph spanning 5 platforms.
  • Three-stage NetGPT training: enables integration of GNN-based graph reasoning with LLM world knowledge; preserves pretrained weights and facilitates robust multimodal alignment.
  • Dataset realism and scale: 5.5 million nodes, 1.7 billion edges; full representation of all interaction types and comment content, eclipsing prior datasets.
  • Superior predictive performance: NetGPT bridges performance gaps between GNNs and LLMs on SPIR.

Limitations include:

  • Computational demands: Requires 8×80 GB GPUs and DeepSpeed pipeline parallelization for efficient training.
  • Discrete label scale: 10-level SPIR ratings do not resolve fine-grained propagation dynamics.
  • Input sequence constraints: Very long graph-encoded sequences stress LLM context windows.
  • Threshold-based artifact risks: Label boundaries may induce discontinuities in predictions.
  • Platform drift: Continuous re-annotation may be necessary as platform conventions and content types evolve.

Potential applications facilitated by XS-Video and NetGPT encompass:

  • Early ROI estimation in advertising and marketing for short-video feeds
  • Recommendation engine enhancement via long-term influence predictions
  • Rapid identification of viral carriers in misinformation/public-interest monitoring
  • Influencer ranking and network analysis
  • User behavior and community trend modeling
  • Cross-platform propagation forecasting for viral bridge detection

7. Significance for Short-Video Propagation Research

XS-Video sets a new benchmark in short-video propagation analysis, advancing from conventional single-metric, single-platform forecasting to multi-dimensional, cross-platform influence rating. By providing an extensive, real-world, heterogeneous graph dataset and robust modeling frameworks—most notably NetGPT—XS-Video catalyzes empirical investigation of commercial value estimation, public opinion tracking, recommendation systems, ROI prediction, and networked trend analysis at unprecedented scale and fidelity. This resource addresses prevailing limitations in previous datasets and models, and its graph/LLM fusion methodology is broadly extensible to related propagation and influence estimation domains.
