XS-Video Dataset: Cross-Platform Analysis

Updated 12 November 2025
  • XS-Video is a large-scale, heterogeneous graph dataset aggregating short-video data from five major Chinese platforms.
  • It integrates multimodal features such as video, text, scalar interaction metrics, and comment content for comprehensive influence analysis.
  • The dataset supports the SPIR task with advanced graph modeling via NetGPT, achieving superior predictive performance and robust propagation estimation.

The XS-Video dataset is a large-scale, real-world resource for analyzing short-video propagation influence across multiple platforms. Developed for the Short-video Propagation Influence Rating (SPIR) task, XS-Video is the first dataset of its kind to combine comprehensive cross-platform aggregation with multimodal node features, including video, text, scalar interaction metrics, and structured comment content. It enables rigorous graph-based modeling of how short videos propagate commercial value, public opinion, and behaviors at population scale.

1. Dataset Scope and Construction

XS-Video aggregates short-video propagation data from five major Chinese platforms: Douyin, Kuaishou, Xigua, Toutiao, and Bilibili, spanning the period from 2024-11-25 to 2025-01-02. The dataset contains:

  • 117,720 videos
  • 381,926 samples (“states,” each corresponding to a video observed at a distinct crawl time, with intervals of at least two days between crawls)
  • 535 trending topics (serving as topic seeds)
  • ~419,374 anonymized users (posters and commenters)
  • 923,045 comment nodes (full text and timestamp)

Video durations range from 1 second to 5 minutes, consistent with typical short-video content.

Heterogeneous Graph Structure

XS-Video encodes a heterogeneous propagation graph comprising static nodes (platform, topic, title, description, post timestamp), dynamic nodes (views, likes, shares, collects, comments, fans, sampled at crawl times), and comment nodes (text and time), connected via a diverse set of edges. The graph contains:

Node Type                                  Count
video (MP4)                                381,926
platform                                   5
topic                                      535
title/desc                                 381,926 each
comment                                    923,045
interaction metrics (views, likes, etc.)   381,926 each

Edge Type                Count
standard (“is_*_of”)     N/A
has_same_author_as       5,372,152
has_same_topic_as        1,655,971,954
is_history_of            484,364

In total, the graph comprises 5,506,697 nodes and 1,667,716,553 edges.

Indicator Alignment and Annotations

Cross-platform indicator scaling mitigates platform-specific biases via least-squares minimization, using Kuaishou as the reference platform. For each interaction type Ind_*(s) on platform *:

\alpha_* = \arg\min_{\alpha} \frac{1}{|S_*|} \sum_{s\in S_*} \left[ \frac{Ind_K(s) - \alpha\, Ind_*(s)}{Ind_K(s) + 1} \right]^2

Here S_* denotes the samples from platform * and Ind_K the Kuaishou indicator. Resulting statistics are aligned across platforms prior to influence annotation.

After two weeks, each video receives a rating y_v \in \{0,1,\dots,9\} via an indicator-thresholding rule:

y_v = \max \{ \ell \in \{0,\dots,9\} \mid \exists k: I^k_v > \tau_\ell^k \}

where I^k_v is the k-th indicator of video v and \tau_\ell^k is the level-\ell threshold for indicator k.
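
A minimal sketch of both steps, assuming synthetic indicator arrays and hypothetical thresholds (the paper's actual threshold values are not reproduced here). Because the objective is weighted least squares with per-sample weights w = 1/(Ind_K + 1), the scale admits a closed form:

    import numpy as np

    def fit_alignment_scale(ind_k, ind_p):
        """Fit alpha_* aligning a platform's indicator to Kuaishou's.

        Setting the derivative of the mean relative squared error to zero
        gives the weighted least-squares solution below.
        """
        w2 = 1.0 / (ind_k + 1.0) ** 2
        return np.sum(w2 * ind_k * ind_p) / np.sum(w2 * ind_p ** 2)

    def rate_video(indicators, thresholds):
        """y_v = max level l such that some indicator exceeds tau_l^k.

        `thresholds[l][k]` plays the role of tau_l^k; values would be
        hypothetical here. Defaults to level 0 if no threshold is exceeded.
        """
        level = 0
        for l in range(len(thresholds)):
            if any(ind > thresholds[l][k] for k, ind in enumerate(indicators)):
                level = l
        return level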

2. SPIR Task Formalization

SPIR reframes popularity prediction as estimation of long-term propagation influence on a 10-level ordinal scale. Given a video v, with content features, partial interaction history, and local graph structure, the goal is to predict y_v at a two-week horizon.

The underlying graph is G = (V, E, S, Y), where:

  • V: all nodes (videos, attributes, interactions, comments, etc.)
  • E: directed edges (see the tables above)
  • S \subset V: short-video nodes to be rated
  • Y = \{y_v : v \in S\}: ground-truth influence labels

The labeling function \mathcal{L}: G \rightarrow Y integrates platform-aligned indicators, followed by threshold comparison for level assignment.
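
For concreteness, a minimal container for this tuple might look as follows; the field names and types are illustrative assumptions, not the dataset's published schema:

    from dataclasses import dataclass

    @dataclass
    class SPIRGraph:
        nodes: dict       # V: node_id -> (node_type, raw_feature)
        edges: list       # E: directed (src_id, relation, dst_id) triples
        video_ids: set    # S: the short-video nodes to be rated
        labels: dict      # Y: video_id -> influence level in {0, ..., 9}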

3. Feature Extraction and Graph Representation

Raw feature encoding includes (see the sketch after this list):

  • Video nodes: AvgPool(ViT(video)) \in \mathbb{R}^{3584}
  • Text nodes: RoBERTa(text) \in \mathbb{R}^{1024}
  • Timestamp nodes: sinusoidal positional encoding \in \mathbb{R}^{512}
  • Scalar nodes: \log(v+1) for likes/views/etc., \in \mathbb{R}^{1}
  • Comment nodes: concatenated text + timestamp encodings \in \mathbb{R}^{1536}
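
The lighter-weight transforms are easy to make concrete. A sketch, assuming the standard sinusoidal frequency base of 10000 (the paper's exact choice is not specified here) and a precomputed 1024-d RoBERTa text embedding:

    import numpy as np

    def sinusoidal_encoding(t, dim=512):
        """Sinusoidal positional encoding of a (Unix) timestamp."""
        i = np.arange(dim // 2)
        freqs = 1.0 / (10000.0 ** (2 * i / dim))
        angles = t * freqs
        return np.concatenate([np.sin(angles), np.cos(angles)])

    def scalar_feature(count):
        """Log-compress a raw interaction count (views, likes, ...)."""
        return np.log1p(count)  # log(v + 1), a single R^1 feature

    def comment_feature(text_emb_1024, timestamp):
        """Concatenate text embedding with the time encoding -> R^1536."""
        return np.concatenate([text_emb_1024, sinusoidal_encoding(timestamp)])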

The graph structure employs a two-layer RGCN to integrate heterogeneous relations:

\mathbf{h}_v^{(l+1)} = \sigma\left( \sum_{r \in R} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{c_{v,r}} W_r^{(l)} \mathbf{h}_u^{(l)} + W_0^{(l)} \mathbf{h}_v^{(l)} \right)

with initialization \mathbf{h}_v^{(0)} = f_v, producing \mathbf{h}_v^{(2)} \in \mathbb{R}^{d_g}.
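
A plain-PyTorch sketch of one such relational layer; the authors' implementation details (basis decomposition, exact normalization constants) may differ:

    import torch
    import torch.nn as nn

    class RGCNLayer(nn.Module):
        """One layer of the update rule above: per-relation message passing
        with mean normalization (1/c_{v,r}) plus a self-loop term (W_0)."""
        def __init__(self, in_dim, out_dim, num_relations):
            super().__init__()
            self.rel = nn.ModuleList(
                nn.Linear(in_dim, out_dim, bias=False)
                for _ in range(num_relations))
            self.self_loop = nn.Linear(in_dim, out_dim, bias=False)  # W_0

        def forward(self, h, edges_by_relation):
            # edges_by_relation[r] = (src, dst) index tensors for relation r
            out = self.self_loop(h)
            for r, (src, dst) in enumerate(edges_by_relation):
                msg = self.rel[r](h[src])                        # W_r h_u
                agg = torch.zeros_like(out).index_add_(0, dst, msg)
                deg = torch.zeros(h.size(0), 1).index_add_(
                    0, dst, torch.ones(dst.size(0), 1))
                out = out + agg / deg.clamp(min=1.0)             # 1/c_{v,r}
            return torch.relu(out)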

4. Model Training: NetGPT Framework

NetGPT is a Large Graph Model uniting heterogeneous GNN encoding with a vision-language LLM (Qwen2-VL), optimized via three sequential stages:

Stage 1: Heterogeneous Graph Pretraining

  • Feature extraction/propagation with RGCN.
  • Continuous SPIR prediction head: \hat{y}_v = 9 \cdot \sigma(W_1 \mathbf{h}_v^{(2)} + b_1).
  • \mathrm{SmoothL1} loss over the training set (sketched below).
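
A sketch of the Stage 1 head and loss under an assumed embedding width d_g = 256 (the actual value is not specified here):

    import torch
    import torch.nn as nn

    class Stage1Head(nn.Module):
        """Squash the RGCN video embedding into the continuous range [0, 9]."""
        def __init__(self, d_g=256):
            super().__init__()
            self.linear = nn.Linear(d_g, 1)            # W_1, b_1

        def forward(self, h_v):
            return 9.0 * torch.sigmoid(self.linear(h_v)).squeeze(-1)

    head, loss_fn = Stage1Head(), nn.SmoothL1Loss()
    h = torch.randn(32, 256)                           # a batch of h_v^{(2)}
    y = torch.randint(0, 10, (32,)).float()            # ground-truth levels
    loss = loss_fn(head(h), y)                         # SmoothL1 training loss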

Stage 2: Supervised Language Fine-Tuning

  • Project graph features into the LLM token space: e_v = W_2 \mathbf{h}_v^{(2)} + b_2 \in \mathbb{R}^{d_{lm}}.
  • The prompt embeds e_v as a placeholder token; a JSON payload supplies essential metadata.
  • Only the projector weights are updated to maximize generation likelihood; both the GNN and the base LLM stay frozen (see the sketch after this list).
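
A sketch of the Stage 2 wiring with stand-in modules (the real encoder is the pretrained RGCN and the real backbone is Qwen2-VL; d_lm = 3584, matching Qwen2-VL-7B's hidden size, is an assumption):

    import torch.nn as nn

    d_g, d_lm = 256, 3584                 # assumed dimensions

    gnn = nn.Linear(8, d_g)               # stand-in for the pretrained RGCN
    llm = nn.Linear(d_lm, d_lm)           # stand-in for the Qwen2-VL backbone
    projector = nn.Linear(d_g, d_lm)      # W_2, b_2: the only trainable module

    for module in (gnn, llm):             # freeze GNN and LLM weights
        for p in module.parameters():
            p.requires_grad = False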

Stage 3: Task-Oriented Predictor Fine-Tuning

  • The last-token LLM hidden state feeds a regression head: \tilde{y}_v = 9 \cdot \sigma(W_3 \mathbf{h}_\mathrm{last} + b_3).
  • \mathrm{SmoothL1} loss; joint tuning of the projector, the regression head, and the last four LLM layers (sketched below).
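
A sketch of the Stage 3 head; the commented unfreezing loop assumes a HuggingFace-style layer list, which may not match the actual NetGPT code:

    import torch
    import torch.nn as nn

    class Stage3Head(nn.Module):
        """Regress the 10-level rating from the last-token hidden state."""
        def __init__(self, d_lm=3584):
            super().__init__()
            self.linear = nn.Linear(d_lm, 1)           # W_3, b_3

        def forward(self, h_last):
            return 9.0 * torch.sigmoid(self.linear(h_last)).squeeze(-1)

    # Unfreeze only the last four transformer blocks (hypothetical path):
    # for block in llm.model.layers[-4:]:
    #     for p in block.parameters():
    #         p.requires_grad = True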

5. Performance Evaluation and Comparative Analysis

Evaluation employs both ordinal classification and regression metrics (computed as in the sketch after this list):

  • Accuracy (ACC): fraction of samples whose rounded prediction matches the label
  • Mean Squared Error (MSE)
  • Mean Absolute Error (MAE)
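
These reduce to a few lines over NumPy arrays:

    import numpy as np

    def spir_metrics(y_true, y_pred):
        """ACC (rounded prediction matches label), MSE, and MAE."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        acc = float(np.mean(np.round(y_pred) == y_true))
        mse = float(np.mean((y_pred - y_true) ** 2))
        mae = float(np.mean(np.abs(y_pred - y_true)))
        return acc, mse, mae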

On the XS-Video test set, NetGPT outperforms all baselines:

Model      Input        ACC      MSE      MAE
GCN        Graph        0.4474   1.0623   0.7599
HAN        Graph        0.2619   2.6666   1.2724
HetSANN    Graph        0.5078   0.8917   0.6803
RGCN       Graph        0.6313   0.7801   0.5844
Qwen2-VL   Text+Video   0.5884   1.6820   0.6629
NetGPT     Graph+Text   0.6777   0.7169   0.5457

Ablation experiments reveal sensitivity to video node features, video–video and interaction edges, LLM alignment, and LLM model size. NetGPT achieves a 7.3% relative ACC gain over RGCN, and reduces MSE/MAE by 8.1%/6.6%. Compared to Qwen2-VL (state-of-the-art multimodal LLM baseline), NetGPT improves ACC by 15.2%, cuts MSE by 57.4%, and MAE by 17.7%.

Temporal subgroup analysis for short-term (~3 days), medium-term (~7 days), and long-term (>7 days) predictions further demonstrates consistent error reduction and variance control with NetGPT relative to RGCN, particularly for longer observation windows.

6. Strengths, Limitations, and Applications

XS-Video’s strengths include:

  • Heterogeneous multimodal data fusion: combines video, text, temporal, scalar, and comment features within a large propagation graph spanning 5 platforms.
  • Three-stage NetGPT training: enables integration of GNN-based graph reasoning with LLM world knowledge; preserves pretrained weights and facilitates robust multimodal alignment.
  • Dataset realism and scale: 5.5 million nodes, 1.7 billion edges; full representation of all interaction types and comment content, eclipsing prior datasets.
  • Superior predictive performance: NetGPT bridges performance gaps between GNNs and LLMs on SPIR.

Limitations include:

  • Computational demands: Requires 8×80 GB GPUs and DeepSpeed pipeline parallelization for efficient training.
  • Discrete label scale: 10-level SPIR ratings do not resolve fine-grained propagation dynamics.
  • Input sequence constraints: Very long graph-encoded sequences stress LLM context windows.
  • Threshold-based artifact risks: Label boundaries may induce discontinuities in predictions.
  • Platform drift: Continuous re-annotation may be necessary as platform conventions and content types evolve.

Potential applications facilitated by XS-Video and NetGPT encompass:

  • Early ROI estimation in advertising and marketing for short-video feeds
  • Recommendation engine enhancement via long-term influence predictions
  • Rapid identification of viral carriers in misinformation/public-interest monitoring
  • Influencer ranking and network analysis
  • User behavior and community trend modeling
  • Cross-platform propagation forecasting for viral bridge detection

7. Significance for Short-Video Propagation Research

XS-Video sets a new benchmark in short-video propagation analysis, advancing from conventional single-metric, single-platform forecasting to multi-dimensional, cross-platform influence rating. By providing an extensive, real-world, heterogeneous graph dataset and robust modeling frameworks—most notably NetGPT—XS-Video catalyzes empirical investigation of commercial value estimation, public opinion tracking, recommendation systems, ROI prediction, and networked trend analysis at unprecedented scale and fidelity. This resource addresses prevailing limitations in previous datasets and models, and its graph/LLM fusion methodology is broadly extensible to related propagation and influence estimation domains.
