
Short-Video Propagation Influence Rating

Updated 12 November 2025
  • Short-video Propagation Influence Rating (SPIR) is a framework that predicts a ten-level influence score based on cross-platform metrics like views, likes, and shares.
  • It leverages the XS-Video dataset, a large-scale benchmark capturing heterogeneous propagation data across major short-video platforms with detailed annotations.
  • NetGPT, the integrated model, combines a heterogeneous graph convolutional network with LLM-based reasoning to achieve superior accuracy and lower predictive error.

Short-video Propagation Influence Rating (SPIR) addresses the problem of forecasting the long-term, multidimensional influence of short-video content across multiple online platforms. Formally, given a newly posted short-video within a large heterogeneous propagation network, SPIR seeks to predict its future influence as an integer rating from 0 to 9, integrating cross-platform views, likes, shares, collects, comments, and followership statistics over a multi-week time frame. This task transcends single-scalar popularity forecasting, instead requiring models to reason over rich, highly structured, and heterogeneous data. The SPIR benchmark is instantiated with the XS-Video dataset, which provides a large cross-platform propagation graph, and is addressed with NetGPT, a large graph model composed of a heterogeneous GNN backbone integrated with LLM-based reasoning.

1. SPIR: Formal Task Definition

Let $G=(V,E,S)$ denote the propagation graph, where $V$ is the set of nodes (including videos, platforms, topics, comments, interaction statistics, and other metadata), $E$ is the set of directed, heterogeneous edges encoding multimodal relations, and $S \subseteq V$ is the set of video nodes of interest. For each video node $v \in S$, the SPIR task requires a predictor $\hat{y}_v \approx y_v$, where $y_v \in \{0,1,\ldots,9\}$ is the true, multi-dimensional influence score of the video as aggregated across five platforms in the two weeks post-publication.

Unlike traditional popularity prediction tasks that forecast a scalar indicator (e.g., number of views or likes) within a short window, SPIR fuses the relative and absolute magnitudes of views, likes, shares, collects, comments, and follower count, across heterogeneous platforms with varying statistical profiles, to create a discrete, ten-level influence rating.

2. XS-Video Dataset: Construction and Annotation

XS-Video is the first large-scale, cross-platform short-video propagation dataset. Its composition is as follows:

| Statistic | Value |
|---|---|
| Platforms | Douyin, Kuaishou, Xigua, Toutiao, Bilibili |
| Distinct short-videos | 117,720 |
| Samples | 381,926 |
| Topics | 535 |
| Features per sample | Views, likes, shares, collects, comments, fans, comment text |
| Content modalities | Video (MP4), title, description, topic, times, durations |

Data collection spanned from 2024-11-25 to 2025-01-02; newly posted videos were collected daily for each trending topic and subsequently re-sampled every two days up to two weeks. All user identifiers were anonymized.

Cross-platform alignment is necessary due to heterogeneity in audience size and engagement rates (e.g., Douyin's DAU of ~0.6B vs. Xigua's MAU of ~0.2B). For each platform $*$ and its set of multi-posted videos $S_*$, an alignment factor $\alpha_*$ is computed by minimizing the mean squared percentage error (MSPE) between raw indicators on Kuaishou (the "central platform") and those on platform $*$. Specifically:

$$\alpha_* = \arg\min_{\alpha} \frac{1}{|S_*|} \sum_{s\in S_*} \left[ \frac{\mathrm{Ind}_{\mathrm{Kuaishou}}(s) - \alpha \cdot \mathrm{Ind}_{*}(s)}{\mathrm{Ind}_{\mathrm{Kuaishou}}(s)+1} \right]^2$$

Subsequently, all indicators for each video are rescaled to the Kuaishou reference, and the final two-week aggregates determine ratings via stepped thresholds: if all five indicators are zero, $y_v = 0$; otherwise, the level $l \in \{1, \ldots, 9\}$ is the smallest index for which any aligned indicator surpasses the $l$-th threshold (e.g., $l=3$ requires views $> 10^5$ and likes $> 20$). This process produces a label distribution naturally peaked at level 3.
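
Since the MSPE objective above is quadratic in the scalar $\alpha_*$, it has a closed-form minimizer. The following NumPy sketch (an assumed implementation; the paper does not specify its solver) computes the factor for one indicator on one platform:

```python
import numpy as np

def alignment_factor(ind_kuaishou: np.ndarray, ind_other: np.ndarray) -> float:
    """Minimize the MSPE between Kuaishou indicators and alpha-scaled indicators
    of another platform, over videos posted on both.

    The objective (1/|S|) * sum_s [(K_s - alpha * I_s) / (K_s + 1)]^2 is quadratic
    in alpha, so the optimum is a weighted least-squares ratio.
    """
    w = 1.0 / (ind_kuaishou + 1.0) ** 2          # per-video weights from the denominator
    num = np.sum(w * ind_kuaishou * ind_other)   # sum_s w_s * K_s * I_s
    den = np.sum(w * ind_other ** 2)             # sum_s w_s * I_s^2
    return float(num / den) if den > 0 else 1.0

# Example: views of the same videos observed on Kuaishou and another platform.
views_ks = np.array([1200.0, 5300.0, 80.0])
views_other = np.array([400.0, 1800.0, 30.0])
alpha = alignment_factor(views_ks, views_other)  # scale factor toward the Kuaishou reference
```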

3. Propagation Graph and Labeling Formalism

The propagation graph $G=(V, E, S)$ has:

  • $V$ comprising node types video, platform, topic, title, description, time, ctime, video_time, comment, likes, collects, views, shares, comments, and fans,
  • $E$ encoding relations such as "is_platform_of", "is_topic_of", "has_same_author_as", "has_same_topic_as", and "is_history_of", among others, totaling approximately $1.67 \times 10^9$ directed edges; a minimal storage sketch follows this list.
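
One way to hold such a typed graph in memory is PyTorch Geometric's HeteroData container; the node counts, feature dimensions, and the single relation shown below are purely illustrative and not taken from the paper.

```python
import torch
from torch_geometric.data import HeteroData

# Minimal heterogeneous propagation graph with two node types and one relation;
# counts and feature sizes are placeholders, not XS-Video statistics.
data = HeteroData()
data["video"].x = torch.randn(4, 3584)   # e.g., pooled ViT features per video node
data["topic"].x = torch.randn(2, 1024)   # e.g., RoBERTa features per topic node

# Directed "is_topic_of" edges (topic -> video) as a 2 x num_edges index tensor.
data["topic", "is_topic_of", "video"].edge_index = torch.tensor(
    [[0, 0, 1, 1],    # source topic indices
     [0, 1, 2, 3]]    # target video indices
)
print(data)
```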

Ratings are formally induced by the labeling function:

$$L(v) = \mathrm{Level}\left(\{\alpha_* \cdot \mathrm{Ind}_*(v) : * \text{ over all platforms}\}\right)$$

where $\mathrm{Level}(\cdot)$ encodes the stepped-threshold discretization defined above.
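
A schematic implementation of $\mathrm{Level}(\cdot)$ is sketched below. Only the level-3 example thresholds (views $> 10^5$, likes $> 20$) come from the source; the remaining thresholds and the top-down scanning convention are placeholders for illustration.

```python
# Hypothetical stepped thresholds on aligned indicators; only the level-3 values
# for views (>1e5) and likes (>20) come from the source, the rest are placeholders.
THRESHOLDS = {
    l: {"views": 10 ** (l + 2), "likes": 20 * 10 ** (l - 3)} for l in range(1, 10)
}

def level(aligned: dict) -> int:
    """L(v): map alpha-aligned two-week indicators to an influence level in {0,...,9}."""
    if all(v == 0 for v in aligned.values()):
        return 0
    # Scan from the top level down and return the first level whose threshold is
    # surpassed by any aligned indicator (the paper's exact convention may differ).
    for l in range(9, 0, -1):
        if any(aligned.get(k, 0) > t for k, t in THRESHOLDS[l].items()):
            return l
    return 1

print(level({"views": 2e5, "likes": 25, "shares": 0}))  # -> 3 under these placeholder thresholds
```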

4. NetGPT: Model Architecture and Training Regime

NetGPT integrates a heterogeneous relational graph convolutional network (RGCN) with LLMs, proceeding via a three-stage training sequence.

Stage I: Heterogeneous Graph Pretraining

  • Node representation extraction depends on modality:
    • Video: $f_v^{raw} = \text{AvgPool}(\text{ViT}(v)) \in \mathbb{R}^{3584}$
    • Text (platform, topic, etc.): $f_v^{raw} = \text{RoBERTa}(v) \in \mathbb{R}^{1024}$
    • Time: sinusoidal positional encoding in $\mathbb{R}^{512}$
    • Scalar statistics: $f_v^{raw} = \log(v+1) \in \mathbb{R}^{1}$
    • Comments: concatenated text and time encodings ($\mathbb{R}^{1024}$ for text, $\mathbb{R}^{512}$ for time)
  • Embeddings are updated via a two-layer RGCN:

$$F' = \{ f'_v \} = \text{GNN}(F^{raw}, E), \quad f'_v \in \mathbb{R}^{d_g}$$

  • Pretraining objective:

$$\hat{y}_v = 9 \cdot \sigma(W_1 f'_v + b_1)$$

$$L_{pt} = \frac{1}{|S_{tra}|} \sum_{v \in S_{tra}} \mathrm{SmoothL}_1(\hat{y}_v, y_v)$$
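
A minimal PyTorch sketch of the Stage I objective: any backbone producing $d_g$-dimensional node embeddings (the RGCN itself is abstracted away here) feeds a sigmoid head rescaled to $[0, 9]$, trained with Smooth-L1 against the integer labels. Dimensions and batch contents are illustrative.

```python
import torch
import torch.nn as nn

d_g = 256  # illustrative GNN embedding size

class InfluenceHead(nn.Module):
    """Pretraining head: y_hat = 9 * sigmoid(W1 f'_v + b1)."""
    def __init__(self, d_g: int):
        super().__init__()
        self.linear = nn.Linear(d_g, 1)

    def forward(self, f_prime: torch.Tensor) -> torch.Tensor:
        return 9.0 * torch.sigmoid(self.linear(f_prime)).squeeze(-1)

head = InfluenceHead(d_g)
criterion = nn.SmoothL1Loss()                 # Smooth-L1 as in L_pt

# Stand-in for RGCN outputs f'_v on a batch of training video nodes, plus labels.
f_prime = torch.randn(32, d_g)
y = torch.randint(0, 10, (32,)).float()

loss = criterion(head(f_prime), y)            # L_pt averaged over the batch
loss.backward()
```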

Stage II: Supervised Language Fine-tuning

  • The pretrained GNN and LLM (e.g., Qwen2-VL-7B) are frozen.
  • Graph embedding tokens are projected:

$$e_v = W_2 f'_v + b_2, \quad e_v \in \mathbb{R}^{d_{lm}}$$

  • The instruction prompt (tokenized, with the <|graph_pad|> embedding replaced by $e_v$) is given as input:

    "You are a helpful assistant. Graph: (v,E). Please read the JSON-formatted video sample x_v, and predict its final propagation influence level (0–9)."
  • Training objective:

$$L_{slf} = -\sum \log P_{LLM}(\text{"The influence level is } y_v\text{"} \mid \text{ins})$$
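
The graph-token mechanism can be sketched as follows: the frozen GNN embedding is projected to the LLM hidden size and spliced into the prompt's token embeddings at the <|graph_pad|> position, after which the usual next-token loss on the answer string gives $L_{slf}$. The shapes and splice helper below are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

d_g, d_lm = 256, 3584                  # illustrative GNN / LLM hidden sizes
proj = nn.Linear(d_g, d_lm)            # graph-to-LLM projection e_v = W2 f'_v + b2

def splice_graph_token(token_embeds: torch.Tensor,
                       graph_pad_pos: int,
                       f_prime_v: torch.Tensor) -> torch.Tensor:
    """Replace the embedding of the <|graph_pad|> placeholder token with e_v.

    token_embeds: (seq_len, d_lm) embeddings of the tokenized instruction prompt.
    graph_pad_pos: index of the <|graph_pad|> token in that sequence.
    f_prime_v:     (d_g,) frozen GNN embedding of the video node.
    """
    e_v = proj(f_prime_v)              # project into the LLM embedding space
    out = token_embeds.clone()
    out[graph_pad_pos] = e_v           # splice the graph token in place
    return out

# The spliced sequence is then fed to the frozen LLM, and next-token cross-entropy
# on the answer "The influence level is {y_v}" yields L_slf.
prompt_embeds = torch.randn(128, d_lm)
spliced = splice_graph_token(prompt_embeds, graph_pad_pos=5, f_prime_v=torch.randn(d_g))
```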

Stage III: Task-oriented Predictor Fine-tuning

  • The last four decoder layers of the LLM, together with $W_2$ and $b_2$, are unfrozen, and a new regression head is introduced.
  • Final output:

$$\tilde{z}_v = 9 \cdot \sigma(W_3 f_h + b_3)$$

where $f_h$ is the final hidden state of the assistant's end-of-sequence token.

  • Fine-tune using:

$$L_{ft} = \frac{1}{|S_{tra}|} \sum_{v \in S_{tra}} \mathrm{SmoothL}_1(\tilde{z}_v, y_v)$$
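
Stage III swaps the textual answer for a numeric head over $f_h$; a compact sketch (with the LLM forward pass abstracted away, dimensions illustrative) is:

```python
import torch
import torch.nn as nn

d_lm = 3584                                # illustrative LLM hidden size
reg_head = nn.Linear(d_lm, 1)              # W3, b3

def influence_score(f_h: torch.Tensor) -> torch.Tensor:
    """z_tilde_v = 9 * sigmoid(W3 f_h + b3), with f_h the EOS-token hidden state."""
    return 9.0 * torch.sigmoid(reg_head(f_h)).squeeze(-1)

criterion = nn.SmoothL1Loss()              # L_ft
f_h = torch.randn(16, d_lm)                # stand-in for a batch of EOS hidden states
y = torch.randint(0, 10, (16,)).float()
loss = criterion(influence_score(f_h), y)

# At inference, the continuous score is rounded to the nearest integer level.
levels = influence_score(f_h).round().long()
```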

5. Evaluation Protocol and Comparative Results

The evaluation treats SPIR both as 10-class classification and as regression. Given a test set $\{(y_i, \hat{y}_i) \mid i = 1, \ldots, M\}$:

  • Accuracy (ACC): $\mathbb{P}(y_i = \mathrm{round}(\hat{y}_i))$
  • Mean Squared Error (MSE): $\mathbb{E}[(y_i - \hat{y}_i)^2]$
  • Mean Absolute Error (MAE): $\mathbb{E}[|y_i - \hat{y}_i|]$
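
These three metrics follow directly from the continuous predictions; a small NumPy sketch (function and variable names assumed):

```python
import numpy as np

def spir_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """ACC on rounded predictions, plus MSE and MAE on the raw predictions."""
    acc = float(np.mean(y_true == np.rint(y_pred)))
    mse = float(np.mean((y_true - y_pred) ** 2))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    return {"ACC": acc, "MSE": mse, "MAE": mae}

y_true = np.array([3, 3, 5, 0])
y_pred = np.array([2.8, 3.4, 4.1, 0.2])
print(spir_metrics(y_true, y_pred))  # e.g., {'ACC': 0.75, 'MSE': ..., 'MAE': ...}
```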

Table: Main Results on XS-Video Test Split (117,720 videos; train/test = 4:1)

| Model | Input | ACC | MSE | MAE |
|---|---|---|---|---|
| GCN | Graph | 0.4474 | 1.0623 | 0.7599 |
| HAN | Graph | 0.2619 | 2.6666 | 1.2724 |
| HetSANN | Graph | 0.5078 | 0.8917 | 0.6803 |
| RGCN | Graph | 0.6313 | 0.7801 | 0.5844 |
| Mistral-7B | Text | 0.5387 | 2.1000 | 0.8123 |
| InternLM2.5-7B | Text | 0.5268 | 2.1110 | 0.8064 |
| Llama-3.1-8B | Text | 0.5290 | 2.1215 | 0.8081 |
| Qwen2.5-7B | Text | 0.5469 | 2.0820 | 0.7688 |
| Llava-Next-Video | Text+Video | 0.5694 | 1.8503 | 0.7315 |
| Qwen2-VL-7B | Text+Video | 0.5884 | 1.6820 | 0.6629 |
| NetGPT | Graph+Text | 0.6777 | 0.7169 | 0.5457 |

NetGPT produces the highest accuracy (+7.3% relative to RGCN), lowest MSE (–8.1%), and lowest MAE (–6.6%) among graph baselines, and outperforms the strongest multimodal LLM (Qwen2-VL) by +15.2% ACC, –57.4% MSE, and –17.7% MAE.

Ablation experiments show that removing video features (NetGPT-V), disabling video–video or interactive edges (NetGPT-VV/IV), omitting comment edges (NetGPT-CV), or skipping Stage II (NetGPT-SLF) each impair performance; likewise, reducing LLM capacity results in degradation.

With less than 3 days of observation, NetGPT retains a lead over RGCN, although predictive error increases for all models. Longer observation periods reduce error as expected.

6. Strengths, Limitations, and Interpretative Context

Strengths:

  • Comprehensive, real-world propagation dataset spanning five major Chinese short-video platforms, with rich temporal, multimodal, and relational annotation.
  • Heterogeneous propagation graph comprises 5.5M nodes and 1.7B directed edges, enabling flexible large-graph processing.
  • The three-stage NetGPT framework expressly integrates multimodal graph representation learning with instruction-based LLM reasoning.

Limitations:

  • Considerable computational expense due to large-scale RGCN pretraining and LLM fine-tuning (8 × A800 80GB GPUs required).
  • Some overfitting risk for smaller platforms (e.g., Bilibili) and potential label distribution mismatch between frequent and rare topics.
  • SPIR rating granularity is fixed at 10 levels; future work could investigate continuous-valued or finer-grained influence metrics.

A plausible implication is that broader deployment of SPIR-like frameworks could face practical resource barriers, and downstream tasks may benefit from adaptive label definitions or platform-specific calibration.

7. Practical Applications and Future Directions

SPIR furnishes an actionable foundation for several applications:

  • Commercial analytics: anticipatory identification of high-influence content to inform targeted advertising and partnership decisions.
  • Public-opinion monitoring: projecting potential societal impact of emergent or policy-relevant video content.
  • Recommendation systems: integrating long-term influence forecasts to improve engagement and ranking quality.
  • User-behavior modeling: capturing the transfer and amplification dynamics of user interactions and content virality in heterogeneous networks.

This suggests further exploration of continuous influence metrics, better transferability across platforms, and incorporation into production systems for content moderation, trending detection, or influence maximization.

In sum, the SPIR framework and XS-Video dataset provide a reproducible, large-scale benchmark for forecasting multidimensional, cross-platform short-video influence. The NetGPT pipeline demonstrates the feasibility and benefits of unifying heterogeneous graph-based pretraining with LLM reasoning for long-term propagation analysis (Xue et al., 31 Mar 2025).
