
Short-Video Propagation Influence Rating

Updated 12 November 2025
  • Short-video Propagation Influence Rating (SPIR) is a framework that predicts a ten-level influence score based on cross-platform metrics like views, likes, and shares.
  • It leverages the XS-Video dataset, a large-scale benchmark capturing heterogeneous propagation data across major short-video platforms with detailed annotations.
  • NetGPT, the integrated model, combines a heterogeneous graph convolutional network with LLM-based reasoning to achieve superior accuracy and lower predictive error.

Short-video Propagation Influence Rating (SPIR) addresses the problem of forecasting the long-term, multidimensional influence of short-video content across multiple online platforms. Formally, given a newly posted short-video within a large heterogeneous propagation network, SPIR seeks to predict its future influence as an integer rating from 0 to 9, integrating cross-platform views, likes, shares, collects, comments, and followership statistics over a multi-week time frame. This task transcends single-scalar popularity forecasting, instead requiring models to reason over rich, highly structured, and heterogeneous data. The SPIR benchmark is instantiated with the XS-Video dataset, which provides a large cross-platform propagation graph, and is addressed with NetGPT, a large graph model composed of a heterogeneous GNN backbone integrated with LLM-based reasoning.

1. SPIR: Formal Task Definition

Let $G=(V,E,S)$ denote the propagation graph, where $V$ is the set of nodes (including videos, platforms, topics, comments, interaction statistics, and other metadata), $E$ is the set of directed, heterogeneous edges encoding multimodal relations, and $S \subseteq V$ is the set of video nodes of interest. For each video node $v \in S$, the SPIR task requires a predictor $\hat{y}_v \approx y_v$, where $y_v \in \{0,1,\ldots,9\}$ is the true, multi-dimensional influence score of the video as aggregated across five platforms in the two weeks post-publication.

Unlike traditional popularity prediction tasks that forecast a scalar indicator (e.g., number of views or likes) within a short window, SPIR fuses the relative and absolute magnitudes of views, likes, shares, collects, comments, and follower count, across heterogeneous platforms with varying statistical profiles, to create a discrete, ten-level influence rating.

2. XS-Video Dataset: Construction and Annotation

XS-Video is the first large-scale, cross-platform short-video propagation dataset. Its composition is as follows:

| Statistic | Value |
|---|---|
| Platforms | Douyin, Kuaishou, Xigua, Toutiao, Bilibili |
| Distinct short-videos | 117,720 |
| Samples | 381,926 |
| Topics | 535 |
| Features per sample | Views, likes, shares, collects, comments, fans, comment text |
| Content modalities | Video (MP4), title, description, topic, times, durations |

Data collection spanned from 2024-11-25 to 2025-01-02; newly posted videos were collected daily for each trending topic and subsequently re-sampled every two days up to two weeks. All user identifiers were anonymized.

Cross-platform alignment is necessary due to heterogeneity in audience size and engagement rates (e.g., Douyin's DAU of ~0.6B vs. Xigua's MAU of ~0.2B). For each platform $*$ and its set of multi-posted videos $S_*$, an alignment factor $\alpha_*$ is computed by minimizing the mean squared percentage error (MSPE) between raw indicators on Kuaishou (the "central platform") and those on platform $*$. Specifically:

$$\alpha_* = \arg\min_{\alpha} \frac{1}{|S_*|} \sum_{s\in S_*} \left[ \frac{\mathrm{Ind}_{\mathrm{Kuaishou}}(s) - \alpha \cdot \mathrm{Ind}_{*}(s)}{\mathrm{Ind}_{\mathrm{Kuaishou}}(s)+1} \right]^2$$

Subsequently, all indicators for each video are rescaled to the Kuaishou reference, and the final two-week aggregates determine ratings via stepped thresholds: if all five indicators are zero, $y_v = 0$; otherwise, the level $l \in \{1, \ldots, 9\}$ is the smallest index for which any aligned indicator surpasses the $l$-th threshold (e.g., $l=3$ requires views $> 10^5$ and likes $> 20$). This process produces a label distribution naturally peaked at level 3.
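
Since the MSPE objective above is quadratic in the scalar $\alpha_*$, it has a closed-form minimizer. The following NumPy sketch (an assumed implementation; the paper does not specify its solver) computes the factor for one indicator on one platform:

```python
import numpy as np

def alignment_factor(ind_kuaishou: np.ndarray, ind_other: np.ndarray) -> float:
    """Minimize the MSPE between Kuaishou indicators and alpha-scaled indicators
    of another platform, over videos posted on both.

    The objective (1/|S|) * sum_s [(K_s - alpha * I_s) / (K_s + 1)]^2 is quadratic
    in alpha, so the optimum is a weighted least-squares ratio.
    """
    w = 1.0 / (ind_kuaishou + 1.0) ** 2          # per-video weights from the denominator
    num = np.sum(w * ind_kuaishou * ind_other)   # sum_s w_s * K_s * I_s
    den = np.sum(w * ind_other ** 2)             # sum_s w_s * I_s^2
    return float(num / den) if den > 0 else 1.0

# Example: views of the same videos observed on Kuaishou and another platform.
views_ks = np.array([1200.0, 5300.0, 80.0])
views_other = np.array([400.0, 1800.0, 30.0])
alpha = alignment_factor(views_ks, views_other)  # scale factor toward the Kuaishou reference
```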

3. Propagation Graph and Labeling Formalism

The propagation graph $G=(V, E, S)$ has:

  • $V$ comprising node types video, platform, topic, title, description, time, ctime, video_time, comment, likes, collects, views, shares, comments, and fans,
  • $E$ encoding relations such as "is_platform_of", "is_topic_of", "has_same_author_as", "has_same_topic_as", and "is_history_of", among others, totaling approximately $1.67 \times 10^9$ directed edges; a minimal storage sketch follows this list.
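
One way to hold such a typed graph in memory is PyTorch Geometric's HeteroData container; the node counts, feature dimensions, and the single relation shown below are purely illustrative and not taken from the paper.

```python
import torch
from torch_geometric.data import HeteroData

# Minimal heterogeneous propagation graph with two node types and one relation;
# counts and feature sizes are placeholders, not XS-Video statistics.
data = HeteroData()
data["video"].x = torch.randn(4, 3584)   # e.g., pooled ViT features per video node
data["topic"].x = torch.randn(2, 1024)   # e.g., RoBERTa features per topic node

# Directed "is_topic_of" edges (topic -> video) as a 2 x num_edges index tensor.
data["topic", "is_topic_of", "video"].edge_index = torch.tensor(
    [[0, 0, 1, 1],    # source topic indices
     [0, 1, 2, 3]]    # target video indices
)
print(data)
```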

Ratings are formally induced by the labeling function:

$$L(v) = \mathrm{Level}\left(\{\alpha_* \cdot \mathrm{Ind}_*(v) : * \text{ over all platforms}\}\right)$$

where $\mathrm{Level}(\cdot)$ encodes the stepped-threshold discretization defined above.
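
A schematic implementation of $\mathrm{Level}(\cdot)$ is sketched below. Only the level-3 example thresholds (views $> 10^5$, likes $> 20$) come from the source; the remaining thresholds and the top-down scanning convention are placeholders for illustration.

```python
# Hypothetical stepped thresholds on aligned indicators; only the level-3 values
# for views (>1e5) and likes (>20) come from the source, the rest are placeholders.
THRESHOLDS = {
    l: {"views": 10 ** (l + 2), "likes": 20 * 10 ** (l - 3)} for l in range(1, 10)
}

def level(aligned: dict) -> int:
    """L(v): map alpha-aligned two-week indicators to an influence level in {0,...,9}."""
    if all(v == 0 for v in aligned.values()):
        return 0
    # Scan from the top level down and return the first level whose threshold is
    # surpassed by any aligned indicator (the paper's exact convention may differ).
    for l in range(9, 0, -1):
        if any(aligned.get(k, 0) > t for k, t in THRESHOLDS[l].items()):
            return l
    return 1

print(level({"views": 2e5, "likes": 25, "shares": 0}))  # -> 3 under these placeholder thresholds
```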

4. NetGPT: Model Architecture and Training Regime

NetGPT integrates a heterogeneous relational graph convolutional network (RGCN) with LLMs, proceeding via a three-stage training sequence.

Stage I: Heterogeneous Graph Pretraining

  • Node representation extraction depends on modality:
    • Video: $f_v^{raw} = \text{AvgPool}(\text{ViT}(v)) \in \mathbb{R}^{3584}$
    • Text (platform, topic, etc.): $f_v^{raw} = \text{RoBERTa}(v) \in \mathbb{R}^{1024}$
    • Time: sinusoidal positional encoding in $\mathbb{R}^{512}$
    • Scalar statistics: $f_v^{raw} = \log(v+1) \in \mathbb{R}^{1}$
    • Comments: concatenated text and time encodings ($\mathbb{R}^{1024}$ for text, $\mathbb{R}^{512}$ for time)
  • Embeddings are updated via a two-layer RGCN:

$$F' = \{ f'_v \} = \text{GNN}(F^{raw}, E), \quad f'_v \in \mathbb{R}^{d_g}$$

  • Pretraining objective:

$$\hat{y}_v = 9 \cdot \sigma(W_1 f'_v + b_1)$$

$$L_{pt} = \frac{1}{|S_{tra}|} \sum_{v \in S_{tra}} \mathrm{SmoothL}_1(\hat{y}_v, y_v)$$
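
A minimal PyTorch sketch of the Stage I objective: any backbone producing $d_g$-dimensional node embeddings (the RGCN itself is abstracted away here) feeds a sigmoid head rescaled to $[0, 9]$, trained with Smooth-L1 against the integer labels. Dimensions and batch contents are illustrative.

```python
import torch
import torch.nn as nn

d_g = 256  # illustrative GNN embedding size

class InfluenceHead(nn.Module):
    """Pretraining head: y_hat = 9 * sigmoid(W1 f'_v + b1)."""
    def __init__(self, d_g: int):
        super().__init__()
        self.linear = nn.Linear(d_g, 1)

    def forward(self, f_prime: torch.Tensor) -> torch.Tensor:
        return 9.0 * torch.sigmoid(self.linear(f_prime)).squeeze(-1)

head = InfluenceHead(d_g)
criterion = nn.SmoothL1Loss()                 # Smooth-L1 as in L_pt

# Stand-in for RGCN outputs f'_v on a batch of training video nodes, plus labels.
f_prime = torch.randn(32, d_g)
y = torch.randint(0, 10, (32,)).float()

loss = criterion(head(f_prime), y)            # L_pt averaged over the batch
loss.backward()
```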

Stage II: Supervised Language Fine-tuning

  • The pretrained GNN and LLM (e.g., Qwen2-VL-7B) are frozen.
  • Graph embedding tokens are projected:

$$e_v = W_2 f'_v + b_2, \quad e_v \in \mathbb{R}^{d_{lm}}$$

  • The instruction prompt (tokenized, with the <|graph_pad|> embedding replaced by $e_v$) is given as input:

    "You are a helpful assistant. Graph: (v,E). Please read the JSON-formatted video sample x_v, and predict its final propagation influence level (0–9)."
  • Training objective:

$$L_{slf} = -\sum \log P_{LLM}(\text{"The influence level is } y_v\text{"} \mid \text{ins})$$
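
The graph-token mechanism can be sketched as follows: the frozen GNN embedding is projected to the LLM hidden size and spliced into the prompt's token embeddings at the <|graph_pad|> position, after which the usual next-token loss on the answer string gives $L_{slf}$. The shapes and splice helper below are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

d_g, d_lm = 256, 3584                  # illustrative GNN / LLM hidden sizes
proj = nn.Linear(d_g, d_lm)            # graph-to-LLM projection e_v = W2 f'_v + b2

def splice_graph_token(token_embeds: torch.Tensor,
                       graph_pad_pos: int,
                       f_prime_v: torch.Tensor) -> torch.Tensor:
    """Replace the embedding of the <|graph_pad|> placeholder token with e_v.

    token_embeds: (seq_len, d_lm) embeddings of the tokenized instruction prompt.
    graph_pad_pos: index of the <|graph_pad|> token in that sequence.
    f_prime_v:     (d_g,) frozen GNN embedding of the video node.
    """
    e_v = proj(f_prime_v)              # project into the LLM embedding space
    out = token_embeds.clone()
    out[graph_pad_pos] = e_v           # splice the graph token in place
    return out

# The spliced sequence is then fed to the frozen LLM, and next-token cross-entropy
# on the answer "The influence level is {y_v}" yields L_slf.
prompt_embeds = torch.randn(128, d_lm)
spliced = splice_graph_token(prompt_embeds, graph_pad_pos=5, f_prime_v=torch.randn(d_g))
```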

Stage III: Task-oriented Predictor Fine-tuning

  • The last four decoder layers of the LLM, together with $W_2$ and $b_2$, are unfrozen, and a new regression head is introduced.
  • Final output:

$$\tilde{z}_v = 9 \cdot \sigma(W_3 f_h + b_3)$$

where $f_h$ is the final hidden state of the assistant's end-of-sequence token.

  • Fine-tune using:

$$L_{ft} = \frac{1}{|S_{tra}|} \sum_{v \in S_{tra}} \mathrm{SmoothL}_1(\tilde{z}_v, y_v)$$
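
Stage III swaps the textual answer for a numeric head over $f_h$; a compact sketch (with the LLM forward pass abstracted away, dimensions illustrative) is:

```python
import torch
import torch.nn as nn

d_lm = 3584                                # illustrative LLM hidden size
reg_head = nn.Linear(d_lm, 1)              # W3, b3

def influence_score(f_h: torch.Tensor) -> torch.Tensor:
    """z_tilde_v = 9 * sigmoid(W3 f_h + b3), with f_h the EOS-token hidden state."""
    return 9.0 * torch.sigmoid(reg_head(f_h)).squeeze(-1)

criterion = nn.SmoothL1Loss()              # L_ft
f_h = torch.randn(16, d_lm)                # stand-in for a batch of EOS hidden states
y = torch.randint(0, 10, (16,)).float()
loss = criterion(influence_score(f_h), y)

# At inference, the continuous score is rounded to the nearest integer level.
levels = influence_score(f_h).round().long()
```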

5. Evaluation Protocol and Comparative Results

The evaluation treats SPIR both as 10-class classification and as regression. Given a test set $\{(y_i, \hat{y}_i) \mid i = 1, \ldots, M\}$:

  • Accuracy (ACC): $\mathbb{P}(y_i = \mathrm{round}(\hat{y}_i))$
  • Mean Squared Error (MSE): $\mathbb{E}[(y_i - \hat{y}_i)^2]$
  • Mean Absolute Error (MAE): $\mathbb{E}[|y_i - \hat{y}_i|]$
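
These three metrics follow directly from the continuous predictions; a small NumPy sketch (function and variable names assumed):

```python
import numpy as np

def spir_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """ACC on rounded predictions, plus MSE and MAE on the raw predictions."""
    acc = float(np.mean(y_true == np.rint(y_pred)))
    mse = float(np.mean((y_true - y_pred) ** 2))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    return {"ACC": acc, "MSE": mse, "MAE": mae}

y_true = np.array([3, 3, 5, 0])
y_pred = np.array([2.8, 3.4, 4.1, 0.2])
print(spir_metrics(y_true, y_pred))  # e.g., {'ACC': 0.75, 'MSE': ..., 'MAE': ...}
```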

Table: Main Results on XS-Video Test Split (117,720 videos; train/test = 4:1)

| Model | Input | ACC | MSE | MAE |
|---|---|---|---|---|
| GCN | Graph | 0.4474 | 1.0623 | 0.7599 |
| HAN | Graph | 0.2619 | 2.6666 | 1.2724 |
| HetSANN | Graph | 0.5078 | 0.8917 | 0.6803 |
| RGCN | Graph | 0.6313 | 0.7801 | 0.5844 |
| Mistral-7B | Text | 0.5387 | 2.1000 | 0.8123 |
| InternLM2.5-7B | Text | 0.5268 | 2.1110 | 0.8064 |
| Llama-3.1-8B | Text | 0.5290 | 2.1215 | 0.8081 |
| Qwen2.5-7B | Text | 0.5469 | 2.0820 | 0.7688 |
| Llava-Next-Video | Text+Video | 0.5694 | 1.8503 | 0.7315 |
| Qwen2-VL-7B | Text+Video | 0.5884 | 1.6820 | 0.6629 |
| NetGPT | Graph+Text | 0.6777 | 0.7169 | 0.5457 |

NetGPT produces the highest accuracy (+7.3% relative to RGCN), lowest MSE (–8.1%), and lowest MAE (–6.6%) among graph baselines, and outperforms the strongest multimodal LLM (Qwen2-VL) by +15.2% ACC, –57.4% MSE, and –17.7% MAE.

Ablation experiments show that removing video features (NetGPT-V), disabling video–video or interactive edges (NetGPT-VV/IV), omitting comment edges (NetGPT-CV), or skipping Stage II (NetGPT-SLF) each impair performance; likewise, reducing LLM capacity results in degradation.

With less than 3 days of observation, NetGPT retains a lead over RGCN, although predictive error increases for all models. Longer observation periods reduce error as expected.

6. Strengths, Limitations, and Interpretative Context

Strengths:

  • Comprehensive, real-world propagation dataset spanning five major Chinese short-video platforms, with rich temporal, multimodal, and relational annotation.
  • Heterogeneous propagation graph comprises 5.5M nodes and 1.7B directed edges, enabling flexible large-graph processing.
  • The three-stage NetGPT framework expressly integrates multimodal graph representation learning with instruction-based LLM reasoning.

Limitations:

  • Considerable computational expense due to large-scale RGCN pretraining and LLM fine-tuning (8 × A800 80GB GPUs required).
  • Some overfitting risk for smaller platforms (e.g., Bilibili) and potential label distribution mismatch between frequent and rare topics.
  • SPIR rating granularity is fixed at 10 levels; future work could investigate continuous-valued or finer-grained influence metrics.

A plausible implication is that broader deployment of SPIR-like frameworks could face practical resource barriers, and downstream tasks may benefit from adaptive label definitions or platform-specific calibration.

7. Practical Applications and Future Directions

SPIR furnishes an actionable foundation for several applications:

  • Commercial analytics: anticipatory identification of high-influence content to inform targeted advertising and partnership decisions.
  • Public-opinion monitoring: projecting potential societal impact of emergent or policy-relevant video content.
  • Recommendation systems: integrating long-term influence forecasts to improve engagement and ranking quality.
  • User-behavior modeling: capturing the transfer and amplification dynamics of user interactions and content virality in heterogeneous networks.

This suggests further exploration of continuous influence metrics, better transferability across platforms, and incorporation into production systems for content moderation, trending detection, or influence maximization.

In sum, the SPIR framework and XS-Video dataset provide a reproducible, large-scale benchmark for forecasting multidimensional, cross-platform short-video influence. The NetGPT pipeline demonstrates the feasibility and benefits of unifying heterogeneous graph-based pretraining with LLM reasoning for long-term propagation analysis (Xue et al., 31 Mar 2025).
