Short-Video Propagation Influence Rating
- Short-video Propagation Influence Rating (SPIR) is a framework that predicts a ten-level influence score based on cross-platform metrics like views, likes, and shares.
- It leverages the XS-Video dataset, a large-scale benchmark capturing heterogeneous propagation data across major short-video platforms with detailed annotations.
- NetGPT, the integrated model, combines a heterogeneous graph convolutional network with LLM-based reasoning to achieve superior accuracy and lower predictive error.
Short-video Propagation Influence Rating (SPIR) addresses the problem of forecasting the long-term, multidimensional influence of short-video content across multiple online platforms. Formally, given a newly posted short-video within a large heterogeneous propagation network, SPIR seeks to predict its future influence as an integer rating from 0 to 9, integrating cross-platform views, likes, shares, collects, comments, and followership statistics over a multi-week time frame. This task transcends single-scalar popularity forecasting, instead requiring models to reason over rich, highly structured, and heterogeneous data. The SPIR benchmark is instantiated with the XS-Video dataset, which provides a large cross-platform propagation graph, and is addressed with NetGPT, a large graph model composed of a heterogeneous GNN backbone integrated with LLM-based reasoning.
1. SPIR: Formal Task Definition
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote the propagation graph, where $\mathcal{V}$ is the set of nodes—including videos, platforms, topics, comments, interaction statistics, and other metadata—$\mathcal{E}$ is the set of directed, heterogeneous edges encoding multimodal relations, and $\mathcal{V}_{\text{vid}} \subseteq \mathcal{V}$ is the set of video nodes of interest. For each video node $v \in \mathcal{V}_{\text{vid}}$, the SPIR task requires a predictor $f : (\mathcal{G}, v) \mapsto \hat{y}_v \in \{0, 1, \dots, 9\}$, where $y_v$ is the true influence rating of the video, aggregated from multi-dimensional statistics across five platforms in the two weeks post-publication.
Unlike traditional popularity prediction tasks that forecast a scalar indicator (e.g., number of views or likes) within a short window, SPIR fuses the relative and absolute magnitudes of views, likes, shares, collects, comments, and follower count, across heterogeneous platforms with varying statistical profiles, to create a discrete, ten-level influence rating.
2. XS-Video Dataset: Construction and Annotation
XS-Video is the first large-scale, cross-platform short-video propagation dataset. Its composition is as follows:
| Statistic | Value |
|---|---|
| Platforms | Douyin, Kuaishou, Xigua, Toutiao, Bilibili |
| Distinct Short-videos | 117,720 |
| Samples | 381,926 |
| Topics | 535 |
| Features per Sample | Views, likes, shares, collects, comments, fans, comment text |
| Content Modalities | Video (MP4), title, description, topic, times, durations |
Data collection spanned from 2024-11-25 to 2025-01-02; newly posted videos were collected daily for each trending topic and subsequently re-sampled every two days up to two weeks. All user identifiers were anonymized.
Cross-platform alignment is necessary due to heterogeneity in audience size and engagement rates (e.g., Douyin's DAU 0.6B vs. Xigua's MAU 0.2B). For each platform $p$ and the set $\mathcal{V}_p$ of videos posted on both $p$ and Kuaishou (the "central platform"), an alignment factor $\alpha_p$ is computed by minimizing the mean squared percentage error (MSPE) between the rescaled indicators on $p$ and the corresponding raw indicators on Kuaishou. Specifically:

$$\alpha_p = \arg\min_{\alpha} \frac{1}{|\mathcal{V}_p|} \sum_{v \in \mathcal{V}_p} \left( \frac{\alpha\, x_v^{(p)} - x_v^{(K)}}{x_v^{(K)}} \right)^{2},$$

where $x_v^{(p)}$ and $x_v^{(K)}$ denote the same indicator for video $v$ on platform $p$ and on Kuaishou, respectively.
Subsequently, all indicators for each video are rescaled to the Kuaishou reference, and the final two-week aggregates determine ratings via stepped thresholds: if all five indicators are zero, the rating is $y_v = 0$; otherwise, the rating is the highest level $\ell$ whose threshold is surpassed by at least one aligned indicator (each of views, likes, shares, collects, and comments has its own monotonically increasing threshold sequence). This process produces a label distribution naturally peaked at level 3.
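Under these definitions, the MSPE-optimal alignment factor has a closed form, and the stepped-threshold rating rule can be sketched directly. In the sketch below, the `THRESHOLDS` values and the `alignment_factor`/`rate` helpers are illustrative assumptions, not the paper's actual constants:

```python
import numpy as np

def alignment_factor(x_p, x_k):
    """Closed-form minimizer of the MSPE ((alpha*x_p - x_k)/x_k)^2
    averaged over multi-posted videos. x_p: raw indicator values on
    platform p; x_k: the same indicator on Kuaishou."""
    r = np.asarray(x_p, float) / np.asarray(x_k, float)
    return r.mean() / (r ** 2).mean()

# Illustrative per-level thresholds (levels 1..9); NOT the paper's values.
THRESHOLDS = {
    "views": [10, 100, 1_000, 10_000, 100_000,
              1_000_000, 5_000_000, 20_000_000, 100_000_000],
    "likes": [1, 10, 100, 1_000, 10_000,
              100_000, 500_000, 2_000_000, 10_000_000],
}

def rate(indicators):
    """Stepped-threshold rating: 0 if all indicators are zero, else the
    highest level whose threshold is surpassed by any indicator."""
    if all(v == 0 for v in indicators.values()):
        return 0
    level = 1
    for l in range(1, 10):  # candidate levels 1..9
        if any(indicators[k] > THRESHOLDS[k][l - 1]
               for k in indicators if k in THRESHOLDS):
            level = l
    return level
```

With identical indicators on both platforms, `alignment_factor` returns 1, as expected; with the toy thresholds above, a video with 5,000 aligned views and 20,000 aligned likes would be rated level 5 (driven by likes).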
3. Propagation Graph and Labeling Formalism
The propagation graph has:
- Node types ($\mathcal{V}$, ~5.5M nodes): video, platform, topic, title, description, time, ctime, video_time, comment, likes, collects, views, shares, comments, fans.
- Edge types ($\mathcal{E}$): "is_platform_of", "is_topic_of", "has_same_author_as", "has_same_topic_as", "is_history_of", among others, totaling approximately 1.7B directed edges.
Ratings are formally induced by the labeling function
$$y_v = \Phi\big(\tilde{x}_v^{\text{views}}, \tilde{x}_v^{\text{likes}}, \tilde{x}_v^{\text{shares}}, \tilde{x}_v^{\text{collects}}, \tilde{x}_v^{\text{comments}}\big),$$
where $\tilde{x}_v$ are the Kuaishou-aligned two-week aggregates and $\Phi$ encodes the stepped-threshold discretization defined above.
4. NetGPT: Model Architecture and Training Regime
NetGPT integrates a heterogeneous relational graph convolutional network (RGCN) with LLMs, proceeding via a three-stage training sequence.
Stage I: Heterogeneous Graph Pretraining
- Node representation extraction depends on modality:
  - Video: features from a pretrained visual encoder, pooled into a fixed-dimensional embedding in $\mathbb{R}^d$
  - Text (platform, topic, title, description, etc.): embeddings from a pretrained text encoder in $\mathbb{R}^d$
  - Time: sinusoidal positional encoding in $\mathbb{R}^d$
  - Scalar statistics (views, likes, etc.): numeric values projected into $\mathbb{R}^d$
  - Comments: concatenation of the text encoding and the time encoding
- Embeddings are updated via a two-layer RGCN:
$$h_v^{(l+1)} = \sigma\!\left( W_0^{(l)} h_v^{(l)} + \sum_{r \in \mathcal{R}} \sum_{u \in \mathcal{N}_v^{r}} \frac{1}{|\mathcal{N}_v^{r}|}\, W_r^{(l)} h_u^{(l)} \right)$$
- Pretraining objective: supervised prediction of the influence level for labeled video nodes (e.g., cross-entropy over the ten classes).
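The relational message-passing update above can be sketched densely in NumPy. This is a minimal illustration of the standard RGCN layer (mean-normalized per-relation aggregation plus a self-loop transform), not NetGPT's actual implementation:

```python
import numpy as np

def rgcn_layer(H, edges, W_self, W_rel):
    """One relational GCN layer:
    h'_v = ReLU(W_self h_v + sum_r sum_{u in N_r(v)} (1/|N_r(v)|) W_r h_u).
    H: (n, d) node features; edges: dict relation -> list of (src, dst);
    W_self: (d, d); W_rel: dict relation -> (d, d) weight matrix."""
    n, d = H.shape
    out = H @ W_self.T                       # self-loop term W_0 h_v
    for rel, pairs in edges.items():
        msg = np.zeros((n, d))
        deg = np.zeros(n)
        for u, v in pairs:                   # accumulate W_r h_u at dst v
            msg[v] += H[u] @ W_rel[rel].T
            deg[v] += 1
        out += msg / np.maximum(deg, 1)[:, None]  # mean over N_r(v)
    return np.maximum(out, 0.0)              # ReLU nonlinearity
```

Stacking two such layers (the second consuming the first's output) gives the two-layer update used in Stage I; a real implementation would use sparse adjacency and basis-decomposed relation weights for the ~1.7B-edge graph.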
Stage II: Supervised Language Fine-tuning
- The pretrained GNN and the LLM backbone (e.g., Qwen2-VL-7B) are frozen; only the graph-to-token projection is trained.
- Graph embedding tokens are projected into the LLM's input space: $z_v = W_{\text{proj}} h_v + b_{\text{proj}}$, with $W_{\text{proj}} \in \mathbb{R}^{d_{\text{LLM}} \times d}$.
- The instruction prompt (tokenized, with the <|graph_pad|> placeholder embedding replaced by the projected graph embedding $z_v$) is given as input:

  "You are a helpful assistant. Graph: $(\mathcal{V}, \mathcal{E})$. Please read the JSON-formatted video sample $x_v$, and predict its final propagation influence level (0–9)."
- Training objective: next-token cross-entropy (the standard teacher-forced language-modeling loss) on the target answer tokens.
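The placeholder-replacement step can be sketched as follows. The helper name `splice_graph_token`, its signature, and the projection shapes are hypothetical illustrations of the idea of feeding a projected graph embedding through the LLM's input-embedding sequence:

```python
import numpy as np

def splice_graph_token(tok_embeds, token_ids, graph_pad_id, h_v, W_proj, b_proj):
    """Project the GNN video embedding into the LLM token space and
    overwrite the <|graph_pad|> placeholder position(s) with it.
    tok_embeds: (T, d_llm) token embeddings; token_ids: (T,) ids;
    h_v: (d_gnn,) graph embedding; W_proj: (d_llm, d_gnn)."""
    z = W_proj @ h_v + b_proj                       # project d_gnn -> d_llm
    out = np.array(tok_embeds, dtype=float)          # copy, keep other tokens
    out[np.asarray(token_ids) == graph_pad_id] = z   # replace placeholder slots
    return out
```

In a real pipeline the result would be passed to the frozen LLM as precomputed input embeddings (rather than token ids), so gradients in Stage II flow only into `W_proj` and `b_proj`.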
Stage III: Task-oriented Predictor Fine-tuning
- The last four decoder layers of the LLM and the projection layer are unfrozen; a new regression head is introduced.
- Final output:
$$\hat{y}_v = \mathrm{head}\big(h_{\text{EOS}}\big),$$
where $h_{\text{EOS}}$ is the final hidden state of the assistant's end-of-sequence token and the head maps it to a rating in $[0, 9]$.
- Fine-tune using a regression loss between the head's output and the gold rating $y_v$ (e.g., mean squared error).
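A minimal sketch of such a head, assuming a linear score on the EOS hidden state clipped to $[0, 9]$ and rounded at inference time (the exact head used in NetGPT is not specified here):

```python
import numpy as np

def predict_level(h_eos, w, b):
    """Hypothetical regression head: linear score on the end-of-sequence
    hidden state, clipped to [0, 9], rounded to an integer level."""
    score = float(np.dot(w, h_eos) + b)
    return int(round(min(max(score, 0.0), 9.0)))
```

During training, the unclipped scalar score would feed the regression loss; clipping and rounding apply only when emitting the discrete 0–9 rating.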
5. Evaluation Protocol and Comparative Results
The evaluation treats SPIR both as 10-class classification and as regression. Given test set $\mathcal{D}_{\text{test}}$:
- Accuracy (ACC): $\mathrm{ACC} = \frac{1}{|\mathcal{D}_{\text{test}}|} \sum_{v \in \mathcal{D}_{\text{test}}} \mathbb{1}[\hat{y}_v = y_v]$
- Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{|\mathcal{D}_{\text{test}}|} \sum_{v \in \mathcal{D}_{\text{test}}} (\hat{y}_v - y_v)^2$
- Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{|\mathcal{D}_{\text{test}}|} \sum_{v \in \mathcal{D}_{\text{test}}} |\hat{y}_v - y_v|$
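These three metrics follow directly from their definitions; `spir_metrics` below is a straightforward NumPy helper:

```python
import numpy as np

def spir_metrics(y_true, y_pred):
    """Compute (ACC, MSE, MAE) over integer influence ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    acc = float((y_true == y_pred).mean())          # exact-match rate
    mse = float(((y_pred - y_true) ** 2).mean())    # squared rating gap
    mae = float(np.abs(y_pred - y_true).mean())     # absolute rating gap
    return acc, mse, mae
```

Because ratings are ordinal, MSE and MAE penalize near-misses (e.g., predicting 4 for a true 3) far less than ACC does, which is why the table below reports all three.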
Table: Main Results on XS-Video Test Split (117,720 videos; train/test = 4:1)
| Model | Input | ACC | MSE | MAE |
|---|---|---|---|---|
| GCN | Graph | 0.4474 | 1.0623 | 0.7599 |
| HAN | Graph | 0.2619 | 2.6666 | 1.2724 |
| HetSANN | Graph | 0.5078 | 0.8917 | 0.6803 |
| RGCN | Graph | 0.6313 | 0.7801 | 0.5844 |
| Mistral-7B | Text | 0.5387 | 2.1000 | 0.8123 |
| InternLM2.5-7B | Text | 0.5268 | 2.1110 | 0.8064 |
| Llama-3.1-8B | Text | 0.5290 | 2.1215 | 0.8081 |
| Qwen2.5-7B | Text | 0.5469 | 2.0820 | 0.7688 |
| Llava-Next-Video | Text+Video | 0.5694 | 1.8503 | 0.7315 |
| Qwen2-VL-7B | Text+Video | 0.5884 | 1.6820 | 0.6629 |
| NetGPT | Graph+Text | 0.6777 | 0.7169 | 0.5457 |
NetGPT produces the highest accuracy (+7.3% relative to RGCN), lowest MSE (–8.1%), and lowest MAE (–6.6%) among graph baselines, and outperforms the strongest multimodal LLM (Qwen2-VL) by +15.2% ACC, –57.4% MSE, and –17.7% MAE.
Ablation experiments show that removing video features (NetGPT-V), disabling video–video or interactive edges (NetGPT-VV/IV), omitting comment edges (NetGPT-CV), or skipping Stage II (NetGPT-SLF) each impair performance; likewise, reducing LLM capacity results in degradation.
With less than 3 days of observation, NetGPT retains a lead over RGCN, although predictive error increases for all models. Longer observation periods reduce error as expected.
6. Strengths, Limitations, and Interpretative Context
Strengths:
- Comprehensive, real-world propagation dataset spanning five major Chinese short-video platforms, with rich temporal, multimodal, and relational annotation.
- Heterogeneous propagation graph comprises 5.5M nodes and 1.7B directed edges, enabling flexible large-graph processing.
- The three-stage NetGPT framework expressly integrates multimodal graph representation learning with instruction-based LLM reasoning.
Limitations:
- Considerable computational expense due to large-scale RGCN pretraining and LLM fine-tuning (8 × A800 80GB GPUs required).
- Some overfitting risk for smaller platforms (e.g., Bilibili) and potential label distribution mismatch between frequent and rare topics.
- SPIR rating granularity is fixed at 10 levels; future work could investigate continuous-valued or finer-grained influence metrics.
A plausible implication is that broader deployment of SPIR-like frameworks could face practical resource barriers, and downstream tasks may benefit from adaptive label definitions or platform-specific calibration.
7. Practical Applications and Future Directions
SPIR furnishes an actionable foundation for several applications:
- Commercial analytics: anticipatory identification of high-influence content to inform targeted advertising and partnership decisions.
- Public-opinion monitoring: projecting potential societal impact of emergent or policy-relevant video content.
- Recommendation systems: integrating long-term influence forecasts to improve engagement and ranking quality.
- User-behavior modeling: capturing the transfer and amplification dynamics of user interactions and content virality in heterogeneous networks.
This suggests further exploration of continuous influence metrics, better transferability across platforms, and incorporation into production systems for content moderation, trending detection, or influence maximization.
In sum, the SPIR framework and XS-Video dataset provide a reproducible, large-scale benchmark for forecasting multidimensional, cross-platform short-video influence. The NetGPT pipeline demonstrates the feasibility and benefits of unifying heterogeneous graph-based pretraining with LLM reasoning for long-term propagation analysis (Xue et al., 31 Mar 2025).