Micro-Video Recommendation Systems

Updated 31 March 2026
  • Micro-video recommendation systems are specialized platforms that deliver personalized, dynamic video streams by integrating multimodal cues and sequential models.
  • They address challenges like duration bias and rapid content turnover using techniques such as adversarial debiasing, contrastive self-supervision, and temporal trend routing.
  • Cutting-edge approaches, including graph-based reasoning, multi-interest user modeling, and end-to-end multimodal fusion, drive notable gains in accuracy, fairness, and engagement.

Micro-video recommendation systems enable platforms such as TikTok, Kuaishou, and WeChat Channels to deliver highly personalized, dynamic video streams to users. These systems confront unique technical challenges and opportunities—ranging from severe interaction sparsity and extremely short content lifespans to duration bias in implicit feedback and the necessity to model multimodal semantics at scale. Research in this area has produced diverse architectures that integrate sequential modeling, graph-based reasoning, contrastive and adversarial learning, and multimodal fusion, with rigorous protocol and metric design to address production requirements and real-world biases.

1. Duration Bias and Debiased Evaluation in Micro-Video Recommendation

A foundational challenge in micro-video recommendation is duration bias: longer videos naturally accumulate greater watch time, making naïve ranking by watch time—still prevalent in deployed systems—systematically favor longer videos regardless of true user engagement. Empirical studies on platforms such as TikTok and Kuaishou show that both total watch time and watch-time-per-impression (WTPI) metrics are fundamentally biased: the former favors long videos, while the latter over-compensates toward ultra-short clips, neither of which yields fair or accurate recommendations (Zheng et al., 2022).

To address this, the Watch-Time-Gain (WTG) metric standardizes observed watch time for each video within fixed-duration bins, rendering the engagement signal uncorrelated with absolute duration. Given a video $v$ of duration $d_v$ and observed watch time $\operatorname{WT}_{uv}$ for user $u$, compute:

  • the bin $B(v)$, the duration bucket containing $v$ (e.g., of 1-second width),
  • $\mu_{B(v)}$, $\sigma_{B(v)}$: the mean and standard deviation of watch time over all videos in $B(v)$,
  • $\operatorname{WTG}_{uv} = \frac{\operatorname{WT}_{uv} - \mu_{B(v)}}{\sigma_{B(v)}}$.

Aggregated recommendation quality is reported as $\mathrm{WTG}@K$ (mean WTG over the top-$K$ recommendations) and $\mathrm{DCWTG}@K$ (discounted cumulative WTG) over ranked lists.
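The binning and standardization above are straightforward to compute from interaction logs. The following sketch assumes a pandas DataFrame with illustrative `duration` and `watch_time` columns (column names and the log2 rank discount are assumptions, not from the paper's released code):

```python
import numpy as np
import pandas as pd

def watch_time_gain(logs: pd.DataFrame, bin_width: float = 1.0) -> pd.Series:
    """Standardize watch time within fixed-width duration bins (WTG)."""
    bins = np.floor(logs["duration"] / bin_width)            # B(v): duration bucket
    grouped = logs.groupby(bins)["watch_time"]
    mu = grouped.transform("mean")                           # mu_{B(v)}
    sigma = grouped.transform("std").fillna(1.0).replace(0.0, 1.0)  # sigma_{B(v)}; guard degenerate bins
    return (logs["watch_time"] - mu) / sigma                 # WTG_{uv}

def wtg_at_k(wtg_ranked: np.ndarray, k: int = 10) -> float:
    """Mean WTG over the top-K items of a ranked list (WTG@K)."""
    return float(np.mean(wtg_ranked[:k]))

def dcwtg_at_k(wtg_ranked: np.ndarray, k: int = 10) -> float:
    """Discounted cumulative WTG over the top-K items (DCWTG@K)."""
    discounts = 1.0 / np.log2(np.arange(2, k + 2))           # standard log2 rank discount (assumed)
    return float(np.sum(wtg_ranked[:k] * discounts))
```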

The Debiased Video Recommendation (DVR) architecture operationalizes this metric via adversarial learning: a backbone regression model predicts $\hat{Y}_{\mathrm{WTG}}$, trained to match WTG, while an adversarial head (via gradient reversal) minimizes the ability to recover video duration from the learned representation, enforcing duration invariance. Experiments across two platforms and multiple backbones show that replacing the duration-based target with WTG and introducing adversarial debiasing yields $\geq$200–500% gains in $\mathrm{WTG}@10$, drastically reducing bad-case recommendations (WT $<$ 2 s) and eliminating duration bias (Zheng et al., 2022).
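Gradient reversal is a standard trick for this kind of adversarial invariance. A minimal PyTorch sketch of a duration-adversarial head (layer sizes and the loss weighting are assumptions, not DVR's released configuration):

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DurationAdversary(nn.Module):
    """Head that tries to predict video duration from the backbone representation.
    Trained through the reversal layer, it pushes the backbone toward
    duration-invariant features."""
    def __init__(self, dim: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.head(GradReverse.apply(z, self.lambd)).squeeze(-1)

# Hypothetical joint objective: fit WTG while making duration unrecoverable.
# loss = mse(wtg_pred, wtg_target) + mse(adversary(z), duration)
```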

2. Multi-Interest User Modeling and Self-Supervised Learning

User preferences in micro-video platforms span multiple, often orthogonal interest facets. Relying on a single user embedding is suboptimal given the diversity of content that attracts each user. The Contrastive Multiple Interests (CMI) framework assigns each user $m$ disentangled interest embeddings by soft-assigning historical interactions to implicit "category" vectors under orthogonality constraints (Li et al., 2022). Contrastive self-supervision is used to augment robustness: for each user and each of the $m$ interests, augmented subsequences produce positive and negative pairs, and losses align matching interests while repulsing intra- and inter-user negatives, further denoising the effects of noisy click data.

CMI’s architecture combines a multi-interest encoder (responsibility-weighted sums over category-video affinities), a general-sequence encoder (global sequential GRU), and a hybrid prediction mechanism. Empirically, CMI achieves state-of-the-art improvements on Recall@K metrics (e.g., +29.7% Recall@10 on WeChat Channels), with ablation confirming that contrastive self-supervision and multi-faceted user representations yield substantial gains over both single-interest and classical multi-interest baselines (Li et al., 2022).
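A minimal sketch of the soft-assignment step and orthogonality penalty, assuming PyTorch and illustrative shapes (the contrastive losses and the GRU branch are omitted):

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiInterestEncoder(nn.Module):
    """Soft-assigns a user's interaction history to m implicit 'category' vectors
    and returns one responsibility-weighted embedding per interest."""
    def __init__(self, dim: int, m: int = 4):
        super().__init__()
        self.categories = nn.Parameter(torch.randn(m, dim) * 0.02)

    def forward(self, hist: torch.Tensor) -> torch.Tensor:
        # hist: (batch, seq_len, dim) embeddings of watched micro-videos
        affinity = hist @ self.categories.t()                  # (batch, seq, m)
        resp = affinity.softmax(dim=-1)                        # soft assignment per interaction
        return torch.einsum("bsm,bsd->bmd", resp, hist)        # (batch, m, dim) interest embeddings

    def orthogonality_penalty(self) -> torch.Tensor:
        # keep category vectors near-orthogonal so interests stay disentangled
        c = F.normalize(self.categories, dim=-1)
        off_diag = c @ c.t() - torch.eye(c.size(0), device=c.device)
        return (off_diag ** 2).sum()
```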

3. Temporal Dynamics, Trend Routing, and Sequential Modeling

Micro-video interaction patterns display rapid temporal evolution, both across and within sessions. Systems must learn both historical trend persistence and anticipate future trends, particularly for newly emergent content and shifting user interests. The Multi-trends Enhanced Dynamic Micro-video Recommendation (DMR) framework explicitly models historical preference fragments and "future" trend fragments by extracting sequence fragments from similar users whose behaviors progress beyond the current user’s history (Lu et al., 2021). All such fragments are routed into a fixed set of trend slots via capsule-style dynamic routing, allowing the model to distill both history and emergent preferences into compact representations. A time-aware attention mechanism then fuses the historical and future-trend vectors to produce the final user embedding used for ranking.
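Capsule-style dynamic routing can be sketched compactly. The version below, with assumed slot counts and iteration numbers, routes a batch of preference fragments into trend slots by iteratively increasing coupling weights toward fragments' agreement with each slot:

```python
import torch

def squash(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Capsule squash nonlinearity: keeps direction, bounds the norm to (0, 1)."""
    sq = (x ** 2).sum(dim=-1, keepdim=True)
    return (sq / (1.0 + sq)) * x / (sq + eps).sqrt()

def route_to_trends(fragments: torch.Tensor, n_slots: int = 8, iters: int = 3) -> torch.Tensor:
    """Route preference fragments into a fixed set of trend slots.

    fragments: (batch, n_frag, dim) -- the user's historical fragments plus
    'future' fragments borrowed from similar users.
    Returns (batch, n_slots, dim) trend representations.
    """
    b, n, _ = fragments.shape
    logits = torch.zeros(b, n, n_slots, device=fragments.device)   # routing logits
    for _ in range(iters):
        coupling = logits.softmax(dim=-1)                          # how much each fragment feeds each slot
        slots = squash(torch.einsum("bns,bnd->bsd", coupling, fragments))
        # fragments that agree with a slot's current direction route to it more strongly
        logits = logits + torch.einsum("bnd,bsd->bns", fragments, slots)
    return slots
```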

On large public datasets, DMR outperforms state-of-the-art sequential, attention-based, and diversification-focused models across Recall, Precision, F1, and Diversity metrics, particularly benefiting from explicitly balancing the user’s own past with analogous peer-evolved behaviors (Lu et al., 2021).

4. Multimodal and Content-Driven Architectures

Modern micro-video recommenders increasingly leverage multimodal information: video frames, audio, cover images, and textual metadata. Content-driven benchmarks such as MicroLens provide large-scale datasets with raw access to these modalities. Architectures are typically categorized into:

  • IDRec: User/item ID-based collaborative filtering,
  • VIDRec: CF fused with frozen video features,
  • VideoRec: End-to-end video encoder-based models, optionally fusing multimodal cues with learned or frozen representations.

On MicroLens, end-to-end video encoder architectures (e.g., SASRec with a trainable SlowFast or VideoMAE backbone) deliver $\approx$10–20% relative improvements in HR@10/NDCG@10 over pure ID-based or shallow fusion models, in both warm- and cold-start settings (Ni et al., 2023). However, naive frozen-feature fusion often provides negligible gain, underscoring the necessity of end-to-end adaptation. Efficient multimodal processing—including robust text/image pre-processing, layer-wise fine-tuning, and hybrid loss functions (sampled softmax, pairwise ranking)—is crucial for scalability and timeliness.
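The frozen-versus-trainable distinction amounts to whether gradients reach the video backbone. A schematic item tower, assuming PyTorch and a generic backbone module standing in for SlowFast/VideoMAE:

```python
import torch
from torch import nn

class VideoItemTower(nn.Module):
    """Maps raw video clips to item embeddings for a sequential recommender.
    With freeze_backbone=True this degenerates to VIDRec-style frozen-feature
    fusion; with False it is end-to-end VideoRec."""
    def __init__(self, backbone: nn.Module, feat_dim: int, item_dim: int,
                 freeze_backbone: bool = False):
        super().__init__()
        self.backbone = backbone                     # e.g., a SlowFast or VideoMAE encoder
        if freeze_backbone:
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.proj = nn.Linear(feat_dim, item_dim)    # adapt features to the recommender's space

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, channels, frames, H, W); backbone assumed to return (batch, feat_dim)
        return self.proj(self.backbone(clips))
```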

Meta-learning–based fusion frameworks (e.g., MetaMMF) go further by dynamically parameterizing item-specific fusion networks ("fusion as a task") using meta-features extracted from each video's visual/audio/textual (V/A/T) encodings, yielding state-of-the-art improvements and efficient convergence through CP decomposition (Liu et al., 2025).
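The "fusion as a task" idea can be illustrated by a meta-network that emits the parameters of a per-item fusion layer. A sketch under assumed dimensions (the paper compresses exactly this kind of generated weight tensor via CP decomposition; that factorization is omitted here):

```python
import torch
from torch import nn

class MetaFusion(nn.Module):
    """A meta-network maps each video's concatenated V/A/T features to the
    weights of its own small fusion layer, so every item gets item-specific fusion."""
    def __init__(self, mod_dim: int, out_dim: int):
        super().__init__()
        self.in_dim, self.out_dim = 3 * mod_dim, out_dim          # visual + audio + text
        self.meta = nn.Linear(self.in_dim, self.in_dim * out_dim + out_dim)

    def forward(self, v: torch.Tensor, a: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        x = torch.cat([v, a, t], dim=-1)                          # (batch, 3*mod_dim) meta features
        params = self.meta(x)                                     # per-item fusion parameters
        W = params[:, : self.in_dim * self.out_dim].view(-1, self.out_dim, self.in_dim)
        b = params[:, self.in_dim * self.out_dim :]
        return torch.bmm(W, x.unsqueeze(-1)).squeeze(-1) + b      # item-specific fused embedding
```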

5. Graph-Based and Higher-Order Structural Techniques

Graph neural network-based architectures address the sparse, noisy, and deeply structured nature of micro-video recommendation. Approaches such as Concept-Aware Denoising GNN (CONDE) leverage heterogeneous tripartite graphs encompassing users, videos, and text-derived concept entities. CONDE applies a personalized two-hop GRU-based denoising pipeline to extract a customized subgraph for each user, filtering away both noisy edges and irrelevant concepts before the final preference refinement stage. This yields substantial uplifts (+6–9% AUC, 0.3–0.5 NDCG@5) over standard GNNs (Liu et al., 2021).

More recent models such as MTHGNN introduce time-warped, multi-aggregator GNNs on multimodal, sequentially sliced session graphs, explicitly encoding temporal order, multi-type interactions (like/finish/comment), cross-modal attributes, and session-level transitions (Han et al., 2025). Lightweight graph-free sampling is performed for cold-start candidate scoring. Layer-wise attention on GNN aggregators, modality-level attention, and subgraph routing are similarly recurrent themes for balancing expressiveness, computational cost, and timely adaptation.

Hypergraph-based models with multi-view self-supervision, such as MHCR, build interaction, modality, and cross-modal hyperedges, employing contrastive objectives across both modality views and graph/hypergraph layers. Such methods are highly effective in cold-start regimes, with up to 11% recall and 13% NDCG gains over multimodal GCN baselines (Lyu et al., 2024).
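Across these graph and hypergraph models, the multi-view contrastive objective is typically an InfoNCE loss between paired views of the same item or user. A generic sketch (the temperature and batch construction are assumptions):

```python
import torch
import torch.nn.functional as F

def cross_view_infonce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """InfoNCE between two views (e.g., a modality view and a hypergraph view).

    z_a, z_b: (batch, dim) paired representations; row i of each view is a
    positive pair, and all other rows in the batch act as negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau                      # (batch, batch) similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)            # diagonal entries are the positives
```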

6. Self-Attention, Position Bias, and Advanced Sequential Encoding

Self-attention architectures adapted from the Transformer literature (e.g., SASRec, BERT4Rec) are increasingly prominent, but require careful adaptation to micro-video streams. The PDMRec model disentangles positional and semantic encoding by running separate self-attention blocks for item and positional embeddings, fusing only post-attention to avoid spurious coupling between micro-video semantics and system-imposed position order (Yu et al., 2022). Position-decoupled modeling, combined with position invariance enforced by contrastive learning via random subsequence reordering, produces 2–6% relative gains in Recall@50 over strong sequential baselines.
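A sketch of the decoupling, assuming PyTorch's multi-head attention and an illustrative concatenation-based fusion (PDMRec's exact fusion operator may differ):

```python
import torch
from torch import nn

class DecoupledSelfAttention(nn.Module):
    """Separate self-attention over item and positional embeddings, fused only
    post-attention, so content semantics never mix with position pre-attention."""
    def __init__(self, dim: int, heads: int = 2):
        super().__init__()
        self.item_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pos_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, items: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        # items, pos: (batch, seq_len, dim)
        item_out, _ = self.item_attn(items, items, items)
        pos_out, _ = self.pos_attn(pos, pos, pos)
        return self.fuse(torch.cat([item_out, pos_out], dim=-1))
```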

Contrastive and curriculum-based self-supervision, as in CCL4Rec, further shields the learned user embeddings from ubiquitous behavioral noise by harnessing hardness-aware augmentations (importance-scored history replacements, adaptive margins across positives/negatives, and dynamic curriculum sampling), achieving comparable accuracy with $\sim$24$\times$ lower training time and $\sim$460$\times$ faster inference than attention-based models (Zhang et al., 2022).

Advanced attention modules, such as the Self-over-Co Attention paradigm, explicitly model high-order dependencies both across interaction levels (co-attention between, e.g., "likes" and "follows") and within each level via sequential self-attention, before fusing user representations (Yao et al., 2021). These high-order multi-interest models demonstrate 2–3% AUC gains over state-of-the-art hierarchical attention networks on filtered micro-video datasets.
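Co-attention between two interaction-level sequences reduces to a learned bilinear affinity followed by row- and column-wise attention. A sketch with assumed shapes (not the paper's exact formulation):

```python
import torch
from torch import nn

class CoAttention(nn.Module):
    """Bilinear co-attention between two interaction-level sequences
    (e.g., 'like' history and 'follow' history)."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, likes: torch.Tensor, follows: torch.Tensor):
        # likes: (batch, n, dim); follows: (batch, m, dim)
        affinity = torch.einsum("bnd,de,bme->bnm", likes, self.W, follows)
        likes_ctx = affinity.softmax(dim=-1) @ follows                   # each like attends over follows
        follows_ctx = affinity.softmax(dim=1).transpose(1, 2) @ likes    # each follow attends over likes
        return likes_ctx, follows_ctx
```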

7. Debiasing, Real-World Deployment, and Fairness Considerations

Operational systems increasingly recognize the necessity of explicit debiasing. The VLDRec approach attacks the view-time (video-length) bias by relabeling training data to depend on within-group play progress, and sampling negatives from both general and length-matched groups under a multi-task loss (Quan et al., 2023). This multi-task architecture, combined with the View_Time@T evaluation metric (computed under a fixed total recommended video length constraint), aligns offline metrics with business/product goals and yields improvements both in normalized engagement and content alignment with user interests.
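The relabeling step can be sketched directly from logs: a view becomes a positive only if its play progress beats its own duration group. The bin width and quantile threshold below are assumptions, not VLDRec's published settings:

```python
import numpy as np
import pandas as pd

def relabel_by_group_progress(logs: pd.DataFrame, bin_width: float = 5.0,
                              quantile: float = 0.75) -> pd.Series:
    """Binary labels from within-group play progress.

    Assumes `logs` has `watch_time` and `duration` columns (names illustrative).
    progress = watch_time / duration; positives are views whose progress
    exceeds the chosen quantile within their own duration group.
    """
    progress = logs["watch_time"] / logs["duration"]
    groups = np.floor(logs["duration"] / bin_width)          # length-matched groups
    threshold = progress.groupby(groups).transform(lambda s: s.quantile(quantile))
    return (progress >= threshold).astype(int)
```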

Real-world deployments, such as DreamUMM (based on the Platonic Representation Hypothesis), validate the practical production relevance of unified multimodal spaces: user and video representations are mapped into a shared space using MLLM-derived embeddings, enabling inference-time user vectors via closed-form preference-weighted sums over recent histories, with an explicit cold-start variant (Candidate-DreamUMM) for candidate-only user inference (Lin et al., 2024). Infrastructure at deployable scale includes vector retrieval, latency-constrained ANN for candidate selection, feature-store integration for up-to-date histories, and A/B evaluation at $O(10^8)$ DAU scale, with statistically significant lifts (e.g., +0.87% play count, +1.8% exposed clusters).
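The closed-form user vector itself is inexpensive. A sketch of a preference-weighted sum over recent history embeddings (the exact engagement weighting used in production is not public, so the normalization here is an assumption):

```python
import torch
import torch.nn.functional as F

def user_vector(video_embs: torch.Tensor, engagement: torch.Tensor) -> torch.Tensor:
    """Closed-form user embedding in the shared multimodal space.

    video_embs: (n, dim) MLLM-derived embeddings of recently watched videos.
    engagement: (n,) nonnegative preference weights (e.g., watch-based signals).
    """
    w = engagement / engagement.sum().clamp_min(1e-8)        # normalize weights
    user = (w.unsqueeze(-1) * video_embs).sum(dim=0)         # preference-weighted sum
    return F.normalize(user, dim=0)                          # unit vector for ANN retrieval
```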

Continuous advances in graph sparsity handling, cold-start representation, session slicing, and adversarial/contrastive self-supervision are driving micro-video recommender systems toward higher accuracy, fairness, and diversity, with architectures regularly evaluated on massive multimodal datasets and aligning ever more closely with challenging production requirements (Zheng et al., 2022, Li et al., 2022, Quan et al., 2023, Ni et al., 2023, Liu et al., 2025, Lin et al., 2024).
