VK RecSys Challenge 2025

Updated 5 February 2026

VK RecSys Challenge 2025 is an industrial-scale benchmark with 40B+ interactions, 20M videos, and rich multimodal data for recommender research.
It employs rigorous temporal splitting, multi-modal content embeddings, and evaluation metrics like ROC AUC and NDCG@20 to validate model performance.
The dataset supports diverse research tasks including sequential recommendation, cold-start user ranking, and fairness analysis across global demographics.

The VK Large Short-Video Dataset (VK-LSVD) is an open, large-scale industrial dataset designed for research and benchmarking in short-video recommendation. It contains over 40 billion user–video interaction events, representing 10 million users and approximately 20 million short-video items collected over a continuous 6-month period. VK-LSVD is structured to reflect platform-scale dynamics, offering a variety of implicit and explicit user feedback signals, multi-modal content embeddings, and rich contextual metadata aligned with real-world system constraints (Poslavsky et al., 4 Feb 2026).

1. Composition, Scale, and Temporal Organization

VK-LSVD comprises 40,774,024,903 interaction events, 19,627,601 unique video items, and 10,000,000 distinct anonymized users. The dataset was collected over 27 consecutive weeks, split into weekly Parquet files structured as follows:

Training set: Weeks 1–25
Validation set: Week 26
Test set: Week 27

Each weekly Parquet file contains chronologically ordered events, with the event timestamp retrievable via concatenation of file and row indices. The interaction rate averages ≈227 million events per day when the total is spread over a 180-day (6-month) span; spread over the full 27 weeks (189 days), the same total gives roughly 216 million events per day. Each event records user_id, item_id, interaction context (place, platform, agent), watch time (timespent), and binary indicators for feedback types such as like, dislike, share, bookmark, click_on_author, and open_comments.

Feedback Signal Statistics

Feedback Type	Count	Description
timespent	n/a (value capped at 255 s)	Seconds spent watching a video
like	1,171,423,458	Binary, user liked the video
dislike	11,860,138	Binary, user disliked the video
share	262,734,328	Binary, user shared the video
bookmark	40,124,463	Binary, user bookmarked the video
click_on_author	84,632,666	Binary, profile clicks
open_comments	481,251,593	Binary, user opened comments

The density of the user–item matrix is approximately 0.0208%. Both user activity and item popularity exhibit heavy-tailed distributions typical for large recommendation platforms.
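The density figure can be reproduced from the counts reported above. The short check below is a sketch that assumes each event corresponds to a distinct user–item pair; repeat views of the same video would make the true density lower, so the result is best read as an upper bound.

```python
# Counts as reported in Section 1 of this article.
N_EVENTS = 40_774_024_903
N_USERS = 10_000_000
N_ITEMS = 19_627_601


def matrix_density(n_events: int, n_users: int, n_items: int) -> float:
    """Upper bound on user-item matrix density (one event per distinct pair assumed)."""
    return n_events / (n_users * n_items)


density_pct = 100 * matrix_density(N_EVENTS, N_USERS, N_ITEMS)
mean_per_user = N_EVENTS / N_USERS

print(f"density = {density_pct:.4f}%")                       # density = 0.0208%
print(f"mean interactions per user = {mean_per_user:.1f}")   # mean interactions per user = 4077.4
```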

2. Data Features and Embeddings

VK-LSVD includes per-video 64-dimensional content embeddings ($e_v \in \mathbb{R}^{64}$) derived by truncated SVD from the raw multimodal (visual + textual) feature space. These embeddings are stored as float16 arrays in an .npz file, with components ordered by singular-value importance, and can optionally be $\ell_2$-normalized: $\hat{e}_v = e_v / \|e_v\|_2$.

Externally supplied text embeddings ($e_t = f_T(t)$, where $t$ is textual metadata such as the video title) can be fused or concatenated with the provided content embeddings for downstream use. Further dimensionality reduction (e.g., via PCA or UMAP) can be applied for visualization or modeling. Video items are identified by item_id, and the corresponding metadata includes author_id and video duration.
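The normalization and concatenation steps described in this section can be sketched with NumPy as follows. This is an illustration on synthetic arrays, not the dataset's loading code: the array names inside item_embeddings.npz are not specified here, so loading is omitted, and the 8-dimensional text embedding and the helper names `l2_normalize` and `fuse` are assumptions made for the example.

```python
import numpy as np


def l2_normalize(emb: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Row-wise L2 normalization; float16 input is upcast to float32 first."""
    emb32 = emb.astype(np.float32)
    norms = np.linalg.norm(emb32, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return emb32 / np.maximum(norms, eps)


def fuse(content: np.ndarray, text: np.ndarray) -> np.ndarray:
    """Concatenate normalized content and text embeddings along the feature axis."""
    return np.concatenate([l2_normalize(content), l2_normalize(text)], axis=1)


# Synthetic stand-ins: 4 videos, 64-d content embeddings, 8-d text embeddings.
rng = np.random.default_rng(0)
content = rng.standard_normal((4, 64)).astype(np.float16)
text = rng.standard_normal((4, 8)).astype(np.float32)

fused = fuse(content, text)
print(fused.shape)  # (4, 72)
```

Upcasting before normalizing avoids accumulating the squared sum in float16, which has limited range and precision.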

For contextual modeling, item interactions are labeled with:

place (uint8): feed vs. search
platform (uint8): device/platform class (e.g., Android, iOS, Web)
agent (uint8): client user-agent category

Users are described by user_id, age, gender, geo (1 of 80 regions), and a train_interactions_rank for split reproducibility.

3. Structure, Access, and Research Protocols

VK-LSVD is distributed as:

Interactions: week_01.parquet – week_27.parquet (user_id, item_id, place, platform, agent, timespent, binary feedbacks)
users_metadata.parquet: user demographic attributes
items_metadata.parquet: per-video metadata
item_embeddings.npz: content embedding arrays
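The week-to-split assignment from Section 1 maps directly onto these file names. The helper below is a minimal sketch of that mapping; the root directory `vk-lsvd` and the function names are hypothetical, and only the `week_NN.parquet` naming and the 25/1/1 week split come from the description above.

```python
from pathlib import Path

TRAIN_WEEKS = range(1, 26)   # weeks 1-25
VAL_WEEKS = range(26, 27)    # week 26
TEST_WEEKS = range(27, 28)   # week 27


def week_file(week: int, root: str = "vk-lsvd") -> Path:
    """Path of one weekly interaction file, e.g. vk-lsvd/week_03.parquet."""
    if not 1 <= week <= 27:
        raise ValueError(f"week must be in 1..27, got {week}")
    return Path(root) / f"week_{week:02d}.parquet"


def split_files(root: str = "vk-lsvd") -> dict[str, list[Path]]:
    """Group the weekly files into train / val / test lists."""
    return {
        "train": [week_file(w, root) for w in TRAIN_WEEKS],
        "val": [week_file(w, root) for w in VAL_WEEKS],
        "test": [week_file(w, root) for w in TEST_WEEKS],
    }


splits = split_files()
print(len(splits["train"]), splits["val"][0].name, splits["test"][0].name)
# 25 week_26.parquet week_27.parquet
```

Each path can then be handed to a Parquet reader of choice, for example `pandas.read_parquet`.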

For sequential recommendation protocols, user interaction sequences are chronologically sorted and split according to the Global Temporal Split (GTS). The canonical setup is: input sequence $x_u = [i_1, \ldots, i_{T-1}]$ to predict $i_T$ in validation/test, with leave-one-out evaluation per user. Session definitions typically use a $\Delta t \leq 30$ minutes threshold to group user actions.
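The leave-one-out construction and the 30-minute session rule can be sketched as below. The example assumes that a per-event timestamp in seconds has already been attached to each interaction; the `(timestamp, item_id)` tuple layout and the function names are illustrative choices, not part of the dataset.

```python
from typing import Sequence

SESSION_GAP_SECONDS = 30 * 60  # the 30-minute threshold mentioned above


def leave_one_out(items: Sequence[int]) -> tuple[list[int], int]:
    """Split a chronologically ordered item sequence into (input prefix, target)."""
    if len(items) < 2:
        raise ValueError("need at least two interactions")
    return list(items[:-1]), items[-1]


def sessionize(
    events: Sequence[tuple[float, int]], gap: float = SESSION_GAP_SECONDS
) -> list[list[int]]:
    """Group time-sorted (timestamp_seconds, item_id) events into sessions.

    A new session starts whenever the gap to the previous event exceeds `gap`.
    """
    sessions: list[list[int]] = []
    prev_ts = None
    for ts, item in events:
        if prev_ts is None or ts - prev_ts > gap:
            sessions.append([])
        sessions[-1].append(item)
        prev_ts = ts
    return sessions


# Toy history for one user: the 3400 s gap before item 14 opens a second session.
events = [(0, 11), (60, 12), (600, 13), (4000, 14), (4100, 15)]
prefix, target = leave_one_out([item for _, item in events])

print(prefix, target)      # [11, 12, 13, 14] 15
print(sessionize(events))  # [[11, 12, 13], [14, 15]]
```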

4. Dataset Statistics and Diversity Analysis

VK-LSVD exhibits classic large-scale recommendation data properties:

Sparsity: Extremely low density (0.0208%) in the user–item matrix; both user and item interaction counts are heavy-tailed.
User Activity: Mean interactions per user ≈4,077, with a lower median: a cohort of highly active users coexists with many users who have far fewer interactions.
Popularity Distribution: Power-law structure; top 1% of videos account for approximately 50% of all impressions.
Temporal Patterns: Rapid user interaction cycles, with mean inter-event time of a few minutes.
Diversity: 80 geographic regions, two primary platforms, three client agent types. Content cluster analysis (e.g., k-means, $k=100$) over embeddings demonstrates balanced content diversity.
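A concentration statistic such as "the top 1% of videos account for about 50% of impressions" can be computed from a stream of item ids. The function below is a sketch on toy data; `top_share` is a hypothetical helper name, and the toy distribution is constructed only to make the arithmetic easy to follow.

```python
import math
from collections import Counter
from typing import Iterable


def top_share(item_ids: Iterable[int], top_fraction: float = 0.01) -> float:
    """Fraction of all events that fall on the most popular `top_fraction` of items."""
    counts = sorted(Counter(item_ids).values(), reverse=True)
    if not counts:
        raise ValueError("no events given")
    k = max(1, math.ceil(top_fraction * len(counts)))
    return sum(counts[:k]) / sum(counts)


# Toy data: 100 items; item 0 receives 99 events, the other 99 items one each.
toy_events = [0] * 99 + list(range(1, 100))
print(top_share(toy_events))  # 0.5
```

On the full dataset the same count would normally be done with a group-by in a dataframe or SQL engine rather than an in-memory `Counter`.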

5. Benchmarks, Evaluation, and Research Tasks

VK-LSVD supports a wide range of recommender system research contexts. Standard baseline performance results on the ur0.01_ir0.01 subset under GTS include:

Method	Coverage	ROC AUC	NDCG@20
Random	0.9645	0.5000	0.00006
Global Popularity	0.00010	0.5738	0.00244
iALS*	0.00501	0.5813	0.02623

*iALS: Implicit Alternating Least Squares, with an interaction labeled positive if timespent > 10 s.

Evaluation metrics include Hit Rate@K and NDCG@K:

$HR@K = (1/|U|) \sum_{u \in U} \mathbb{I}[v_u^* \in \text{top}_K(u)]$
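$NDCG@K = DCG@K / IDCG@K$, with $DCG@K = \sum_{j=1}^K \frac{2^{rel_{u,j}} - 1}{\log_2(j+1)}$, where $rel_{u,j}$ is the relevance of the item at rank $j$ for user $u$ and $IDCG@K$ is the $DCG@K$ of the ideal ordering.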
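A small reference implementation can help when checking metric code. The sketch below uses the standard exponential gain $2^{rel} - 1$ and assumes that the relevance list passed to `ndcg_at_k` covers the user's full ranked candidate list, so that sorting it yields the ideal ordering; the function names are illustrative.

```python
import math
from typing import Sequence


def hit_at_k(ranked: Sequence[int], target: int, k: int) -> float:
    """1.0 if the held-out item appears in the top-k of the ranking, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def dcg_at_k(rels: Sequence[float], k: int) -> float:
    """DCG@K with gain 2^rel - 1 and discount log2(rank + 1), ranks starting at 1."""
    return sum(
        (2 ** r - 1) / math.log2(j + 1) for j, r in enumerate(rels[:k], start=1)
    )


def ndcg_at_k(rels: Sequence[float], k: int) -> float:
    """NDCG@K = DCG@K / IDCG@K; defined as 0.0 when there is no relevant item."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


# One user: the held-out item 9 is ranked third.
ranked = [5, 3, 9, 1]
rels = [1.0 if item == 9 else 0.0 for item in ranked]

print(hit_at_k(ranked, 9, k=3))  # 1.0
print(hit_at_k(ranked, 9, k=2))  # 0.0
print(ndcg_at_k(rels, k=3))      # 0.5
```

With a single relevant item per user, as in leave-one-out evaluation, NDCG@K reduces to $1/\log_2(r+1)$ for a target at rank $r \le K$; here $r = 3$ gives $1/\log_2 4 = 0.5$. Averaging these per-user values over all users yields the reported metric.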

VK-LSVD enables:

Sequential recommendation (SASRec, GRU4Rec, BERT4Rec)
Cold-start item ranking (e.g., VK RecSys Challenge 2025: top-100 user ranking for new videos, evaluated by NDCG@100)
Interest drift modeling (time-evolving user factors)
Long-tail and fairness studies via demographic slices

6. Practical Guidance for Research and Best Practices

Recommended preprocessing and analysis strategies:

Filter out users with fewer than 5 interactions for model stability.
$\ell_2$-normalize video embeddings before feeding them to a model.
Construct negative samples using popularity-weighted sampling for robust pairwise learning objectives.
Group user interaction histories into sessions (e.g., $\Delta t$ threshold) for session-based approaches.
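Two of the steps above, the minimum-interaction filter and popularity-weighted negative sampling, can be sketched in plain Python. This is one possible reading of those recommendations on toy data: the `(user_id, item_id)` pair layout, the function names, and the `alpha` exponent that tempers the popularity weights are assumptions for illustration, and sampling is done with replacement.

```python
import random
from collections import Counter

MIN_INTERACTIONS = 5  # threshold recommended above

Pair = tuple[int, int]  # (user_id, item_id)


def filter_users(interactions: list[Pair], min_count: int = MIN_INTERACTIONS) -> list[Pair]:
    """Keep only pairs from users with at least `min_count` interactions."""
    per_user = Counter(u for u, _ in interactions)
    return [(u, i) for u, i in interactions if per_user[u] >= min_count]


def sample_negatives(
    interactions: list[Pair], user: int, n: int, rng: random.Random, alpha: float = 1.0
) -> list[int]:
    """Draw `n` unseen items for `user`, with probability proportional to popularity**alpha."""
    popularity = Counter(i for _, i in interactions)
    seen = {i for u, i in interactions if u == user}
    candidates = [i for i in popularity if i not in seen]
    if not candidates:
        return []
    weights = [popularity[i] ** alpha for i in candidates]
    return rng.choices(candidates, weights=weights, k=n)


# Toy log: users 1 and 3 have five interactions each, user 2 only two.
log = (
    [(1, i) for i in (10, 11, 12, 13, 14)]
    + [(2, 10), (2, 20)]
    + [(3, i) for i in (20, 20, 21, 22, 23)]
)
kept = filter_users(log)
negatives = sample_negatives(kept, user=1, n=4, rng=random.Random(0))

print(sorted({u for u, _ in kept}))  # [1, 3]
```

Setting `alpha` below 1 flattens the sampling distribution, which is a common way to avoid drawing only the most popular videos as negatives.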

Key research directions facilitated by VK-LSVD include multi-modal fusion (visual and textual feature integration), context-aware modeling leveraging rich metadata (place, platform, agent), fairness and bias assessment across geo/demographic subgroups, zero-shot/cold-start solutions using only embeddings, and continual learning to address real-world temporal drift in content/user interest.

7. Impact and Legacy in Recommender Systems Research

VK-LSVD has been adopted as the benchmark dataset for VK RecSys Challenge 2025, engaging over 800 teams in production-scale user-ranking tasks under operational constraints (e.g., at most 100 recommendations per user). Its open availability under the Apache 2.0 license has catalyzed new collaborations between academia and industry. VK-LSVD’s scale, heterogeneity, and temporal fidelity advance the state of public benchmarking for sequential, cold-start, and fairness-aware recommender system research (Poslavsky et al., 4 Feb 2026).

References

Poslavsky et al. (2026). VK-LSVD: A Large-Scale Industrial Dataset for Short-Video Recommendation.
