
VK RecSys Challenge 2025

Updated 5 February 2026
  • VK RecSys Challenge 2025 is an industrial-scale benchmark with 40B+ interactions, 20M videos, and rich multimodal data for recommender research.
  • It employs rigorous temporal splitting, multi-modal content embeddings, and evaluation metrics like ROC AUC and NDCG@20 to validate model performance.
  • The dataset supports diverse research tasks including sequential recommendation, cold-start user ranking, and fairness analysis across global demographics.

The VK Large Short-Video Dataset (VK-LSVD) is an open, large-scale industrial dataset designed for research and benchmarking in short-video recommendation. It contains over 40 billion user–video interaction events, representing 10 million users and approximately 20 million short-video items collected over a continuous 6-month period. VK-LSVD is structured to reflect platform-scale dynamics, offering a variety of implicit and explicit user feedback signals, multi-modal content embeddings, and rich contextual metadata aligned with real-world system constraints (Poslavsky et al., 4 Feb 2026).

1. Composition, Scale, and Temporal Organization

VK-LSVD comprises 40,774,024,903 interaction events, 19,627,601 unique video items, and 10,000,000 distinct anonymized users. The dataset was collected over 27 consecutive weeks, split into weekly Parquet files structured as follows:

  • Training set: Weeks 1–25
  • Validation set: Week 26
  • Test set: Week 27
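
The weekly split above can be sketched as a simple file listing. This is a minimal illustration, assuming the files follow the `week_01.parquet` to `week_27.parquet` naming pattern; the `interactions` root directory and both function names are hypothetical, not part of the dataset's tooling.

```python
from pathlib import Path


def weekly_files(root, first, last):
    """Return week_XX.parquet paths for weeks first..last (inclusive)."""
    return [Path(root) / f"week_{w:02d}.parquet" for w in range(first, last + 1)]


def global_temporal_split(root="interactions"):
    """Map each split to its weekly files: weeks 1-25, 26, and 27."""
    return {
        "train": weekly_files(root, 1, 25),
        "validation": weekly_files(root, 26, 26),
        "test": weekly_files(root, 27, 27),
    }


split = global_temporal_split()  # "interactions" is a hypothetical directory
print(len(split["train"]), split["validation"][0].name, split["test"][0].name)
```

Because the split is by whole weeks, no event from the validation or test week can leak into training as long as files are assigned this way.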

Each weekly Parquet file contains chronologically ordered events, so an event's temporal position can be recovered by combining its file index and row index. Over 27 weeks (189 days), the interaction rate averages ≈216 million events per day. Each event records user_id, item_id, interaction context (place, platform, agent), watch time (timespent), and binary indicators for feedback types such as like, dislike, share, bookmark, click_on_author, and open_comments.

Feedback Signal Statistics

| Feedback Type | Count | Description |
|---|---|---|
| timespent | n/a (value capped at 255 s) | Seconds spent watching a video |
| like | 1,171,423,458 | Binary, user liked the video |
| dislike | 11,860,138 | Binary, user disliked the video |
| share | 262,734,328 | Binary, user shared the video |
| bookmark | 40,124,463 | Binary, user bookmarked the video |
| click_on_author | 84,632,666 | Binary, profile clicks |
| open_comments | 481,251,593 | Binary, user opened comments |

The density of the user–item matrix is approximately 0.0208%. Both user activity and item popularity exhibit heavy-tailed distributions typical for large recommendation platforms.
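
The density figure follows directly from the counts given in Section 1. The short check below reproduces it, under the simplifying assumption that every event is counted as a filled user–item cell (repeat views of the same video by the same user would make the true density somewhat lower).

```python
n_events = 40_774_024_903  # interaction events
n_users = 10_000_000       # distinct users
n_items = 19_627_601       # unique video items

# Fraction of the user-item matrix that is filled, counting every event.
density = n_events / (n_users * n_items)
mean_per_user = n_events / n_users

print(f"density = {density:.4%}")
print(f"mean interactions per user = {mean_per_user:,.0f}")
```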

2. Data Features and Embeddings

VK-LSVD includes per-video 64-dimensional content embeddings ($e_v \in \mathbb{R}^{64}$) derived by truncated SVD from the raw multimodal (visual + textual) feature space. These embeddings are stored as float16 arrays ordered by singular-value importance within an .npz file, and they support optional $\ell_2$-normalization: $\hat{e}_v = e_v / \|e_v\|_2$.
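
As a minimal sketch of the normalization step, the function below scales one embedding to unit Euclidean length. It uses only the standard library for clarity; the function name and the `eps` guard against zero vectors are assumptions of this example, and a real pipeline would more likely upcast the float16 arrays and vectorize the same operation.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Return vec scaled to unit Euclidean length (zero vectors stay zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Toy 64-dimensional embedding with norm 5 (only two non-zero components).
e_v = [3.0, 4.0] + [0.0] * 62
e_hat = l2_normalize(e_v)
print(e_hat[:2])  # [0.6, 0.8]
```

After normalization, the dot product of two embeddings equals their cosine similarity.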

Externally supplied text embeddings ($e_t = f_T(t)$, with $t$ as textual metadata such as video title) can be fused or concatenated with the provided content embeddings for downstream use. Further feature dimensionality reduction (e.g., via PCA or UMAP) is supported for visualization or modeling. Video items are designated by item_id, and corresponding metadata includes author_id and video duration.

For contextual modeling, item interactions are labeled with:

  • place (uint8): feed vs. search
  • platform (uint8): device/platform class (e.g., Android, iOS, Web)
  • agent (uint8): client user-agent category

Users are described by user_id, age, gender, geo (1 of 80 regions), and a train_interactions_rank for split reproducibility.

3. Structure, Access, and Research Protocols

VK-LSVD is distributed as:

  • Interactions: week_01.parquet – week_27.parquet (user_id, item_id, place, platform, agent, timespent, binary feedbacks)
  • users_metadata.parquet: user demographic attributes
  • items_metadata.parquet: per-video metadata
  • item_embeddings.npz: content embedding arrays

For sequential recommendation protocols, user interaction sequences are chronologically sorted and split according to the Global Temporal Split (GTS). The canonical setup is: input sequence $x_u = [i_1, \ldots, i_{T-1}]$ to predict $i_T$ in validation/test, with leave-one-out evaluation per user. Session definitions typically use a $\Delta t \leq 30$ minutes threshold to group user actions.
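
The two operations in this protocol, leave-one-out splitting and session grouping, can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sessionization assumes each event carries a timestamp in seconds, which would have to be derived or approximated since the text describes event order in terms of file and row indices.

```python
def leave_one_out(sequence):
    """Split a chronologically ordered item list into (input, target)."""
    if len(sequence) < 2:
        raise ValueError("need at least two interactions")
    return sequence[:-1], sequence[-1]


def sessionize(events, gap_seconds=30 * 60):
    """Group (timestamp_seconds, item_id) pairs into sessions.

    Consecutive events stay in the same session while the gap between
    them is at most gap_seconds; a larger gap starts a new session.
    """
    sessions = []
    prev_ts = None
    for ts, item in events:
        if prev_ts is None or ts - prev_ts > gap_seconds:
            sessions.append([])
        sessions[-1].append(item)
        prev_ts = ts
    return sessions


history, target = leave_one_out(["v1", "v2", "v3", "v4"])
# Hypothetical timestamps: the last gap (1801 s) exceeds the 30-minute threshold.
sessions = sessionize([(0, "a"), (600, "b"), (2400, "c"), (4201, "d")])
print(history, target, sessions)
```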

4. Dataset Statistics and Diversity Analysis

VK-LSVD exhibits classic large-scale recommendation data properties:

  • Sparsity: Extremely low density (0.0208%) in the user–item matrix; both user and item interaction counts are heavy-tailed.
  • User Activity: Mean interactions per user ≈4,077, with a lower median: a highly active minority of users pulls the mean up, while many users have far fewer interactions.
  • Popularity Distribution: Power-law structure; top 1% of videos account for approximately 50% of all impressions.
  • Temporal Patterns: Rapid user interaction cycles, with mean inter-event time of a few minutes.
  • Diversity: 80 geographic regions, two primary platforms, three client agent types. Content cluster analysis (e.g., k-means, $k=100$) over embeddings demonstrates balanced content diversity.

5. Benchmarks, Evaluation, and Research Tasks

VK-LSVD supports a wide range of recommender system research contexts. Standard baseline performance results on the ur0.01_ir0.01 subset under GTS include:

| Method | Coverage | ROC AUC | NDCG@20 |
|---|---|---|---|
| Random | 0.9645 | 0.5000 | 0.00006 |
| Global Popularity | 0.00010 | 0.5738 | 0.00244 |
| iALS* | 0.00501 | 0.5813 | 0.02623 |
*iALS: Implicit Alternating Least Squares, with positive label if watch_time > 10 s.
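
To make the labeling rule and the ROC AUC column concrete, the sketch below derives binary labels from watch time with the 10-second threshold and computes ROC AUC as the fraction of (positive, negative) pairs that a model scores in the correct order. The function names and toy numbers are hypothetical, and whether the benchmark aggregates AUC globally or per user is not stated here, so this shows the global form only.

```python
def positive_labels(timespent_seconds, threshold=10):
    """Label an interaction positive when watch time exceeds the threshold."""
    return [1 if t > threshold else 0 for t in timespent_seconds]


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of examples, so it
    is only meant for small illustrative inputs.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = positive_labels([3, 12, 40, 8])           # toy watch times in seconds
auc = roc_auc(labels, [0.1, 0.8, 0.4, 0.5])        # toy model scores
print(labels, auc)  # [0, 1, 1, 0] 0.75
```

A model that scores at random lands near 0.5, which matches the Random row of the table.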

Evaluation metrics include Hit Rate@K and NDCG@K:

  • $HR@K = (1/|U|) \sum_{u \in U} \mathbb{I}[v_u^* \in \text{top}_K(u)]$, where $v_u^*$ is the held-out target item for user $u$
  • $NDCG@K = DCG@K / IDCG@K$, with $DCG@K = \sum_{j=1}^{K} \frac{2^{rel_{u,j}} - 1}{\log_2(j+1)}$, where $rel_{u,j}$ is the relevance of the item at rank $j$ for user $u$
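
A minimal reference implementation of both metrics is sketched below, using the standard exponential gain $2^{rel} - 1$ with a $\log_2(j+1)$ discount. The function names are hypothetical, and the ideal DCG is taken over the same list of relevances sorted in decreasing order, which assumes the list covers all relevant candidates.

```python
import math


def hit_rate_at_k(ranked_lists, targets, k):
    """Fraction of users whose held-out target appears in their top-k list."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevances."""
    return sum(
        (2 ** rel - 1) / math.log2(j + 1)
        for j, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


hr = hit_rate_at_k([["a", "b"], ["c", "d"]], ["b", "x"], k=2)
perfect = ndcg_at_k([1, 0, 0], k=3)
second = ndcg_at_k([0, 1, 0], k=3)  # relevant item at rank 2: 1 / log2(3)
print(hr, perfect, round(second, 4))
```

With a single relevant item per user, as in leave-one-out evaluation, NDCG@K reduces to $1/\log_2(j+1)$ when the target sits at rank $j \leq K$ and to 0 otherwise.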

VK-LSVD enables:

  • Sequential recommendation (SASRec, GRU4Rec, BERT4Rec)
  • Cold-start item ranking (e.g., VK RecSys Challenge 2025: top-100 user ranking for new videos, evaluated by NDCG@100)
  • Interest drift modeling (time-evolving user factors)
  • Long-tail and fairness studies via demographic slices

6. Practical Guidance for Research and Best Practices

Recommended preprocessing and analysis strategies:

  • Filter out users with fewer than 5 interactions for model stability.
  • $\ell_2$-normalize video embeddings before passing them to a model.
  • Construct negative samples using popularity-weighted sampling for robust pairwise learning objectives.
  • Group user interaction histories into sessions (e.g., by a $\Delta t$ threshold) for session-based approaches.
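
Two of these steps, the minimum-activity filter and popularity-weighted negative sampling, can be sketched with the standard library. Everything here is illustrative: the function names, the `(user_id, item_id)` tuple layout, and the choice to sample with replacement in proportion to raw interaction counts are assumptions of this example rather than a prescribed recipe.

```python
import random
from collections import Counter


def filter_users(interactions, min_interactions=5):
    """Keep only (user, item) pairs from users with enough interactions."""
    counts = Counter(u for u, _ in interactions)
    return [(u, i) for u, i in interactions if counts[u] >= min_interactions]


def sample_negatives(interactions, user, n, rng):
    """Draw n unseen items for a user, weighted by item popularity.

    Sampling is with replacement; popularity is the raw interaction count.
    """
    popularity = Counter(i for _, i in interactions)
    seen = {i for u, i in interactions if u == user}
    candidates = [i for i in popularity if i not in seen]
    if not candidates:
        return []
    weights = [popularity[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=n)


# Toy data: u1 has five interactions, u2 has two.
interactions = [("u1", f"v{i}") for i in range(5)] + [("u2", "v0"), ("u2", "v9")]
kept = filter_users(interactions)
negatives = sample_negatives(interactions, "u1", n=3, rng=random.Random(0))
print(len(kept), negatives)
```

Recomputing popularity on every call is wasteful at this dataset's scale; a practical version would precompute the counts once and reuse them.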

Key research directions facilitated by VK-LSVD include multi-modal fusion (visual and textual feature integration), context-aware modeling leveraging rich metadata (place, platform, agent), fairness and bias assessment across geo/demographic subgroups, zero-shot/cold-start solutions using only embeddings, and continual learning to address real-world temporal drift in content/user interest.

7. Impact and Legacy in Recommender Systems Research

VK-LSVD has been adopted as the benchmark dataset for VK RecSys Challenge 2025, engaging over 800 teams in production-scale user-ranking tasks under operational constraints (e.g., ≤100 recommendations/user). Its open availability under the Apache 2.0 license has catalyzed new collaborations between academia and industry. VK-LSVD’s scale, heterogeneity, and temporal fidelity advance the state of public benchmarking for sequential, cold-start, and fairness-aware recommender system research (Poslavsky et al., 4 Feb 2026).
