VK-LSVD: Large-Scale Short-Video Dataset

Updated 5 February 2026
  • The dataset captures over 40 billion user-video interactions across 27 weeks, enabling robust analysis of sequential recommendation and dynamic user behavior.
  • It provides rich feedback signals and metadata such as timespent, likes, and contextual fields like platform and geo, which facilitate research in fairness and cold-start problems.
  • The Global Temporal Split protocol and SVD-based content embeddings standardize evaluation, making VK-LSVD a key benchmark for the VK RecSys Challenge 2025.

The VK Large Short-Video Dataset (VK-LSVD) is a large-scale, public dataset designed to support research in short-video recommendation modeling under industrial conditions. Comprising over 40 billion interactions from 10 million users across nearly 20 million videos over a six-month period, VK-LSVD provides a unique resource for the recommendation systems community, enabling research in sequential modeling, cold-start problems, fairness, and dynamic user interest analysis. Its construction, content, and role as the official dataset for VK RecSys Challenge 2025 establish it as a principal benchmark for industrial recommender system research (Poslavsky et al., 4 Feb 2026).

1. Composition, Scale, and Temporal Layout

VK-LSVD captures explicit user–video interaction data at an unprecedented scale. The dataset includes:

  • Users: 10,000,000 distinct anonymized user_id values.
  • Videos: 19,627,601 unique item_id values.
  • User–Video Events: 40,774,024,903 exposure interactions recorded sequentially over 27 weeks, averaging roughly 227 million events per day.
  • Total Watch Time: 858,160,100,084 seconds aggregated over all impressions.

The temporal structure is implemented through a Global Temporal Split (GTS), where each week of data is stored as a separate Parquet file (week_01.parquet, ..., week_27.parquet). The dataset partitions for evaluation are defined as:

  • Train: First 25 weeks
  • Validation: Week 26
  • Test: Week 27

All events within weekly files are chronologically ordered, preserving temporal integrity essential for sequential modeling and temporal generalization research.
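The weekly layout above makes the GTS mechanical to implement. The following sketch maps week indices to partitions; the file-name pattern follows the description above, while the commented pandas load step is an assumption about the reader's tooling.

```python
# Sketch: assemble the Global Temporal Split (GTS) from the weekly files.
TRAIN_WEEKS = range(1, 26)   # weeks 1-25
VAL_WEEK, TEST_WEEK = 26, 27

def week_file(week: int) -> str:
    """File name for a given week, e.g. week_03.parquet."""
    return f"week_{week:02d}.parquet"

def gts_partition(week: int) -> str:
    """Map a week index to its GTS partition."""
    if week in TRAIN_WEEKS:
        return "train"
    if week == VAL_WEEK:
        return "validation"
    if week == TEST_WEEK:
        return "test"
    raise ValueError(f"week {week} is outside the 27-week range")

# e.g. load all training interactions (requires pandas + pyarrow):
# import pandas as pd
# train = pd.concat(pd.read_parquet(week_file(w)) for w in TRAIN_WEEKS)
```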

2. Feedback Signals and Metadata

VK-LSVD records a diverse set of feedback signals and contextual metadata reflecting real-world platform dynamics:

Major Feedback Signals (with event counts where applicable)

  • timespent (watch time, capped at 255 s)
  • like: 1,171,423,458
  • dislike: 11,860,138
  • share: 262,734,328
  • bookmark: 40,124,463
  • click_on_author: 84,632,666
  • open_comments: 481,251,593

Contextual Metadata Fields

  • place (uint8): feed vs. search context
  • platform (uint8): e.g., Android, Web, iOS
  • agent (uint8): client user-agent category
  • geo (user metadata): one of 80 geographic regions
  • timestamps: implicit—epoch time can be reconstructed via file name and row order

This rich set of signals supports nuanced research into user preferences, session dynamics, cross-platform behavior, and geolocalized phenomena.

3. File Formats, Schema, and Content Embeddings

The dataset is organized for efficient access and scalability in research workflows. Core elements include:

  • Interactions (27 weekly Parquet files): user_id, item_id, place, platform, agent, timespent, like, dislike, share, bookmark, click_on_author, open_comments
  • User metadata (users_metadata.parquet): user_id, age, gender, geo, train_interactions_rank
  • Item metadata (items_metadata.parquet): item_id, author_id, duration, train_interactions_rank
  • Item embeddings (item_embeddings.npz): item_id, embedding (float16[64])
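A minimal sketch of reading the embedding file, using a small stand-in written on the fly; the array key names follow the schema above, but the exact keys in the released file should be verified via `data.files`.

```python
import numpy as np

# Toy stand-in for item_embeddings.npz (real file: ~19.6M items, float16[64])
rng = np.random.default_rng(0)
np.savez(
    "toy_item_embeddings.npz",
    item_id=np.arange(5, dtype=np.int64),
    embedding=rng.standard_normal((5, 64)).astype(np.float16),
)

data = np.load("toy_item_embeddings.npz")
item_ids = data["item_id"]
E = data["embedding"].astype(np.float32)  # upcast from float16 for modeling
```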

Content Embeddings: Each video is assigned a 64-dimensional compressed embedding obtained via truncated SVD of the original multi-modal (visual and textual) feature matrix F. The item embedding matrix E ∈ ℝ^(N_items × 64) is computed as:

  • [U, Σ, V] = SVD(F)
  • E = U[:, 1:64] · Σ[1:64, 1:64]

Embeddings are stored as float16 and ordered by singular-value importance, supporting ℓ2-normalization for modeling as well as further projection (e.g., PCA, UMAP) for visualization or dimensionality reduction.

Text embeddings (e_t = f_T(t)) may be used for fusion with video embeddings, if provided externally.
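The construction above can be reproduced on a toy feature matrix; the real multi-modal F is not distributed (only E is), so the matrix below is purely illustrative.

```python
import numpy as np

# Truncated SVD sketch: keep the top-64 singular directions of a toy F
rng = np.random.default_rng(0)
F = rng.standard_normal((500, 256))        # toy stand-in for item features
k = 64

U, s, Vt = np.linalg.svd(F, full_matrices=False)
E = (U[:, :k] * s[:k]).astype(np.float16)  # E = U[:, :k] @ diag(s[:k])

# ℓ2-normalize rows before cosine-similarity use, as suggested above
E32 = E.astype(np.float32)
E_norm = E32 / np.linalg.norm(E32, axis=1, keepdims=True)
```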

4. Data Statistics, Diversity, and Quality Assessment

The dataset exhibits industrial-scale sparsity and long-tail effects:

  • Density: approximately 0.0208% (total interactions divided by possible user–item pairs), indicating high sparsity.
  • User Activity: mean ≈ 4,077 events/user with a pronounced heavy tail; many users have low activity, while a small fraction are highly engaged.
  • Item Popularity: the distribution follows a power law; the top 1% of videos receive roughly 50% of all impressions.
  • Session Dynamics: the average inter-event time Δt is on the order of minutes; session segmentation via Δt ≤ 30 minutes is common.
  • Content and Interaction Diversity: 80 geographic regions, 2 platforms, 3 agent categories. K-means clustering of embeddings (e.g., k = 100) yields balanced content clusters, indicating substantive diversity.
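The 30-minute session heuristic mentioned above can be sketched as a simple gap-based segmentation; the function and threshold are illustrative, not part of the dataset's tooling.

```python
def segment_sessions(timestamps, gap_seconds=30 * 60):
    """Split a user's chronologically ordered event timestamps into sessions,
    starting a new session whenever the inter-event gap exceeds the threshold
    (the Δt <= 30 min heuristic)."""
    sessions, current = [], []
    prev = None
    for t in timestamps:
        if prev is not None and t - prev > gap_seconds:
            sessions.append(current)
            current = []
        current.append(t)
        prev = t
    if current:
        sessions.append(current)
    return sessions
```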

5. Benchmark Protocols, Evaluation, and Industrial Impact

VK-LSVD is integral to reproducible benchmarking for recommender systems. The recommended evaluation protocols leverage the GTS for rigorous, temporally-ordered training and testing. For sequential recommendation:

  • Chronologically order each user's interactions.
  • Split per GTS: train/val/test.
  • Predict each user's next item given prior history, applying leave-one-out evaluation.
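The leave-one-out step above amounts to holding out each user's final interaction; a minimal sketch:

```python
def leave_one_out(history):
    """Leave-one-out split for sequential evaluation: all but the last
    chronologically ordered item form the input; the last item is the
    prediction target."""
    if len(history) < 2:
        raise ValueError("need at least two interactions per user")
    return history[:-1], history[-1]
```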

Baseline Methods on ur0.01_ir0.01 Subset (GTS split):

Method                    Coverage   ROC AUC   NDCG@20
Random                    0.9645     0.5000    0.00006
Global Popularity         0.00010    0.5738    0.00244
iALS (threshold >10 s)    0.00501    0.5813    0.02623

Evaluation metrics include Hit Rate@K (HR@K) and NDCG@K as:

  • HR@K = (1/|U|) · Σ_{u∈U} 1[v*_u ∈ top-K(u)]
  • NDCG@K = DCG@K / IDCG@K, with DCG@K = Σ_{j=1}^{K} (2^{rel_{u,j}} − 1) / log₂(j + 1)
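For leave-one-out evaluation with a single held-out item, these metrics reduce to the simple per-user forms below (binary relevance, so IDCG@K = 1); averaging over users gives the reported values.

```python
import math

def hit_rate_at_k(target, ranked, k=20):
    """HR@K for one user: 1 if the held-out item appears in the top-K."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(target, ranked, k=20):
    """Binary-relevance NDCG@K for one user with a single relevant item:
    IDCG@K = 1, so NDCG@K = 1/log2(rank + 1) at the hit position, else 0."""
    for j, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(j + 1)
    return 0.0
```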

VK-LSVD serves as the official dataset for the VK RecSys Challenge 2025, which attracted approximately 800 teams, and is openly available under an Apache 2.0 license.

6. Practical Considerations and Research Applications

The scale and design of VK-LSVD support a range of research strategies:

  • Preprocessing: remove users with fewer than five interactions for statistical stability; ℓ2-normalize item embeddings; sample negatives with popularity weighting; segment user sessions using Δt heuristics.
  • Modeling Directions:
    • Multi-modal fusion (visual and textual embeddings)
    • Context-aware recommendation (incorporating place/platform/agent)
    • Fairness and demographic bias analysis (via geo, gender, age)
    • Cold-start and zero-shot ranking (embedding-driven, metadata-only situations)
    • Continual learning (dynamic adaptation to shifting popularity/interests)
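The popularity-weighted negative sampling mentioned among the preprocessing strategies can be sketched as follows; the function, its signature, and the toy impression counts are illustrative assumptions, not part of the dataset's tooling.

```python
import numpy as np

def sample_negatives(item_counts, positives, n, rng):
    """Draw n distinct negative item indices, weighted by impression counts,
    skipping items the user has already interacted with (positives)."""
    p = item_counts / item_counts.sum()
    neg = []
    while len(neg) < n:
        for c in rng.choice(len(item_counts), size=n, p=p):
            if c not in positives and c not in neg:
                neg.append(int(c))
                if len(neg) == n:
                    break
    return neg

rng = np.random.default_rng(0)
counts = np.array([50.0, 30.0, 10.0, 5.0, 5.0])  # toy impression counts
negatives = sample_negatives(counts, positives={0}, n=3, rng=rng)
```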

Significant research tasks fostered by VK-LSVD include sequential recommendation (e.g., SASRec, GRU4Rec, BERT4Rec), cold-start ranking (top-100 users for new videos), modeling of user interest drift, and long-tail recommendation with fairness constraints.

7. Significance in Recommender Systems Research

VK-LSVD’s integration of vast user interaction logs, multi-modal embeddings, real-world temporal structure, and rich contextual signals facilitates benchmarking models under conditions reflective of contemporary industrial platforms. Its open release and utility for high-profile challenges have accelerated collaboration between academic and industrial communities. The dataset supports evaluation of scalable, time-aware, robust sequential and context-rich recommender algorithms and is positioned as a standard resource for research on dynamic, large-scale, multimedia-driven recommendation scenarios (Poslavsky et al., 4 Feb 2026).
