
MicroLens: Multimodal Micro-Video Dataset

Updated 7 January 2026
  • MicroLens is a multimodal micro-video dataset containing extensive user interactions and rich metadata across text, image, audio, and video modalities.
  • It employs a leave-one-out, timestamp-based split per user with a maximum history of 13 interactions to benchmark sequential recommendation models.
  • Baseline experiments show that end-to-end video representation learning can significantly enhance recommendation accuracy over traditional ID-based methods.

MicroLens is a large-scale, multimodal micro-video recommendation dataset specifically constructed to catalyze research in content-driven micro-video recommendation systems. It provides extensive user–item interaction logs, rich video metadata, and raw media modalities for each item, facilitating benchmarking and exploration of advanced recommendation models beyond traditional ID-based collaborative filtering. MicroLens is publicly available along with code for baseline experiments and access instructions (Ni et al., 2023).

1. Dataset Composition and Scope

MicroLens comprises interaction events between a vast population of users and a corpus of micro-videos. The full dataset contains 34,492,051 unique users, 1,142,528 micro-videos, and 1,006,528,709 user–video interactions, yielding a sparsity of approximately 99.997%. Two research-friendly subsets are provided:

| Subset | Users | Videos | Interactions | Sparsity |
|---|---|---|---|---|
| MicroLens-100K | 100,000 | 19,738 | 719,405 | 99.96% |
| MicroLens-1M | 1,000,000 | 91,402 | 9,095,620 | 99.99% |

All interactions are derived from publicly visible user comments. For recommendation experiments, leave-one-out timestamp-based splits are applied per user: the latest comment forms the test set, the penultimate comment forms validation, and the remainder constitutes training data. Maximum user sequence length is capped at 13 (≥ 95% coverage).
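A minimal sketch of this per-user leave-one-out split, assuming interactions are available as a pandas DataFrame with user_id, item_id, and timestamp columns (column names are illustrative, not the released file's exact schema):

```python
import pandas as pd

MAX_HISTORY = 13  # per-user sequence cap reported for MicroLens

def leave_one_out_split(interactions: pd.DataFrame):
    """Split each user's chronologically sorted interactions into
    train / validation / test, keeping at most MAX_HISTORY events."""
    train, valid, test = [], [], []
    # Sort so each user's most recent comment comes last.
    interactions = interactions.sort_values(["user_id", "timestamp"])
    for _, seq in interactions.groupby("user_id"):
        seq = seq.tail(MAX_HISTORY)          # cap the history length
        if len(seq) < 3:                     # too short to hold out valid + test
            train.append(seq)
            continue
        test.append(seq.iloc[[-1]])          # latest comment -> test
        valid.append(seq.iloc[[-2]])         # penultimate comment -> validation
        train.append(seq.iloc[:-2])          # remainder -> training
    return pd.concat(train), pd.concat(valid), pd.concat(test)
```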

2. Modalities and Metadata

Each micro-video in MicroLens includes the following modalities and metadata:

  • Text: Human-written titles, each at least 3 characters after cleaning.
  • Image: Raw cover image (JPEG/PNG), retained only if no single uniform color covers ≥ 75% of the image.
  • Audio: Native audio track (AAC/MP3, 44.1 kHz).
  • Video: Full-length micro-video (MP4/H.264), files ≥ 100 KB.
  • Metadata: View count, like count, video duration (∼100–400 seconds), tag IDs (fine-grained categories), and user gender.
  • User Comments: Up to 5,000 per video, storing anonymized user_id, timestamp, and comment_text.

All IDs are anonymized by hashing to protect user privacy, and platform licensing is respected by referencing media through URLs rather than distributing files directly.

3. Data Acquisition and Preprocessing

The dataset was assembled via a two-stage crawl. Seed videos were identified via a home-page crawl (June 2022–June 2023) and filtered for high engagement (≥ 10,000 likes, typically ≥ 100 comments). For each seed, ten randomly chosen related videos were added, forming a candidate pool of ∼5 million items. Public comment pages were scraped for up to 5,000 comments per video, with duplicate entries from the same user removed.

Modalities underwent several filters:

  • Titles with < 3 characters were excluded.
  • Cover images dominated by uniform color (≥ 75%) were excluded.
  • Video files < 100 KB were excluded.
  • URLs and metadata were de-duplicated.
  • All user and item identifiers were hashed.
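The item-level filters above can be expressed as simple predicates. The sketch below assumes per-item title, cover-image, and video-file paths; the Pillow-based dominant-color check is one plausible way to measure the uniform-color fraction, not necessarily the procedure used by the authors:

```python
import os
from collections import Counter
from PIL import Image  # Pillow

MIN_TITLE_CHARS = 3
MAX_UNIFORM_COLOR = 0.75      # a single color covering >= 75% of the cover -> drop
MIN_VIDEO_BYTES = 100 * 1024  # 100 KB

def dominant_color_fraction(image_path: str) -> float:
    """Fraction of pixels occupied by the single most frequent color."""
    with Image.open(image_path) as im:
        pixels = list(im.convert("RGB").getdata())
    most_common_count = Counter(pixels).most_common(1)[0][1]
    return most_common_count / len(pixels)

def keep_item(title: str, cover_path: str, video_path: str) -> bool:
    """Apply MicroLens-style filters as described above."""
    if len(title.strip()) < MIN_TITLE_CHARS:
        return False
    if dominant_color_fraction(cover_path) >= MAX_UNIFORM_COLOR:
        return False
    if os.path.getsize(video_path) < MIN_VIDEO_BYTES:
        return False
    return True
```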

Splits are provided for the 100K and 1M subsets, as outlined above. Licensing compliance is maintained by distributing media references through a Python URL resolver, analogous to the YouTube-8M procedure, rather than redistributing the files themselves.

4. Data Structure, Formatting, and Access

Interaction records are provided as CSV/TSV files with the following schema:

| Field | Type | Description |
|---|---|---|
| user_id | int | Hash-anonymized unique user identifier |
| item_id | int | Hash-anonymized unique video identifier |
| timestamp | int | UNIX epoch (seconds) |
| comment_text | string | UTF-8 encoded user comment |

Additional fields (user_gender, session_id) are provided as optional columns.
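A hedged example of reading the interaction file with pandas and assembling per-user item sequences for sequential models; the file name and delimiter are illustrative and may differ from the released dumps:

```python
import pandas as pd

# Hypothetical file name; check the repository for the actual layout.
interactions = pd.read_csv(
    "MicroLens-100K_pairs.csv",
    usecols=["user_id", "item_id", "timestamp"],
)

# Chronological item sequence per user, as consumed by GRU4Rec/SASRec-style models.
sequences = (
    interactions.sort_values("timestamp")
    .groupby("user_id")["item_id"]
    .apply(list)
)
print(sequences.head())
```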

Item metadata is stored as JSON or CSV with the following attributes:

| Attribute | Format | Description |
|---|---|---|
| item_id | int | Hash-anonymized video identifier |
| title | string | Video title |
| cover_image_url | string | JPEG/PNG image URL |
| video_url | string | MP4/H.264 video URL |
| audio_url | string | AAC/MP3 audio URL |
| view_count, like_count | int | Engagement metrics |
| duration | int | Duration in seconds |
| tags | list | Fine-grained category IDs |

Multimedia files retain native platform resolutions: images (typically 720×1280 or 480×270 JPEG/PNG), audio tracks (AAC/MP3, mono/stereo), and full videos (240p–720p MP4).

The dataset and supporting scripts are hosted on GitHub: https://github.com/westlake-repl/MicroLens. Download tools include a Python script for resolving and fetching media.
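Because media are referenced by URL rather than redistributed, a small downloader can walk the metadata file and fetch, for example, cover images. The sketch below is an illustrative stand-in for the repository's own tooling; the metadata file name and field names are assumptions:

```python
import json
import pathlib
import requests

OUT_DIR = pathlib.Path("covers")
OUT_DIR.mkdir(exist_ok=True)

# Hypothetical metadata file; the GitHub repository ships its own resolver script.
with open("item_metadata.json", encoding="utf-8") as f:
    items = json.load(f)

for item in items:
    url = item["cover_image_url"]
    dest = OUT_DIR / f"{item['item_id']}.jpg"
    if dest.exists():
        continue  # skip files already downloaded
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    dest.write_bytes(resp.content)
```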

5. Baseline Models and Benchmarking Protocol

MicroLens enables the evaluation of both collaborative filtering and content-driven recommendation approaches with the following baseline configurations:

  • IDRec (ID-only):
    • Collaborative filtering: DSSM, LightGCN, NeuralFM, DeepFM.
    • Sequential modeling: GRU4Rec, NextItNet, SASRec.
  • VIDRec (ID plus frozen video features):
    • Baselines: YouTube (ID), YouTube (ID+V), MMGCN, GRCN, DSSM+V, SASRec+V.
    • Video features extracted from pre-trained VideoMAE.
  • VideoRec (end-to-end video representation learning):
    • Models: NextItNet₍V₎, GRU4Rec₍V₎, SASRec₍V₎, wherein the video encoder replaces the item embedding and is trained jointly.

Video encoders evaluated:

  • CNN-based: R3D-r18, R3D-r50, C2D-r50, I3D-r50, CSN-r101, Slow-r50, SlowFast (r50, r101).
  • Efficient nets: X3D (XS/S/M/L).
  • Transformer-based: MViT-B-16×4, MViT-B-32×3, VideoMAE (ViT-B), all pre-trained on Kinetics.
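Conceptually, VideoRec replaces the item-embedding table of a sequential recommender with a (partially fine-tuned) video encoder. The simplified PyTorch sketch below illustrates this wiring with a SASRec-style Transformer; the placeholder encoder, dimensions, and layer counts are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class VideoRec(nn.Module):
    """End-to-end sketch: item vectors come from a video encoder instead of an
    ID embedding table, and the sequence model is Transformer-based."""

    def __init__(self, video_encoder: nn.Module, dim: int = 512, max_len: int = 13):
        super().__init__()
        self.video_encoder = video_encoder            # e.g. a Kinetics-pretrained backbone
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.seq_model = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, L, C, T, H, W) - a clip per interacted item
        b, l = frames.shape[:2]
        item_vecs = self.video_encoder(frames.flatten(0, 1)).view(b, l, -1)
        pos = self.pos_emb(torch.arange(l, device=frames.device))
        hidden = self.seq_model(item_vecs + pos)      # (B, L, dim)
        return hidden[:, -1]                          # user representation = last step
```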

Training uses an in-batch (sampled) softmax loss. Learning rates are searched in {1e-5, …, 1e-3} for IDRec and set to 1e-4 for VideoRec (top encoder blocks fine-tuned at 1e-4); batch sizes are 64–512 for IDRec and 120 for VideoRec. Optimization uses AdamW with weight decay searched in {0, …, 0.1} and dropout of 0.1.
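A minimal sketch of the in-batch (sampled) softmax objective, where every other positive item in the batch serves as a negative for a given user; tensor shapes and the optimizer settings in the comment are illustrative:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_vecs: torch.Tensor, item_vecs: torch.Tensor) -> torch.Tensor:
    """user_vecs, item_vecs: (B, D); row i of item_vecs is user i's positive item,
    and all other rows act as in-batch negatives."""
    logits = user_vecs @ item_vecs.t()                         # (B, B) similarity matrix
    labels = torch.arange(user_vecs.size(0), device=user_vecs.device)
    return F.cross_entropy(logits, labels)

# Illustrative optimizer setup mirroring the reported configuration:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
```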

Evaluation is conducted per test user, ranking 1 positive target among 100 negatives, with metrics at K=10 and K=20:

  • Hit Ratio@K: $\mathrm{HR@K} = \mathbb{I}\bigl(\mathrm{rank}(v^*) \le K\bigr)$
  • Recall@K: $\mathrm{Recall@K} = \frac{1}{|G|} \sum_{i=1}^{K} \mathbb{I}(r_i \in G)$
  • Precision@K: $\mathrm{P@K} = \frac{1}{K}\sum_{i=1}^{K} \mathbb{I}(r_i \in G)$
  • NDCG@K: $\mathrm{NDCG@K} = \frac{1}{Z}\sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i}-1}{\log_2(i+1)}$, with $\mathrm{rel}_i \in \{0,1\}$ and $Z$ the ideal DCG.
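With a single held-out positive ranked against 100 sampled negatives, HR@K and NDCG@K reduce to simple functions of the positive item's rank (the ideal DCG is 1). A sketch, assuming ties are broken in the positive's favor:

```python
import numpy as np

def hr_and_ndcg_at_k(scores: np.ndarray, k: int = 10):
    """scores: model scores for 101 candidates, index 0 = held-out positive,
    indices 1..100 = sampled negatives. Returns (HR@K, NDCG@K) for one user."""
    rank = int((scores > scores[0]).sum()) + 1     # 1-based rank of the positive
    hr = 1.0 if rank <= k else 0.0
    ndcg = 1.0 / np.log2(rank + 1) if rank <= k else 0.0
    return hr, ndcg
```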

6. Principal Findings and Recommendations

Benchmarks reveal several modality and algorithmic trends:

  • Sequential IDRec (SASRec) achieves >10% improvement over collaborative filtering IDRec.
  • Adding frozen video features (VIDRec) provides negligible or negative impact in warm-item settings.
  • VideoRec (end-to-end) outperforms both IDRec and VIDRec, achieving up to 2× NDCG improvement compared to frozen features.
  • Pre-training on Kinetics enhances fine-tuned video encoder performance in HR@10 by ∼1.5% over random initialization; full fine-tuning yields minor additional gains.
  • Top classification accuracy of video encoders (e.g., VideoMAE) does not guarantee leading recommendation accuracy.
  • The recommended configuration is SASRec-style VideoRec: a Transformer backbone with joint training of the video encoder's top layers.
  • User history should be limited to the most recent 13 comment interactions.
  • Efficient encoding: sample 5 contiguous frames from the middle section of each video (see the sketch after this list).
  • Leave-one-out split per user best reflects sequential recommendation settings.
  • Fine-tuning only the top encoder blocks suffices, mitigating catastrophic forgetting.
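A hedged sketch of the middle-section frame sampling mentioned above; OpenCV is used here for decoding, though the choice of decoder is an assumption rather than part of the dataset's tooling:

```python
import cv2
import numpy as np

def sample_middle_frames(video_path: str, num_frames: int = 5) -> np.ndarray:
    """Return `num_frames` contiguous frames from the middle of the video,
    as an array of shape (num_frames, H, W, 3) in RGB order."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    start = max((total - num_frames) // 2, 0)          # center the sampling window
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)
    frames = []
    for _ in range(num_frames):
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)
```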

7. Prospective Extensions and Applications

MicroLens’s structure facilitates several research avenues:

  • Direct modeling of audio via spectrogram encoding for improved recommendation.
  • Cross-attention or advanced fusion across text, image, audio, and video for multimodal recommendation.
  • Self-supervised pre-training on user–video interactions to initialize video encoders.
  • Development of foundation models for video-user interaction (e.g., multimodal BERT/GPT).
  • Investigation of cold-start and transfer learning for novel domains or platforms.
  • Contrastive learning for alignment of user–video or video–video embeddings.

A plausible implication is that MicroLens will underpin methodological advances in scalable, multimodal, content-driven micro-video recommendation, serving as a benchmark for both current and future recommendation system architectures (Ni et al., 2023).

References

Ni et al. (2023). A Content-Driven Micro-Video Recommendation Dataset at Scale. arXiv:2309.15379.
