MicroLens-100K: Micro-Video Recommendation Dataset

Updated 14 August 2025
  • MicroLens-100K is a public micro-video recommendation dataset comprising 719,405 user–item interactions between 100,000 users and 19,738 micro-videos, each released with its raw content.
  • It offers comprehensive raw modalities—including titles, cover images, audio, and full-length videos—that enable multimodal modeling and end-to-end learning approaches.
  • Benchmark experiments show that end-to-end video encoder models significantly outperform frozen-feature methods, highlighting the importance of fine-tuning in recommendation tasks.

MicroLens-100K is a large-scale, public micro-video recommendation dataset that serves as a benchmark for content-driven micro-video recommendation and video understanding research. It provides 719,405 user–item interactions between 100,000 users and 19,738 micro-videos, each enriched with raw modality data: titles, cover images, audio tracks, and full-length video files. With high sparsity (99.96%) and extensive modality coverage, MicroLens-100K facilitates multimodal modeling and empirically robust benchmarking in the recommender systems domain, particularly for content-aware and end-to-end video recommendation paradigms (Ni et al., 2023).

1. Dataset Composition and Statistical Properties

MicroLens-100K is a subset of the larger MicroLens dataset, which captures one billion user–item interactions from 34 million users and over 1.1 million micro-videos. As a public research subset sized for academic experimentation, MicroLens-100K consists of the following elements:

  • Users: 100,000 unique user identifiers.
  • Items: 19,738 unique micro-videos.
  • Interactions: 719,405 observed user–item behaviors.
  • Sparsity: 99.96%.
  • Video duration: Average of ~161 seconds per video.
  • Category coverage: 15,580 fine-grained categorical tags, indicating substantial topical and genre diversity.

This high sparsity, combined with broad tag coverage, provides a realistic approximation of the challenges in micro-video recommendation for long-tail and cold-start items, as well as for users with sparse interaction histories.
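
The sparsity figure follows directly from the counts above; a minimal sanity check in Python:

```python
# Sanity-check the reported sparsity from the dataset's summary statistics.
n_users = 100_000
n_items = 19_738
n_interactions = 719_405

# Sparsity = fraction of user-item pairs with no observed interaction.
sparsity = 1.0 - n_interactions / (n_users * n_items)
print(f"{sparsity:.4%}")  # -> 99.9636%, matching the reported 99.96%
```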

2. Raw Content Modalities

Each item in MicroLens-100K is accompanied by a suite of primary content modalities:

  • Text modality: The original video title.
  • Visual modality: The cover image, provided as a raw visual artifact.
  • Audio modality: The full audio waveform corresponding to the video.
  • Video modality: The complete video file.

This level of content accessibility differs from prior public datasets, which often provide only anonymized IDs or pre-extracted features such as thumbnails. The provision of raw modalities allows researchers to pursue multimodal modeling strategies, including joint or sequential feature extraction, modality fusion, and direct end-to-end optimization over video, image, audio, and text signals.
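
To make this concrete, the sketch below models one item's raw content bundle. The class and file layout are hypothetical illustrations of the four modalities, not the dataset's actual release schema:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class MicroVideoItem:
    """One micro-video with its four raw modalities (hypothetical schema)."""
    item_id: int
    title: str          # text modality: the original video title
    cover_path: Path    # visual modality: raw cover image
    audio_path: Path    # audio modality: full audio waveform
    video_path: Path    # video modality: complete video file

    def load_cover(self):
        # Decode pixels lazily, only when a model actually consumes them.
        from PIL import Image
        return Image.open(self.cover_path).convert("RGB")
```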

The dataset thus supports a broad array of learning paradigms, ranging from classical collaborative filtering (CF) to content-based retrieval, and notably enables video encoders to operate directly from pixel and waveform inputs, integrating collaborative and content semantics in a unified embedding space.
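
The structural difference between an ID-based item tower and an end-to-end video item tower can be sketched as follows; the encoder, dimensions, and projection head here are placeholder assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class IDItemTower(nn.Module):
    """IDRec-style item representation: a learned lookup table over item IDs."""
    def __init__(self, n_items: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(n_items, dim)

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        return self.emb(item_ids)

class VideoItemTower(nn.Module):
    """VideoRec-style item representation: embeddings computed from raw frames."""
    def __init__(self, video_encoder: nn.Module, enc_dim: int, dim: int):
        super().__init__()
        self.encoder = video_encoder         # e.g., a SlowFast/VideoMAE backbone
        self.proj = nn.Linear(enc_dim, dim)  # map encoder output into the rec space

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, channels, time, height, width) raw pixel tensor
        return self.proj(self.encoder(frames))
```

Because the video tower's parameters receive gradients from the recommendation loss, content semantics and collaborative signals are learned jointly rather than fused post hoc.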

3. Modeling Benchmarks and Empirical Findings

Comprehensive benchmarking on MicroLens-100K is established for three major model families:

| Category | Description | Example Methods |
|----------|-------------|-----------------|
| IDRec | Uses only item IDs (classical and sequential CF) | DSSM, LightGCN, DeepFM, GRU4Rec, SASRec |
| VIDRec | Fuses pre-extracted (frozen) video features with ID embeddings | ID embedding + frozen feature extractor, NextItNet |
| VideoRec | Replaces the item ID with a learnable, end-to-end video encoder | SlowFast, VideoMAE, MViT (with TopT fine-tuning) |

Key empirical results include:

  • Within IDRec models, sequential neural models (e.g., SASRec, which employs a Transformer architecture) surpass classical methods (LightGCN, DeepFM) by >10% in HR@10 and NDCG@10; these metrics are computed as in the sketch following this list.
  • VIDRec does not consistently outperform IDRec models when user–item interaction data is abundant, implying that ID embeddings can capture collaborative and some content signals without explicit video features.
  • VideoRec models, where the item embedding is wholly replaced by a trainable video encoder, yield substantial performance improvements: e.g., retraining SlowFast or NextItNet encoders in end-to-end setups nearly doubles HR and NDCG metrics relative to their frozen-feature counterparts.
  • Experiments reveal that state-of-the-art video encoders (VideoMAE, SlowFast, MViT) pre-trained on classification benchmarks do not yield representations universally optimal for recommendation tasks; fine-tuning, particularly the "TopT" strategy (training only the top T layers), is critical in aligning video semantics with user preference distributions.
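
For reference, under the leave-one-out protocol used here each user has exactly one held-out item, so HR@K and NDCG@K reduce to simple functions of that item's rank; a minimal sketch:

```python
import math

def hr_and_ndcg_at_k(rank: int, k: int = 10) -> tuple[float, float]:
    """Per-user HR@K and NDCG@K given the 1-based rank of the held-out item.

    With a single relevant item per user, HR@K is a hit indicator and
    NDCG@K collapses to 1 / log2(rank + 1) when the item ranks in the top K.
    """
    if rank > k:
        return 0.0, 0.0
    return 1.0, 1.0 / math.log2(rank + 1)

# Toy example: ranks of four users' held-out items; averages give HR@10 / NDCG@10.
ranks = [1, 3, 12, 7]
hits, gains = zip(*(hr_and_ndcg_at_k(r) for r in ranks))
print(sum(hits) / len(ranks), sum(gains) / len(ranks))
```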

4. Use Cases and Research Applications

The structure and richness of MicroLens-100K enable diverse research and industry applications:

  • Entertainment: The dataset is well suited to recommending short-form content on entertainment platforms, where user engagement relies on both collaborative filtering and nuanced content understanding.
  • Advertising: With fine-grained, multi-signal inputs (visual, audio, textual), the dataset supports research into personalized advertisement recommendation based on the intrinsic content and style of micro-videos.
  • E-Commerce: The multimodal nature enables robust recommendation for product discovery scenarios leveraging video demonstrations, permitting the integration of behavioral and visual merchandising cues.

These use cases also foreground the importance of cross-modal fusion techniques, as the dataset’s breadth of tags and content modalities supports methods that generalize across varying recommendation contexts and domain shifts. The data’s raw modality provision further positions it as a foundation for pre-training domain-adaptive video encoders for recommender scenarios.

5. Access, Usage, and Technical Framework

MicroLens-100K, alongside accompanying training code, is publicly available at https://github.com/westlake-repl/MicroLens. Technical appendices of the paper provide additional implementation details, including data preprocessing, the in-batch softmax loss formulation, and the leave-one-out evaluation protocol.

Key technical parameters for baseline experiments include:

  • Hyper-parameter grids for IDRec: learning rates in $\{1\times10^{-5}, 5\times10^{-5}, 1\times10^{-4}, 5\times10^{-4}, 1\times10^{-3}\}$; embedding sizes in $\{64, 128, 256, 512, 1024, 2048, 4096\}$.
  • For VideoRec (video encoder-based methods), training is restricted to the top $T$ layers (the "TopT" approach) due to computational constraints.
  • Benchmarked losses include in-batch softmax, and evaluation follows the leave-one-out protocol standard for implicit-feedback tasks; a sketch of the loss and of TopT-style layer freezing follows this list.
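
As a rough illustration of these two ingredients, the sketch below implements an in-batch softmax loss (each positive pair scored against the other in-batch items as negatives) and a TopT-style freezing helper; the temperature term and helper signature are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb: torch.Tensor,
                          item_emb: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """user_emb, item_emb: (batch, dim), where row i is a positive pair.

    Every other item in the batch acts as a negative for user i, so the
    target is the diagonal of the batch-wise similarity matrix.
    """
    logits = user_emb @ item_emb.T / temperature  # (batch, batch) scores
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

def freeze_all_but_top(encoder: torch.nn.Module,
                       top_blocks: list[torch.nn.Module]) -> None:
    """TopT-style partial fine-tuning: train only the encoder's top blocks."""
    for p in encoder.parameters():
        p.requires_grad = False
    for block in top_blocks:
        for p in block.parameters():
            p.requires_grad = True
```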

Researchers are directed to the technical appendices of the original paper for the detailed mathematical formulations of the loss functions and experimental procedures.

6. Research Implications and Outlook

MicroLens-100K marks a significant methodological advance in public micro-video recommendation datasets due to its scale, content richness, and modal diversity. Its design encourages the development and empirical evaluation of next-generation models that transcend traditional ID-based recommendation, foregrounding the importance of raw content ingestion and cross-modal signal integration.

By unifying collaborative filtering, content modeling, and video understanding within a common data and evaluation framework, MicroLens-100K enables meaningful progress on open challenges such as content cold start, cross-modal alignment, and the transferability of classification-based video encoders to user preference modeling.

This dataset provides a critical resource for exploring intersections of recommendation and video understanding research, serving as both a benchmark and a foundational pre-training corpus for subsequent advances in multimodal recommender system development (Ni et al., 2023).

References

  • Ni, Y., et al. (2023). A Content-Driven Micro-Video Recommendation Dataset at Scale. arXiv:2309.15379.