A Large-scale Dataset with Behavior, Attributes, and Content of Mobile Short-video Platform

Published 9 Feb 2025 in cs.MM | (2502.05922v1)

Abstract: Short-video platforms show an increasing impact on people's daily lives nowadays, with billions of active users spending plenty of time each day. The interactions between users and online platforms give rise to many scientific problems across computational social science and artificial intelligence. However, despite the rapid development of short-video platforms, currently there are serious shortcomings in existing relevant datasets on three aspects: inadequate user-video feedback, limited user attributes and lack of video content. To address these problems, we provide a large-scale dataset with rich user behavior, attributes and video content from a real mobile short-video platform. This dataset covers 10,000 voluntary users and 153,561 videos, and we conduct four-fold technical validations of the dataset. First, we verify the richness of the behavior and attribute data. Second, we confirm the representing ability of the content features. Third, we provide benchmarking results on recommendation algorithms with our dataset. Finally, we explore the filter bubble phenomenon on the platform using the dataset. We believe the dataset could support the broad research community, including but not limited to user modeling, social science, human behavior understanding, etc. The dataset and code is available at https://github.com/tsinghua-fib-lab/ShortVideo_dataset.


Summary

The paper presents a large-scale dataset capturing the behavior of 10,000 users on 153,561 videos, together with extensive user and video attributes.
Data were collected through a proxy agent on participants' devices, and t-SNE visualizations are used to check that the video content features cluster semantically.
Benchmark results support the value of multimodal recommendation models; the paper also examines filter bubble effects and describes how the data were handled ethically.

Understanding the Large-scale Dataset for Mobile Short-video Platforms

The paper "A Large-scale Dataset with Behavior, Attributes, and Content of Mobile Short-video Platform" (2502.05922) introduces a comprehensive dataset aimed at facilitating advanced research in computational social science and AI within the domain of mobile short-video platforms. This dataset, including user behavior, attributes, and video content, addresses existing gaps in publicly available data, offering rich insights for various research directions.

Dataset Overview

The dataset comprises interactions between 10,000 voluntary users and 153,561 videos, providing a solid basis for studying user behavior and video characteristics on short-video platforms. It integrates user attributes and video attributes alongside the behavior logs, which widens the scope of analysis compared with prior datasets. The collection procedure was designed with privacy in mind to ensure ethical handling of user data.

Figure 1: The illustration of user interface and behaviors on the platform (a) and an overview of the dataset (b).

The data acquisition involved installing a proxy agent on user devices to record interactions comprehensively, accumulating diverse preference signals such as explicit feedback (likes, comments, follows) and implicit feedback (watch durations).
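The distinction between explicit and implicit feedback can be illustrated with a minimal Python sketch. The record layout, the field names, and the 0.5 watch-ratio threshold are all assumptions made for illustration; they are not the dataset's actual schema or the authors' labeling rule.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One user-video interaction (hypothetical schema, not the dataset's)."""
    user_id: int
    video_id: int
    watch_seconds: float      # implicit feedback: how long the user watched
    video_seconds: float      # total duration of the video
    liked: bool = False       # explicit feedback
    commented: bool = False   # explicit feedback
    followed: bool = False    # explicit feedback


def watch_ratio(event: Interaction) -> float:
    """Fraction of the video that was watched, capped at 1.0 for replays."""
    if event.video_seconds <= 0:
        return 0.0
    return min(event.watch_seconds / event.video_seconds, 1.0)


def preference_label(event: Interaction, ratio_threshold: float = 0.5) -> int:
    """Return 1 for a positive signal, 0 otherwise.

    Any explicit action counts as positive; otherwise the watch ratio is
    compared with an assumed threshold.
    """
    explicit = event.liked or event.commented or event.followed
    return 1 if explicit or watch_ratio(event) >= ratio_threshold else 0
```

With these assumptions, a video skipped after a few seconds is labeled negative unless the user also liked, commented, or followed. A real study would tune the threshold or keep the watch ratio as a continuous signal.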

Attributes and Content Coverage

The dataset's novelty lies in the breadth and depth of the user and video attributes it captures. The video attributes include hierarchical categories, author details, durations, and video tags, among others, which enables semantic content understanding at multiple levels. User attributes extend beyond demographics to include geographical and device characteristics.

Figure 2: Interaction number distribution of (a) users and (b) videos.

Figure 3: Distribution of some key fields in user attributes.

Visual features pre-extracted from the video content were checked for quality with t-SNE projections: videos from the same category form distinct clusters, which suggests that the features capture category-level semantic information.
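A t-SNE plot is a visual check. A simple numerical complement, which is not the paper's method, is to ask how often a video's feature vector lies closest to the centroid of its own category. The sketch below uses only the Python standard library; the function names and the toy feature vectors are hypothetical.

```python
import math
from collections import defaultdict


def centroids(features, labels):
    """Mean feature vector per category label."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        for i, x in enumerate(vec):
            sums[lab][i] += x
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def nearest_centroid_accuracy(features, labels):
    """Share of vectors whose nearest centroid (Euclidean) is their own category.

    The centroids are computed on the same vectors, so the score is an
    optimistic, in-sample indication of how separable the categories are.
    """
    if not labels:
        return 0.0
    cents = centroids(features, labels)
    hits = 0
    for vec, lab in zip(features, labels):
        predicted = min(cents, key=lambda c: math.dist(vec, cents[c]))
        hits += predicted == lab
    return hits / len(labels)
```

A score close to 1.0 on the real features would agree with the well-separated clusters seen in the t-SNE plots, while a score near chance level would suggest the features carry little category information.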

Figure 4: Embedding visualization of videos with different (a) Category I and (b) Category III.

Benchmarking and Validation

The dataset also serves as a benchmark for recommendation algorithms. Eight models are evaluated, covering general recommenders (e.g., BPR, LightGCN) and multimodal recommenders that additionally use content features (e.g., BM3). The results support the value of fusing multimodal content with interaction data, with BM3 highlighted for its performance.
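Benchmarks of this kind are typically scored with top-K ranking metrics. The sketch below implements Recall@K and NDCG@K with binary relevance, two metrics commonly used for such comparisons. Whether the paper reports exactly these metrics, and with which K, is an assumption here, and the video identifiers in the test data are made up.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain with binary relevance.

    A hit at 0-based position `rank` contributes 1 / log2(rank + 2); the sum
    is divided by the best achievable value for this user.
    """
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Both functions score a single user. A benchmark would average them over all test users, holding out part of each user's interactions as the relevant set.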

Filter Bubble Analysis

One application examined in the paper is the filter bubble phenomenon, in which algorithm-driven content curation may narrow the diversity of content a user is exposed to. The paper measures the extent of the bubble from users' interactions with categorized video content and tracks it over time, separately for active and inactive users. Active users showed stable bubble ratios over time, which is consistent with heavy users tending to avoid being trapped in a filter bubble.
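To make the idea of a bubble ratio concrete, the following sketch uses an illustrative definition that is not taken from the paper: a user counts as "in a bubble" in a given week if they interacted with at most `max_categories` distinct video categories, and the bubble ratio is the share of that week's users who meet this condition. The event layout and the threshold are hypothetical.

```python
from collections import defaultdict


def bubble_ratio_by_week(events, max_categories=2):
    """Share of users per week whose viewing covers few categories.

    `events` is an iterable of (user_id, week_index, category) tuples.
    The bubble criterion (at most `max_categories` distinct categories)
    is an assumed definition for illustration only.
    """
    seen = defaultdict(set)  # (week, user) -> categories viewed that week
    for user_id, week, category in events:
        seen[(week, user_id)].add(category)

    users_per_week = defaultdict(int)
    bubbled_per_week = defaultdict(int)
    for (week, _user), categories in seen.items():
        users_per_week[week] += 1
        if len(categories) <= max_categories:
            bubbled_per_week[week] += 1

    return {
        week: bubbled_per_week[week] / users_per_week[week]
        for week in sorted(users_per_week)
    }
```

Computing this series separately for active and inactive users would give curves analogous in spirit to the comparison over time described above, although the paper's exact metric may differ.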

Figure 5: Analysis of the filter bubble ratio of active users (a) and inactive users (b) over time in our dataset.

Ethical Considerations

According to the paper, data collection followed ethical norms comparable to GDPR, with an emphasis on anonymization and secure data handling. Participation was voluntary and based on consent, so the data can support analysis without exposing the identity of individual users.

Conclusion

This large-scale dataset can support research in user modeling, recommendation systems, and social behavior analysis on short-video platforms, providing foundational data for developing new algorithms and testing theoretical ideas. As future work, the paper points to finer-grained video content and a longer collection period to better capture how user behavior evolves.

Through rigorous validation, ethical sourcing, and broad applicability, this dataset sets a benchmark, inviting further exploration and collaboration within the AI research community.
