YouTube Shorts Dataset Overview

Updated 31 December 2025
  • YouTube Shorts Dataset refers to curated collections of short-form YouTube videos (<60s) with rich multimodal metadata and specialized labeling.
  • It integrates advanced data collection pipelines including keyword crawls, API scrapes, and expert annotations to capture user behavior and recommendation paths.
  • The structured metadata—featuring transcripts, engagement metrics, and recommendation logs—enables robust research on algorithmic bias, affective computing, and digital media safety.

YouTube Shorts Dataset is a collective term denoting datasets explicitly designed to capture, analyze, and benchmark the vast and rapidly evolving landscape of YouTube Shorts—vertical short-form videos, typically less than 60 seconds, distributed on a platform now serving over 2 billion monthly users. These datasets provide comprehensive metadata, multimodal content representations (visual, audio, transcript), labeling schemas for affect and topic, cross-platform benchmarks, and user behavior logs, offering foundational resources for academic research in algorithmic bias, recommendation systems, affective computing, social science, quality assessment, and digital media safety.

1. Dataset Construction and Collection Strategies

Recent benchmark YouTube Shorts datasets follow rigorous, multi-stage data collection pipelines adapted for platform-level scale and legal compliance. Cakmak et al. (Cakmak et al., 7 Jul 2025) assembled their corpus via seed-based keyword crawls, leveraging APIFY’s YouTube Scraper (streamers/youtube-scraper), YouTube Data API v3 for static metadata, and custom transcript pipelines. Scrapes targeted three principal domains—general content, the South China Sea dispute, and the 2024 Taiwan presidential election—with recommendation chains explored using Selenium-driven simulated watch-time regimes (3 s, 15 s, 60 s) under strict browser isolation (no cookies, logged out). Selection criteria ensured inclusion of only Shorts (≤ 60s) and exhaustive topical depth via multi-year and event-specific coverage.
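
The following is a minimal sketch of a watch-time-controlled recommendation crawl in the spirit of this pipeline; the incognito setup mirrors the logged-out, cookie-free regime described above, while the arrow-key chain advance and selector choices are assumptions rather than the authors' actual code.

```python
# Minimal sketch of a watch-time-controlled Shorts crawl (assumed mechanics,
# not the authors' exact pipeline). Requires: pip install selenium
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def crawl_chain(seed_url: str, watch_time_s: int, depth: int = 10) -> list[str]:
    """Follow a Shorts recommendation chain under a fixed simulated watch time."""
    opts = Options()
    opts.add_argument("--incognito")  # cookie-free, logged-out browsing session
    driver = webdriver.Chrome(options=opts)
    visited = []
    try:
        driver.get(seed_url)
        for _ in range(depth):
            time.sleep(watch_time_s)  # simulate the 3 s / 15 s / 60 s regimes
            visited.append(driver.current_url)
            # Advance to the next recommended Short (hypothetical: in practice the
            # pipeline may interact with player controls or scroll the feed).
            driver.find_element(By.TAG_NAME, "body").send_keys(Keys.ARROW_DOWN)
    finally:
        driver.quit()
    return visited
```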

Repurpose-10K (Wu et al., 2024) approaches the "long-to-short" video repurposing challenge for Shorts via a two-stage annotation process: automated topical segmentation followed by user-driven cut selection and expert timestamp refinement, explicitly mitigating annotation bias while preserving the authenticity of user-generated content. MetaHarm (Jo et al., 22 Apr 2025) employs keyword- and channel-based search, leveraging external harm datasets and systematic frame extraction (yt_dlp + OpenCV, thumbnails + 15 frames) to create multimodal samples for online harm analysis.
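
A short, hedged sketch of the kind of frame-sampling step MetaHarm describes (yt_dlp download plus OpenCV frame grabs); the output file name and even-spacing policy are illustrative assumptions:

```python
# Sketch of yt_dlp + OpenCV frame extraction (file name and sampling policy
# are assumptions). Requires: pip install yt-dlp opencv-python
import cv2
import yt_dlp

def extract_frames(video_url: str, n_frames: int = 15) -> list:
    """Download a Short and sample roughly n_frames evenly spaced frames."""
    ydl_opts = {"outtmpl": "video.mp4", "format": "mp4", "quiet": True}
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([video_url])

    cap = cv2.VideoCapture("video.mp4")
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // n_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames[:n_frames]
```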

Efficient recommendation-bias studies (Dagtas et al., 29 Jul 2025) use Python multiprocessing and Selenium to parallelize dynamic crawls of recommendation chains, scraping the browser DOM directly due to API limitations for Shorts. All frameworks emphasize deduplication, manual review for content appropriateness, and comprehensive logging of recommendation paths and watch-time conditions.
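
A minimal sketch of how such a crawl can be parallelized with Python multiprocessing, assuming a per-process crawl worker like the hypothetical crawl_chain function sketched earlier (seed URLs are placeholders):

```python
# Sketch of parallelized recommendation-chain crawls; crawl_chain is the
# hypothetical worker from the earlier sketch, not the authors' code.
from multiprocessing import Pool

def crawl_worker(args):
    seed_url, watch_time_s = args
    return {"seed": seed_url,
            "watch_time": watch_time_s,
            "chain": crawl_chain(seed_url, watch_time_s)}

if __name__ == "__main__":
    seeds = ["https://www.youtube.com/shorts/VIDEO_ID_1",
             "https://www.youtube.com/shorts/VIDEO_ID_2"]  # placeholder seeds
    jobs = [(s, t) for s in seeds for t in (3, 15, 60)]    # one job per watch-time regime
    with Pool(processes=4) as pool:  # one browser instance per worker process
        results = pool.map(crawl_worker, jobs)
```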

2. Scope, Scale, and Structural Properties

YouTube Shorts datasets span orders of magnitude from event-specific corpora to near–platform-scale repositories:

| Corpus (Paper) | Unique Videos | Domains / Topics | Temporal Coverage |
| --- | --- | --- | --- |
| Cakmak et al. (Cakmak et al., 7 Jul 2025) | 685,842 | General, SCS, Taiwan election | 2022–2024, Jan 2024 |
| Repurpose-10K (Wu et al., 2024) | 120,925* | 8 UGC-driven categories | ~4,540 hr, ongoing |
| MetaHarm (Jo et al., 22 Apr 2025) | 19,422 (annot.) | Clickbait, hate, info., etc. | 2022–2024+ |
| Quality (Wang et al., 2024) | 4,030 | 10 genre categories | Recent, HDR pool |
| Shorts vs. Reg. (Violot et al., 2024) | 16.75M | 15 YouTube categories | 2021–2022 |

(*Repurpose-10K: annotated clips; source videos = 11,210)

Cakmak et al. report domain distributions of 322,687 general, 320,724 SCS, and 42,431 Taiwan-election videos. Repurpose-10K's content taxonomy (inspired by YouTube's own) comprises Vlogs & Lifestyle, Tutorials & How-Tos, Podcasts & Interviews, Travel & Adventure, Cooking & Food, Gaming, Sports & Fitness, and Other. The Shorts vs. Regular Videos dataset (Violot et al., 2024) tracks 70,712 channels across over 16M videos longitudinally.

3. Metadata, Features, and Labeling Schemas

Datasets typically offer the following core video fields (an illustrative record sketch follows the list):

  • video_id (str): Unique YouTube ID
  • title (str): Short descriptive text
  • transcript (str): Full auto-generated, Whisper-derived, or human-refined transcript
  • view_count, like_count, comment_count (int): Engagement metrics
  • seed_id (str): Reference video for recommended chains
  • depth (int): Recommendation chain position
  • watch_time_condition (int): Simulated watch duration
  • crawl_timestamp (str): ISO8601 crawl time
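
A hypothetical single record illustrating these fields; all values are invented for illustration, not drawn from any released corpus:

```python
# Illustrative record with the core fields above; values are placeholders.
record = {
    "video_id": "AbCdEfGhIjK",
    "title": "Example Short",
    "transcript": "auto-generated caption text ...",
    "view_count": 120345,
    "like_count": 8912,
    "comment_count": 143,
    "seed_id": "ZyXwVuTsRqP",
    "depth": 4,                    # fourth hop in the recommendation chain
    "watch_time_condition": 15,    # seconds of simulated watching
    "crawl_timestamp": "2024-01-17T09:42:00Z",
}
```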

Labeling methodologies vary by task. Cakmak et al. (Cakmak et al., 7 Jul 2025) leverage GPT-4o for prompt-based classification, normalizing relevance across four levels (none, low, medium, high), which are mapped to 0–3 and divided by 3 to yield scores in [0, 1], alongside topic labels (“politics,” “non-entertainment,” “entertainment”) and emotion labels (“joy/happiness,” “neutral,” “sadness,” “anger,” “fear”). The prompt-based schema is used uniformly for zero-/one-shot AI-driven classification.
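
The normalization step reduces to a simple mapping; a minimal sketch (label set taken from the description above, function name assumed):

```python
# Sketch of the four-level relevance normalization: levels map to 0..3,
# then divide by 3 to obtain scores in [0, 1].
RELEVANCE_LEVELS = {"none": 0, "low": 1, "medium": 2, "high": 3}

def relevance_score(level: str) -> float:
    return RELEVANCE_LEVELS[level] / 3  # e.g. "medium" -> 0.667
```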

In the domain of harmful content, MetaHarm (Jo et al., 22 Apr 2025) annotates each video with binary (“harmful”/“harmless”) and six non-mutually exclusive multi-label categories—information harms, hate & harassment, addictive, clickbait, sexual, physical—using consensus among domain experts, LLM annotators (GPT-4-Turbo with both frames and textual input), and crowdsourced MTurk master workers. Agreement is quantified via Holsti’s index (H=0.88), Cohen's κ=0.76, Krippendorff's α for LLM (0.78) and crowd (0.21).
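
For readers reproducing such agreement figures, pairwise Cohen's kappa on the binary harmful/harmless labels can be computed with scikit-learn; the labels below are illustrative, not MetaHarm data:

```python
# Sketch of pairwise agreement on binary harmful/harmless labels.
from sklearn.metrics import cohen_kappa_score

expert_labels = [1, 0, 1, 1, 0, 1, 0, 0]  # 1 = harmful, 0 = harmless
llm_labels    = [1, 0, 1, 0, 0, 1, 0, 1]
print(f"Cohen's kappa: {cohen_kappa_score(expert_labels, llm_labels):.2f}")
```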

4. Analytical Frameworks: Algorithmic Bias, Engagement, Affect, and Quality

Algorithmic bias is central to Cakmak et al.’s analysis (Cakmak et al., 7 Jul 2025), where the dataset supports quantification of recommendation drift (“consistent drift away from politically sensitive content toward entertainment-focused videos”), emotional preference (systematic favoring of “joyful or neutral” content), and popularity bias (highly viewed/liked videos are promoted disproportionately). Recommendation chains under three watch-time conditions simulate exposure, drift, and recency effects. Metrics such as relevance scores, topic/category entropy, and engagement indicators (likes/views/comments per video) provide further granularity.
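
Topic entropy over a recommendation chain is one such granular metric; a minimal sketch, assuming Shannon entropy over the chain's topic labels:

```python
# Sketch of topic entropy along a recommendation chain: higher values indicate
# more topical diversity at a given chain depth (labels are illustrative).
import math
from collections import Counter

def topic_entropy(labels: list[str]) -> float:
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

chain = ["politics", "entertainment", "entertainment", "non-entertainment"]
print(topic_entropy(chain))  # 1.5 bits
```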

SFV+HDR Quality (Wang et al., 2024) introduces mean opinion scores (MOS: slider rating aggregation) across 4030 videos spanning SDR, HDR2SDR, and native HDR, categorized by 10 genres. Objective VQA benchmarking employs DOVER, FAST-VQA, and FasterVQA, reporting PLCC/SRCC correlations against MOS, revealing content-dependent metric reliability and lower performance in tone-mapped HDR and gameplay genres.
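
PLCC and SRCC against MOS are the standard correlation checks here; a minimal sketch with SciPy (the score arrays are placeholders, not benchmark values):

```python
# Sketch of PLCC/SRCC computation between objective VQA predictions and MOS.
from scipy.stats import pearsonr, spearmanr

mos        = [4.2, 3.1, 2.8, 4.8, 3.9]       # subjective mean opinion scores
vqa_scores = [0.81, 0.55, 0.60, 0.92, 0.70]  # e.g. DOVER-style predictions

plcc, _ = pearsonr(vqa_scores, mos)
srcc, _ = spearmanr(vqa_scores, mos)
print(f"PLCC={plcc:.3f}  SRCC={srcc:.3f}")
```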

The Shorts vs. Regular Videos dataset (Violot et al., 2024) provides derived engagement metrics (views, likes/comments per view, channel-normalized upload rates), temporal evolution of category distribution, and comparative performance for Shorts in entertainment, education, and politics.

Emotion multimodality, as in MSEVA/bili-news (Wei et al., 2023), extends the framework with audio segmentation (silence-driven), Whisper-based transcripts, and consensus annotation on the PANAS affect scale (“positive,” “negative,” “uncertain”). Comparative ablations highlight that full transcript input significantly improves affect classification over video title alone.
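
A minimal sketch of Whisper-based transcript generation of this kind (the audio path is a placeholder; the silence-driven segmentation step is not shown):

```python
# Sketch of Whisper transcription. Requires: pip install openai-whisper
import whisper

model = whisper.load_model("base")
result = model.transcribe("short_audio.wav")  # placeholder audio segment
print(result["text"])
```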

5. Data Access, Ethics, and Technical Validation

Access modalities for Shorts datasets vary. Some are released under CC BY-NC-SA (MetaHarm (Jo et al., 22 Apr 2025), VCSL short-video (Yanagi et al., 2024), SFV+HDR (Wang et al., 2024), MSEVA/bili-news (Wei et al., 2023)); others (Cakmak et al., Shorts vs. Regular) require direct author request or participation in Google’s YouTube Researcher Program due to YouTube Terms of Service restrictions. Datasets are typically provided as JSON, CSV, or Parquet schemas with documented record layouts (see (Wang et al., 2024, Dagtas et al., 29 Jul 2025)), alongside notebook scripts, loading utilities, and reproducible split procedures.
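
Loading such a dump is typically a one-liner; a minimal sketch assuming a Parquet release with the Section 3 field names (the file name is a placeholder):

```python
# Sketch of loading a released metadata dump and filtering by crawl condition.
import pandas as pd

df = pd.read_parquet("shorts_metadata.parquet")  # or pd.read_json / pd.read_csv
subset = df[df["watch_time_condition"] == 15]
print(subset[["video_id", "view_count", "depth"]].head())
```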

Technical validation is fourfold in full-stack datasets (Shang et al., 9 Feb 2025): coverage analysis for user-video interactions, demographic/geographic spread, embedding quality (t-SNE visualizations for category separation), and downstream benchmark performance (rec. algorithms: BM3, LightGCN, MMGCN, etc.). Ethical practices mandate anonymization, opt-out, and minimal geodata granularity; licenses require non-commercial and academic use.
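
The embedding-quality check can be reproduced with a standard t-SNE projection; a minimal sketch in which random vectors stand in for real video embeddings:

```python
# Sketch of the t-SNE embedding-separation check (placeholder embeddings).
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.rand(500, 128)  # stand-in for learned video embeddings
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
# coords can then be scatter-plotted and colored by category label
```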

6. Benchmark Tasks, Limitations, and Future Opportunities

YouTube Shorts datasets support benchmark tasks in recommendation, algorithmic bias measurement, copy detection, video repurposing, quality estimation, and affective analysis. Short-video copy detection (Yanagi et al., 2024) reveals segmentation bottlenecks: segment-level F1 drops sharply with shorter clips, but video-level retrieval becomes easier as distractors decrease. Repurpose-10K (Wu et al., 2024) enables temporal grounding evaluation with tIoU and recall/precision/F1 metrics; ablations emphasize the advantage of multimodal (A/V/C) fusion, and limitations highlight ASR robustness, visual creative transitions, and coverage breadth.
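
Temporal IoU is the core matching criterion in this style of grounding evaluation; a minimal sketch of its computation for a single predicted segment:

```python
# Sketch of temporal IoU (tIoU) between predicted and ground-truth segments.
def t_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(t_iou((12.0, 55.0), (10.0, 50.0)))  # ≈ 0.84
```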

Identified limitations across datasets include: non-random seed selection (sampling bias), edge cases in Shorts identification (pre-2021 verticals), constraints on engagement metric timing, and platform licensing restrictions. Recommendations include integrating granular mood/style embeddings, joint segment selection/transition generation, and expanded social science studies on filter bubbles, bias, and information diversity.

A plausible implication is that further refinement of recommendation chain logging, multimodal annotation, and open benchmarks tailored to Shorts’ unique dynamics will continue to drive advances in both machine learning and social media research.
