
Who&When Dataset: Cross-Platform Insights

Updated 15 September 2025
  • The Who&When Dataset is a composite resource linking detailed Twitter user data with YouTube metadata to analyze who watches and shares videos over time.
  • It reveals superlinear scaling where early Twitter shares significantly boost YouTube views, highlighting the influence of retweet reputation and secondary network exposure.
  • The study employs regression modeling and clustering techniques to uncover demographic, geographic, and political patterns that shape sharing behaviors across platforms.

The Who&When Dataset is a composite resource that integrates user-centric Twitter data with video-centric YouTube data to enable large-scale, multidimensional analysis of online viewership and sharing behavior: who watches and shares which YouTube videos, and when. Built by pairing 87,000 Twitter users, 5.6 million YouTube videos, and 15 million Twitter sharing events over a sampled 28-hour period, the dataset allows cross-platform correlation beyond what either source could support in isolation.

1. User Demographics and Behavioral Features

User profiling within the Who&When Dataset encompasses demographic distinctions and behavioral quantification. Demographics (inferred gender, age proxies, urban/rural location classification by geo-coding Twitter profile data, and interest extraction via WeFollow directory cross-referencing) are combined with behavioral measures such as tweet frequency, retweet fraction, and prevalence of URLs/hashtags.

Users are clustered via agglomerative hierarchical methods using cosine similarity over normalized YouTube category sharing distributions. Analyses establish correlations between Twitter activity and video sharing patterns: clusters of sports content sharers are predominantly male, whereas entertainment and people/blog-centric clusters skew female. Geographic and political orientation further stratify behavioral patterns: urban users exhibit a shorter median sharing lag (143 hours for non-promotional accounts) than rural users (157 hours). Right-leaning users share News & Politics content roughly three days earlier than their left-leaning counterparts. "Influence" is indexed by retweet rates rather than follower counts, with influential users catalyzing higher view counts for shared videos.
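The clustering step described at the start of this section can be illustrated with a minimal sketch. It assumes average linkage and a similarity cutoff of 0.5, neither of which is stated in the source; the three categories and the per-user share counts are toy values chosen only for illustration.

```python
import math
from itertools import combinations

def normalize(counts):
    """Turn raw per-category share counts into a distribution summing to 1."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def agglomerative(vectors, threshold):
    """Merge the two most similar clusters until no pair reaches `threshold`.

    Average linkage is an assumption of this sketch, not a detail from the source.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        sims = [cosine_similarity(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Toy share counts per user over (Sports, Music, News/Politics); hypothetical data.
raw_counts = [[9, 1, 0], [8, 2, 0], [0, 1, 9], [1, 0, 9]]
vectors = [normalize(c) for c in raw_counts]
clusters = agglomerative(vectors, threshold=0.5)
print(clusters)  # [[0, 1], [2, 3]]
```

In this toy input the two sports-heavy users end up in one cluster and the two news-heavy users in another; real category distributions would be far noisier and the cutoff would need tuning.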

2. Video Metadata, Popularity Metrics, and Polarization

Video-side features derive directly from YouTube’s API, incorporating categorical metadata (e.g., Music, Gaming, News/Politics), uploader information, view and like/dislike counts, and Freebase topic tags (legacy semantic labels). The paper introduces a rescaled polarization metric that accounts for nonlinear scaling in like/dislike behavior:

$$\mathrm{Pol}(v) = \frac{L_v / V_v^{0.849}}{D_v / V_v^{0.884}}$$

where $L_v$, $D_v$, and $V_v$ denote likes, dislikes, and view counts for video $v$, respectively.
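As a quick illustration, the polarization metric can be computed directly from the three counts. The function name, argument names, and the error handling for zero counts are assumptions of this sketch; only the two exponents come from the text. Algebraically the ratio reduces to $(L_v / D_v) \cdot V_v^{0.035}$, so for a fixed like/dislike ratio the score grows slowly with view count.

```python
def polarization(likes, dislikes, views, like_exp=0.849, dislike_exp=0.884):
    """Rescaled like/dislike ratio; exponents are the ones quoted in the text."""
    if views <= 0 or dislikes <= 0:
        # How the source treats videos with no dislikes or views is not stated;
        # raising here is a choice made for this sketch.
        raise ValueError("views and dislikes must be positive")
    scaled_likes = likes / views ** like_exp
    scaled_dislikes = dislikes / views ** dislike_exp
    return scaled_likes / scaled_dislikes

# Hypothetical counts, for illustration only.
score = polarization(likes=900, dislikes=30, views=10_000)
```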

A power-law relationship between initial Twitter shares ($S$) and final YouTube views ($V$) is observed for non-promotional users:

$$V \propto S^{\alpha}, \quad \alpha \approx 2.18$$

This substantiates a superlinear scaling effect: early Twitter sharing significantly amplifies downstream viewership. Meanwhile, social impact features (mean retweet count of sharers, secondary exposure measured by followers-of-followers) outperform raw follower sums in predicting final popularity, indicating that the structural position and reputation of early sharers matter more than sheer exposure volume.
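One common way to estimate such an exponent is an ordinary least-squares fit on log-transformed counts; whether the original analysis used exactly this procedure is not stated here, so the following is only a sketch. The data are synthetic, generated from the quoted exponent with an arbitrary constant of 50, so the fit simply recovers what was put in. Under an exponent of 2.18, doubling early shares corresponds to roughly 4.5 times the final views ($2^{2.18} \approx 4.5$).

```python
import math

def fit_power_law(shares, views):
    """Least-squares fit of log(views) = log(c) + alpha * log(shares).

    Returns (alpha, c). All counts must be strictly positive.
    """
    xs = [math.log(s) for s in shares]
    ys = [math.log(v) for v in views]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    c = math.exp(mean_y - alpha * mean_x)
    return alpha, c

# Synthetic, noise-free data built from V = 50 * S^2.18 (illustrative only).
shares = [1, 2, 4, 8, 16]
views = [50 * s ** 2.18 for s in shares]
alpha, c = fit_power_law(shares, views)
```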

3. Sharing Event Temporality and Dynamic Patterns

Temporal analysis hinges on the lag $\Delta t$ between video upload and the Twitter sharing event. System-wide, a coherent onset of attention emerges: videos are typically shared within hours to days, but sharing probability declines rapidly after a category-dependent time threshold.

Category-specific median lags reflect differentiated temporal attention patterns:

  • Gaming: 8 hours
  • News/Politics: 15 hours
  • Movies/Trailers: Several months
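Per-category median lags like those above can be computed from (category, upload time, share time) records. The record layout and the decision to skip shares timestamped before the upload are assumptions of this sketch, and the timestamps are toy values that do not reproduce the reported medians.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median

def median_lag_hours(events):
    """Median upload-to-share lag in hours, grouped by video category.

    `events` is an iterable of (category, uploaded_at, shared_at) tuples.
    """
    lags = defaultdict(list)
    for category, uploaded_at, shared_at in events:
        lag = (shared_at - uploaded_at).total_seconds() / 3600
        if lag >= 0:  # skip inconsistent records (share before upload)
            lags[category].append(lag)
    return {category: median(values) for category, values in lags.items()}

# Toy events with made-up timestamps.
upload = datetime(2013, 1, 1)
events = [
    ("Gaming", upload, upload + timedelta(hours=2)),
    ("Gaming", upload, upload + timedelta(hours=6)),
    ("Gaming", upload, upload + timedelta(hours=10)),
    ("News/Politics", upload, upload + timedelta(hours=12)),
    ("News/Politics", upload, upload + timedelta(hours=24)),
]
result = median_lag_hours(events)
```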

Promotional accounts (detected via username overlap and abnormal sharing rates) disseminate content much more rapidly than non-promotional accounts (median lag: 18 hours vs. 38 hours, respectively). Urban/rural divisions and political orientation additionally structure sharing speeds.

4. Data Integration, Feature Construction, and Regression Modeling

The dataset construction process involves multi-staged enrichment and linkage:

  • Extraction of YouTube IDs from tweets; metadata retrieval via YouTube API.
  • Profile enrichment for Twitter users: demographics, interests, geo-location classification (via Yahoo’s Placemaker), behavioral metrics.
  • Promotional account filtering (longest common substring heuristics, excessive sharing rates).
  • Cosine-based clustering over shared video category distributions.
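The promotional-account filter in the list above might look roughly like the following. This is a hedged sketch: comparing the Twitter handle against the YouTube uploader name, the 0.6 overlap fraction, and the 20-shares-per-day cutoff are all hypothetical choices, since the source names the heuristics but not their parameters.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1  # extend the match ending at the previous characters
                best = max(best, curr[j])
        prev = curr
    return best

def looks_promotional(twitter_handle, uploader_name, shares_per_day,
                      min_overlap=0.6, max_shares_per_day=20):
    """Flag an account if its handle overlaps the uploader name or it shares too often.

    Both thresholds are illustrative placeholders, not values from the source.
    """
    a, b = twitter_handle.lower(), uploader_name.lower()
    shorter = min(len(a), len(b))
    overlap = longest_common_substring(a, b) / shorter if shorter else 0.0
    return overlap >= min_overlap or shares_per_day > max_shares_per_day

# Hypothetical account names.
flag_same_brand = looks_promotional("AcmeGames", "acmegamestv", shares_per_day=1)
flag_ordinary = looks_promotional("jane_doe", "acmegamestv", shares_per_day=2)
```

A filter of this kind trades recall for precision: a loose overlap threshold catches more self-promoting channels but also risks flagging fans whose handles merely resemble an uploader's name.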

The paper introduces a regression model for predicting final video view count as a function of first-week sharing activity, user social impact, and exposure metrics:

$$V_v \propto S_v^{\alpha} \cdot I_v^{\beta} \cdot E_v^{\gamma} \cdot \mathcal{E}_v^{\delta} \cdot A_v^{\kappa}$$

Here:

  • $S_v$ = total first-week shares
  • $I_v$ = aggregated mean retweet rate
  • $E_v$ = first-order exposure (sum of followers)
  • $\mathcal{E}_v$ = second-order exposure (followers of followers)
  • $A_v$ = share of voice

Log-transforming yields a linear regression model, fitted separately for promotional and non-promotional strata. The model accounts for 10–20% of variance in the log(final view count), validated via classification into popular/non-popular categories.
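Written out, taking logarithms of the multiplicative model gives a form that is linear in the exponents. The intercept $c$ (the log of the proportionality constant) and the residual term $\varepsilon_v$ are notation introduced here for illustration and do not appear in the source:

```latex
\log V_v = c + \alpha \log S_v + \beta \log I_v + \gamma \log E_v
         + \delta \log \mathcal{E}_v + \kappa \log A_v + \varepsilon_v
```

Each exponent then reads as an elasticity: with the other features held fixed, a 1% increase in $S_v$ is associated with approximately an $\alpha$% change in $V_v$.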

5. Cross-Platform Findings and Novel Contributions

The Who&When Dataset is the first at this scale to link granular Twitter activity with comprehensive YouTube metadata and sharing timelines. Key contributions include:

  • Discovery of superlinear scaling between early Twitter shares and final YouTube views.
  • Quantification of the predictive power of retweet reputation and secondary network exposure.
  • Elucidation of demographic, interest-based, geographic, and political correlates of sharing behavior.
  • Category-dependent temporal response patterns for video attention on social media.

Applications encompass audience analytics, creator marketing strategies, recommendation/reputation algorithm design, and spam-detection tools. The integration paradigm and analytical methodologies set a precedent for future cross-platform social-media research.

6. Limitations and Methodological Considerations

The dataset merges two time-fixed samples (Twitter sharing events and YouTube video metadata), potentially restricting longitudinal inference beyond the sampled intervals. Promotional account detection is heuristic and may inadvertently exclude legitimate power users. Model explanatory power remains modest (peak $R^2 \approx 0.2$), indicating that popularity is multi-causal and partly unpredictable even with rich cross-platform features.

These limitations suggest that incorporating higher-dimensional network analytics, expanding temporal coverage, and better distinguishing organic from platform-driven sharing may further improve predictive accuracy.

7. Implications for Future Research and Practice

The Who&When Dataset provides a template for joint social-network and media feature integration, demonstrating that early cross-platform indicators (sharing frequency, retweet dynamics) are robust (if incomplete) signals of subsequent popularity. Content producers, marketers, and algorithm designers can leverage demographic/interest-based stratification and early sharing metrics for refined targeting and forecasting. For empirical research, the framework enables exploration of the social contagion mechanisms underpinning digital content virality and supports the development of advanced, cross-platform analytical models for viewership and engagement dynamics.
