Tube2Vec: Social and Semantic Embeddings of YouTube Channels (2306.17298v1)
Abstract: Research using YouTube data often explores social and semantic dimensions of channels and videos. Typically, analyses rely on laborious manual annotation of content and content creators, often found by low-recall methods such as keyword search. Here, we explore an alternative approach, using latent representations (embeddings) obtained via machine learning. Using a large dataset of YouTube links shared on Reddit, we create embeddings that capture social sharing behavior, video metadata (title, description, etc.), and YouTube's video recommendations. We evaluate these embeddings using crowdsourcing and existing datasets, finding that recommendation embeddings excel at capturing both social and semantic dimensions, although social-sharing embeddings better correlate with existing partisan scores. We share embeddings capturing the social and semantic dimensions of 44,000 YouTube channels for the benefit of future research on YouTube: https://github.com/epfl-dlab/youtube-embeddings.