Analyzing Bias in Political Polarization Studies on Social Media
Social media platforms serve as significant battlegrounds for political discourse, often exacerbating polarization through echo chambers and selective exposure. However, the reliability of data used in studies of political polarization has come into question, especially given the restrictive data-access policies of platforms like X (formerly Twitter). The paper "Sampled datasets risk substantial bias in the identification of political polarization on social media" examines how data accessibility constraints affect the measurement of political polarization with sampled datasets. The research focuses specifically on the structural polarization of the Polish political debate on Twitter over a 24-hour period, providing crucial insights for social media research and policy implementation.
Core Findings
The researchers present three primary findings regarding the reliability of sampled social media data for measuring political polarization:
- Subset of Political Discussions: The paper indicates that political discourse forms only a small subset of the broader discussions happening on social media platforms. This distinction is crucial, as non-political content can dilute or obscure the identification of political polarization.
- Sample Size and Representativeness: The researchers demonstrate that while large samples can accurately represent the political discussion on a platform, small samples consistently fail to do so. For instance, random samples of 40% or more of the data yield ideology distributions closely approximating those obtained from the full dataset. In contrast, smaller samples fail to yield a statistically significant bimodal ideology distribution.
- Keyword-based Sampling Biases: Although keyword-based samples can be representative if chosen meticulously, poorly selected keywords can introduce substantial biases. Keywords associated with a specific political ideology may skew the results, either underrepresenting or overrepresenting particular sides of the political spectrum.
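The sample-size effect can be illustrated with a small simulation: draw random subsamples from a synthetic bimodal "ideology" distribution and compare each subsample to the full data using a quantile-based approximation of the Wasserstein distance. This is a sketch, not the paper's pipeline; the data and fractions are illustrative.

```python
import numpy as np

def quantile_wasserstein(a, b, grid=200):
    """Approximate the 1-D Wasserstein-1 distance as the mean absolute
    difference between the two empirical quantile functions."""
    qs = np.linspace(0.0, 1.0, grid)
    return float(np.mean(np.abs(np.quantile(a, qs) - np.quantile(b, qs))))

rng = np.random.default_rng(42)
# Synthetic bimodal "ideology" scores: two camps centred at -1 and +1.
full = np.concatenate([rng.normal(-1.0, 0.3, 5000),
                       rng.normal(+1.0, 0.3, 5000)])

# Larger subsamples should land closer to the full distribution.
for frac in (0.05, 0.20, 0.40, 0.80):
    sample = rng.choice(full, size=int(frac * len(full)), replace=False)
    print(f"{frac:.0%} sample: W ~ {quantile_wasserstein(sample, full):.4f}")
```

The distance is zero only when a sample reproduces the full distribution exactly; in practice it shrinks toward zero as the sampled fraction grows, mirroring the paper's observation about 40%+ samples.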
Methodology
The paper adopts a multifaceted approach to assess the reliability and biases of different sampling techniques:
- Random Sampling: Random subsets of varying sizes are extracted from the full dataset to evaluate how the sample size impacts the identification of polarization.
- Keyword-based Sampling: A careful selection of keywords related to general political terms and prominent political figures is used to filter data.
- Seed-based Sampling: Influential political accounts are used as seeds to gather interactions, analyzing how varying the number of these seeds affects the polarization measures.
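The three strategies can be sketched over a toy list of (author, text, retweeted_author) records. The usernames, keywords, and seed accounts below are illustrative placeholders, not the paper's actual data or seed lists.

```python
import random

# Toy records: (author, text, retweeted_author or None). Illustrative only.
tweets = [
    ("u1", "vote for party A #election", "seedA"),
    ("u2", "party B rally today", "seedB"),
    ("u3", "nice weather in Warsaw", None),
    ("u4", "debate tonight #election", "seedA"),
    ("u5", "party B policy thread", None),
]

def random_sample(data, frac, seed=0):
    """Random sampling: a uniform subset of a given fraction of the data."""
    rnd = random.Random(seed)
    return rnd.sample(data, k=int(frac * len(data)))

def keyword_sample(data, keywords):
    """Keyword-based sampling: keep tweets containing any target keyword."""
    return [t for t in data if any(k in t[1] for k in keywords)]

def seed_sample(data, seeds):
    """Seed-based sampling: keep interactions with influential seed accounts."""
    return [t for t in data if t[2] in seeds]

print(len(random_sample(tweets, 0.4)))                     # 2 of 5 records
print([t[0] for t in keyword_sample(tweets, {"#election"})])
print([t[0] for t in seed_sample(tweets, {"seedA"})])
```

Each strategy induces a different bias surface: random sampling is unbiased but noisy at small fractions, while keyword and seed selection bake the analyst's choices into the resulting network.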
To estimate the ideological positions of users, the paper employs latent ideology estimation modeled after Barberá's approach. The ideological landscape is quantified using metrics such as Hartigan's dip test for multimodality and the Wasserstein distance for comparing distributional similarity to the baseline.
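The core idea of latent ideology estimation can be sketched with a centred user-by-influencer interaction matrix whose first singular dimension serves as the ideology axis. This is a simplified stand-in for Barberá-style correspondence analysis, and the matrix is a toy example.

```python
import numpy as np

# Rows = users, columns = seed influencers; entries = interaction counts.
# Users 0-2 interact only with the first two seeds, users 3-4 with the others.
M = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Centre each column, then take the first left singular vector (scaled by
# its singular value) as each user's latent ideology score. The overall
# sign of the axis is arbitrary, as in correspondence analysis.
Mc = M - M.mean(axis=0)
U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
ideology = U[:, 0] * s[0]
print(np.round(ideology, 3))
```

Users who interact with disjoint sets of seed accounts end up on opposite sides of zero, which is exactly the bimodal structure the dip test then probes.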
Numerical Findings
The numerical results underline several critical points:
- Polarization Metrics: Under Hartigan's dip test, the dip statistic D becomes statistically significant only when samples exceed 40% of the full dataset. Similarly, the Wasserstein distance decreases sharply for larger samples, indicating closer alignment with the true ideology distribution.
- Effect of Influencer Selection: Varying the number of influential seed accounts shows that using at least 20% of the top political influencers yields stable polarization measures.
- Graph Measures: The integrity of the retweet network, measured by the relative size of the largest weakly connected component (LWCC), corroborates the findings that insufficient data leads to fragmented, non-representative networks.
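The LWCC measure can be computed directly from a retweet edge list. Below is a minimal union-find sketch over toy edges, not the paper's network or a library implementation:

```python
from collections import Counter

def lwcc_fraction(nodes, edges):
    """Relative size of the largest weakly connected component of a
    directed graph, treating every edge as undirected (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:  # edge direction is ignored ("weakly" connected)
        parent[find(a)] = find(b)

    sizes = Counter(find(n) for n in nodes)
    return max(sizes.values()) / len(nodes)

# Toy retweet network: a 3-node component and a 2-node component.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(lwcc_fraction(nodes, edges))  # → 0.6
```

When sampling removes too many edges, this fraction drops and the remaining network fragments, which is the paper's signal that a sample no longer represents the full discussion.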
Implications and Future Research
The findings of this paper carry profound implications for social media research and policy frameworks like the European Union's Digital Services Act (DSA). The evidence supports a need for comprehensive data access to reach accurate research conclusions, highlighting the drawbacks of relying on small or poorly selected samples. The practical implications extend to strengthening the guidelines and technical requirements under legislation such as the DSA to ensure robust and representative data provision by social media platforms.
In theoretical terms, the paper calls for refined methodologies to measure political polarization accurately, especially in constrained data environments. Future research could extend these methods to other social media platforms and geographical contexts to generalize the findings. Additionally, exploring alternative quantitative and graph-based measures of polarization could offer more nuanced insights.
In conclusion, this paper provides a meticulous analysis of the biases introduced by data sampling techniques in studying political polarization on social media. The results underscore the importance of access to comprehensive datasets and the need for careful selection of sampling methods to avoid skewed research outcomes. This work serves as a foundational guideline for future computational social science research and policy implementation in the context of evolving data accessibility challenges.