
Demographic Inference and Representative Population Estimates from Multilingual Social Media Data (1905.05961v1)

Published 15 May 2019 in cs.CY, cs.CL, cs.CV, and cs.LG

Abstract: Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification to create a more representative population sample. To learn demographic attributes, we create a new multimodal deep neural architecture for joint classification of age, gender, and organization-status of social media users that operates in 32 languages. This method substantially outperforms current state of the art while also reducing algorithmic bias. To correct for sampling biases, we propose fully interpretable multilevel regression methods that estimate inclusion probabilities from inferred joint population counts and ground-truth population counts. In a large experiment over multilingual heterogeneous European regions, we show that our demographic inference and bias correction together allow for more accurate estimates of populations and make a significant step towards representative social sensing in downstream applications with multilingual social media.

Authors (7)
  1. Zijian Wang (99 papers)
  2. Scott A. Hale (48 papers)
  3. David Adelani (7 papers)
  4. Przemyslaw A. Grabowicz (21 papers)
  5. Timo Hartmann (7 papers)
  6. Fabian Flöck (12 papers)
  7. David Jurgens (69 papers)
Citations (194)

Summary

Demographic Inference from Multilingual Social Media Data

The paper develops a methodology to infer demographic attributes (specifically age, gender, and organization-status) from multilingual social media data, and to correct the sampling bias that arises when such non-representative data are used to study a broader population. This matters for "social sensing", the use of social media to gain timely insight into societal dynamics and public sentiment on a global scale.

A significant contribution of this work is the introduction of a multimodal deep neural architecture, referred to as the M3 model, which is designed to classify age, gender, and organization-status across 32 languages. This model goes beyond the typical monolingual approaches, which have limited applicability in a global context. It combines profile images, usernames, screen names, and biographical text to make multi-attribute demographic inferences. The model outperforms preexisting methods, with substantial reductions in both algorithmic bias and error, particularly in multilingual settings.

The paper implements several innovative machine learning techniques in the M3 model, notably: character-based neural networks equipped with language embeddings to handle multilingual text processing, co-training methodologies to leverage both image and text modalities, and multilingual data augmentation to address the linguistic imbalance within training datasets. This comprehensive approach not only enhances demographic inference precision but also equips the model to better understand cross-language nuances and patterns.
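As a rough illustration of the first technique, the sketch below shows how text in any script might be turned into character indices paired with a language id, which is the kind of input a character-based network with language embeddings would consume. It is not the M3 implementation: the function names, the padding and unknown-character conventions, and the `max_len` value are all assumptions made for this example.

```python
# Hypothetical preprocessing sketch; not taken from the M3 code base.
PAD, UNK = 0, 1  # reserved indices (an assumption of this example)


def build_char_vocab(texts):
    """Map every character seen in the training texts to an integer index."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 2 for i, ch in enumerate(chars)}  # 0 and 1 are reserved


def encode(text, lang, char_vocab, lang_vocab, max_len=16):
    """Return fixed-length character ids plus the id of the text's language."""
    ids = [char_vocab.get(ch, UNK) for ch in text[:max_len]]
    ids += [PAD] * (max_len - len(ids))  # right-pad short inputs
    return {"char_ids": ids, "lang_id": lang_vocab[lang]}


char_vocab = build_char_vocab(["ab", "日本"])
lang_vocab = {"en": 0, "ja": 1}
example = encode("ba?", "en", char_vocab, lang_vocab, max_len=5)
```

Working at the character level avoids a separate word vocabulary per language, and the language id gives a downstream model a way to look up a learned per-language embedding.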

In parallel, the paper formalizes a framework for debiasing non-representative samples based on estimated inclusion probabilities, which are essential for adjusting social media samples to more accurately reflect broader demographic distributions. This is accomplished through the use of post-stratification regression models that adjust for sampling biases with respect to national populations based on age and gender distributions obtained from ground-truth census data. The paper's methods show a significant reduction in error rates for predicting regional population counts across European countries.
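The inclusion-probability idea can be illustrated with a simple sketch. It is a deliberate simplification, not the paper's multilevel regression: it assumes one constant inclusion probability per age-gender stratum, estimated as the ratio of inferred user counts to census counts pooled over a set of training regions. The function names, stratum labels, and numbers are hypothetical.

```python
from collections import defaultdict


def estimate_inclusion_probabilities(regions):
    """Estimate P(person in stratum appears in the sample) from training regions.

    Each region is a dict with two mappings from stratum to count:
    'inferred' (users per stratum, from demographic inference) and
    'census' (ground-truth population per stratum).
    """
    inferred_total = defaultdict(float)
    census_total = defaultdict(float)
    for region in regions:
        for stratum, n in region["inferred"].items():
            inferred_total[stratum] += n
        for stratum, n in region["census"].items():
            census_total[stratum] += n
    return {
        s: inferred_total[s] / census_total[s]
        for s in census_total
        if census_total[s] > 0
    }


def estimate_population(inferred_counts, inclusion_probs):
    """Inverse-probability weighting: each sampled user stands for 1/p people."""
    total = 0.0
    for stratum, n in inferred_counts.items():
        p = inclusion_probs.get(stratum)
        if p:  # skip strata with no usable probability estimate
            total += n / p
    return total


training = [{
    "inferred": {("male", "<30"): 100, ("female", "<30"): 50},
    "census": {("male", "<30"): 1000, ("female", "<30"): 1000},
}]
probs = estimate_inclusion_probabilities(training)
estimate = estimate_population({("male", "<30"): 20, ("female", "<30"): 30}, probs)
```

In this toy example women under 30 are half as likely to appear in the sample as men, so each sampled woman is weighted twice as heavily when the population of a new region is estimated.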

Notably, the debiasing framework is evaluated under multiple scenarios that differ in how much demographic information is available, which highlights its adaptability. The strong performance of these models supports the claim that complex demographic patterns can be reliably inferred and used for more accurate social measurement, especially when fine-grained age and gender information is crucial for the target application.

By allowing more representative social sensing using multilingual social media, this research lays a foundation for social media analytics that reflect the demographic composition of the underlying populations. Future research could improve model interpretability and extend inference to attributes beyond the current three, such as socioeconomic status or educational background, broadening the applicability of such models in real-world settings.

In summary, this work advances the use of social media data for demographic analysis by combining machine learning for demographic inference with statistical methods for bias correction, addressing the non-representative samples typical of these platforms. Its implications span several disciplines that rely on population estimates, offering a tool for more refined demographic research.