
How Well Do My Results Generalize Now? The External Validity of Online Privacy and Security Surveys

Published 28 Feb 2022 in cs.HC | (2202.14036v2)

Abstract: Privacy and security researchers often rely on data collected through online crowdsourcing platforms such as Amazon Mechanical Turk (MTurk) and Prolific. Prior work -- which used data collected in the United States between 2013 and 2017 -- found that MTurk responses regarding security and privacy were generally representative for people under 50 or with some college education. However, the landscape of online crowdsourcing has changed significantly over the last five years, with the rise of Prolific as a major platform and the increasing presence of bots. This work attempts to replicate the prior results about the external validity of online privacy and security surveys. We conduct an online survey on MTurk (n=800), a gender-balanced survey on Prolific (n=800), and a representative survey on Prolific (n=800) and compare the responses to a probabilistic survey conducted by the Pew Research Center (n=4272). We find that MTurk response quality has degraded over the last five years, and our results do not replicate the earlier finding about the generalizability of MTurk responses. By contrast, we find that data collected through Prolific is generally representative for questions about user perceptions and experiences, but not for questions about security and privacy knowledge. We also evaluate the impact of Prolific settings, attention check questions, and statistical methods on the external validity of online surveys, and we develop recommendations about best practices for conducting online privacy and security surveys.

Citations (25)

Summary

  • The paper replicates earlier studies to reveal significant degradation in MTurk data quality relative to national survey benchmarks.
  • It employs replication, raking, and various attention checks to assess the strengths and limitations of current survey methodologies.
  • Conversely, Prolific demonstrates superior external validity, indicating its potential as a more reliable platform for privacy and security research.

External Validity of Online Privacy and Security Surveys

The paper "How Well Do My Results Generalize Now? The External Validity of Online Privacy and Security Surveys" (2202.14036) explores the external validity of online surveys conducted via platforms like Amazon Mechanical Turk (MTurk) and Prolific, specifically concerning privacy and security topics. It attempts to replicate prior findings and assess current best practices in survey methodologies.

Introduction

Online crowdsourcing platforms have become vital tools in privacy and security research, yet their external validity—whether results generalize to the broader population—remains a critical issue. Researchers traditionally rely on MTurk, which had shown some degree of representativeness for U.S. respondents under 50 or those with higher education levels. However, shifts in platform dynamics and participant demographics necessitate a reevaluation.

Methodology

The study replicates prior work by conducting surveys on MTurk and Prolific (both representative and gender-balanced samples), comparing results to the Pew Research Center's probabilistic survey. Several factors are investigated, including attention check effectiveness, demographic weighting (raking), and platform-specific settings.
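Raking (iterative proportional fitting) reweights respondents so that the weighted sample's marginal distributions match known population marginals. A minimal sketch of the idea, assuming two demographic variables with invented category labels and population targets (the paper's actual strata and targets differ):

```python
# Raking sketch: alternately rescale weights so each demographic
# variable's weighted marginal matches its population target.
from collections import Counter

def rake(rows, targets, iters=50):
    """rows: list of dicts of demographics; targets: {var: {category: share}}.
    Returns one weight per row."""
    w = [1.0] * len(rows)
    for _ in range(iters):
        for var, shares in targets.items():
            # Current weighted share of each category for this variable.
            totals = Counter()
            for r, wi in zip(rows, w):
                totals[r[var]] += wi
            total = sum(totals.values())
            # Scale each row's weight toward this variable's target marginal.
            for i, r in enumerate(rows):
                current = totals[r[var]] / total
                w[i] *= shares[r[var]] / current
    return w

# Illustrative four-respondent sample and made-up population targets.
rows = [{"age": "18-49", "edu": "college"},
        {"age": "18-49", "edu": "no_college"},
        {"age": "50+",   "edu": "college"},
        {"age": "50+",   "edu": "no_college"}]
targets = {"age": {"18-49": 0.55, "50+": 0.45},
           "edu": {"college": 0.35, "no_college": 0.65}}
weights = rake(rows, targets)
```

After raking, weighted summary statistics are computed with these weights in place of uniform ones; as the paper notes, this corrects demographic skew but cannot fix response-quality problems.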

MTurk Results Degradation

Analysis reveals a significant decline in MTurk's data quality and generalizability. Whereas prior work found representativeness under certain demographic constraints, the current study finds statistically significant discrepancies in nearly every question category, indicating lower reliability today. Mitigations such as raking and attention checks modestly improve quality but fail to restore the earlier validity levels (see Figure 1).

Figure 1: Behavior questions. Mean total variation distance (TVD) from the Pew benchmark: .22 (representative sample), .27 (gender-balanced sample).
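The TVD metric reported in Figure 1 is half the L1 distance between two discrete answer distributions: 0 means the sample's answers match the benchmark exactly, and 1 means they never overlap. A minimal sketch (the answer options and proportions below are invented for illustration, not drawn from the paper):

```python
# Total variation distance (TVD) between a crowdsourced sample's answer
# distribution and a benchmark (e.g. Pew) distribution for one question.
# Both distributions must cover the same answer options and sum to 1.
def tvd(p, q):
    """Half the L1 distance between two discrete distributions."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

# Hypothetical proportions for one yes/no/unsure question.
sample = {"yes": 0.62, "no": 0.30, "unsure": 0.08}
pew    = {"yes": 0.48, "no": 0.42, "unsure": 0.10}

print(round(tvd(sample, pew), 2))  # 0.14
```

Lower per-question TVDs, averaged within a category, are what the paper uses to compare platforms against the probabilistic benchmark.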

Prolific's Superior Generalizability

Conversely, Prolific demonstrates higher external validity, especially in user experience and perception categories. Both representative and gender-balanced samples show closer approximation to national data, although knowledge-based and social media behavior questions exhibit deviations, likely reflecting the platform's tech-savvy user base.

Data Quality Measures and Their Impact

Raking and free-response attention checks improve sample quality only marginally, while reading-based checks and CAPTCHAs offer little benefit. This suggests that Prolific's existing participant filters are already sufficient to maintain acceptable data quality.
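A free-response attention check typically embeds an instruction (e.g. "include the word X in your answer") and excludes respondents who ignore it. A hypothetical scoring function for such a check (the required word and responses are invented for illustration):

```python
# Hypothetical free-response attention check: the prompt instructs
# respondents to include a specific word, and answers that omit it
# (or are empty) are flagged for exclusion.
def passes_free_response_check(answer, required_word="purple"):
    text = answer.strip().lower()
    return bool(text) and required_word in text

responses = ["I would pick purple as instructed.",
             "Good survey.",
             ""]
kept = [r for r in responses if passes_free_response_check(r)]
```

Keyword matching like this is deliberately coarse; in practice researchers often review flagged responses by hand before excluding them.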

Implications for Survey Design

Researchers should prefer Prolific when designing privacy and security surveys to ensure data robustness. Furthermore, the choice between a representative sample and a gender-balanced sample may hinge on logistical constraints rather than substantive differences in data quality.

Conclusion

The study underscores a paradigm shift in the optimal platforms for privacy and security research. Prolific offers more reliable demographic representation, though researchers must remain cognizant of biases inherent in user base composition. Practices such as avoiding unnecessary attention checks can streamline survey execution without sacrificing data validity.
