Where You Are Is Who You Are: User Identification by Matching Statistics (1512.02896v1)

Published 9 Dec 2015 in cs.LG, cs.CR, cs.SI, stat.AP, and stat.ML

Abstract: Most users of online services have unique behavioral or usage patterns. These behavioral patterns can be exploited to identify and track users by using only the observed patterns in the behavior. We study the task of identifying users from statistics of their behavioral patterns. Specifically, we focus on the setting in which we are given histograms of users' data collected during two different experiments. We assume that, in the first dataset, the users' identities are anonymized or hidden and that, in the second dataset, their identities are known. We study the task of identifying the users by matching the histograms of their data in the first dataset with the histograms from the second dataset. In recent works, the optimal algorithm for this user identification task is introduced. In this paper, we evaluate the effectiveness of this method on three different types of datasets and in multiple scenarios. Using datasets such as call data records, web browsing histories, and GPS trajectories, we show that a large fraction of users can be easily identified given only histograms of their data; hence these histograms can act as users' fingerprints. We also verify that simultaneous identification of users achieves better performance compared to one-by-one user identification. We show that using the optimal method for identification gives higher identification accuracy than heuristics-based approaches in practical scenarios. The accuracy obtained under this optimal method can thus be used to quantify the maximum level of user identification that is possible in such settings. We show that the key factors affecting the accuracy of the optimal identification algorithm are the duration of the data collection, the number of users in the anonymized dataset, and the resolution of the dataset. We analyze the effectiveness of k-anonymization in resisting user identification attacks on these datasets.

Summary

User Identification via Behavioral Pattern Statistics

The paper "Where You Are Is Who You Are: User Identification by Matching Statistics" studies whether individuals can be identified solely from statistics of their behavioral patterns. It focuses on the case where each user's data is available only as a histogram, once in a dataset with anonymized identities and once in a dataset with known identities. Rather than introducing a new method, the authors evaluate the optimal matching algorithm from recent work on three different types of datasets to determine how well users can be identified from behavioral statistics alone.

Key Insights and Methodology

The paper's central hypothesis posits that users exhibit unique interaction patterns that can be exploited to deduce identities from anonymized data, essentially treating these patterns as digital fingerprints. The paper focuses on scenarios where users' data is presented as histograms, devoid of direct identifiers, yet rich enough to support identification when auxiliary information is available. The identification process involves matching histograms from two distinct experiments, one with anonymized user data and the other with labeled data.
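To make the histogram representation concrete, the following minimal sketch turns a raw event log into a normalized histogram over a fixed set of bins. The antenna IDs, the log, and the function name `build_histogram` are hypothetical illustrations, not taken from the paper.

```python
from collections import Counter

def build_histogram(events, bins):
    """Return the normalized histogram of `events` over the ordered list `bins`."""
    counts = Counter(events)
    total = sum(counts[b] for b in bins)
    if total == 0:
        raise ValueError("no events fall into the given bins")
    # One relative frequency per bin, in the order given by `bins`.
    return [counts[b] / total for b in bins]

# Hypothetical antenna IDs and one user's anonymized event log.
antennas = ["a1", "a2", "a3"]
anonymized_log = ["a1", "a2", "a1", "a1"]
hist = build_histogram(anonymized_log, antennas)
print(hist)  # [0.75, 0.25, 0.0]
```

Note that the histogram discards the order and timing of events; only how often each bin was visited remains.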

The identification algorithm, introduced in recent work, casts the problem as a minimum-weight maximal matching on a weighted bipartite graph, with the anonymized histograms on one side and the labeled histograms on the other. Edge weights are computed from the Kullback-Leibler divergence between histograms, and the resulting matching is optimal in the sense of an asymptotic analysis of the error probabilities as the amount of data grows. The paper also verifies that identifying all users simultaneously performs better than identifying them one at a time.
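The difference between joint and one-by-one identification can be illustrated with a small sketch. It rests on several assumptions: the edge weight is taken to be the plain divergence D(anonymized || labeled), which may differ from the exact weight used in the paper; the matching is found by brute force over all permutations, which is only practical for a handful of users; and the counts are made-up toy data.

```python
import math
from itertools import permutations

def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) in nats; bins with p == 0 contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def cost_matrix(anon, known):
    # cost[i][j] = assumed weight of the edge (anonymized user i, labeled user j).
    return [[kl_divergence(a, k) for k in known] for a in anon]

def match_one_by_one(cost):
    # Each anonymized user independently picks the closest labeled user.
    return [min(range(len(row)), key=row.__getitem__) for row in cost]

def match_jointly(cost):
    # Brute-force minimum-weight perfect matching (square cost matrix assumed).
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)

# Toy data: labeled users 0 and 1, and their anonymized counterparts.
known = [normalize([60, 30, 10]), normalize([40, 40, 20])]
anon = [normalize([50, 35, 15]), normalize([40, 40, 20])]

cost = cost_matrix(anon, known)
print(match_one_by_one(cost))  # [1, 1]
print(match_jointly(cost))     # [0, 1]
```

In this toy case the first anonymized histogram is slightly closer to labeled user 1, so one-by-one matching assigns both anonymized users to the same identity. The joint matching assigns each identity at most once and therefore recovers the correct pairing. For realistic dataset sizes, a polynomial-time assignment solver such as the Hungarian algorithm would replace the brute-force search.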

Empirical Evaluation

The identification method was evaluated on three datasets: call data records (CDR), web browsing histories (WBH), and GeoLife GPS traces. The datasets differ in nature, covering mobility patterns in urban areas, diverse web browsing behavior, and GPS-based tracking over months or years. The optimal algorithm achieved higher identification accuracy than heuristics-based approaches, confirming that anonymized histograms alone retain enough information to identify a large fraction of users.

For example, in the CDR dataset, where user movement is traced through GSM antennas, the identification accuracy was 21.1% among nearly 47,000 users, a result achieved without relying on time-dependent data. The researchers further demonstrated that the number of users, the resolution of the data, and the duration over which statistics are collected strongly influence identification accuracy.

Implications for Privacy

The paper also examines what this identification method implies for privacy, analyzing countermeasures such as k-anonymization and data coarsening (obfuscation). Although such methods reduce data granularity, the paper indicates that a significant fraction of users can still be identified under moderate levels of data distortion, which suggests that these anonymization strategies may offer limited protection on their own.
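As a rough illustration of data coarsening, the sketch below merges consecutive histogram bins into larger ones. It assumes that bins are ordered so that neighboring bins correspond to neighboring locations; the function name `coarsen` and the counts are hypothetical and do not reproduce the paper's procedure.

```python
def coarsen(hist, group_size):
    """Merge every `group_size` consecutive bins by summing their counts."""
    return [sum(hist[i:i + group_size]) for i in range(0, len(hist), group_size)]

# Two users who are clearly different at fine resolution ...
user_a = [3, 1, 0, 4]
user_b = [1, 3, 4, 0]

# ... become indistinguishable once pairs of bins are merged.
coarse_a = coarsen(user_a, 2)
coarse_b = coarsen(user_b, 2)
print(coarse_a, coarse_b)  # [4, 4] [4, 4]
```

Lowering the resolution removes the differences that matching relies on, at the cost of making the released data less useful. This is consistent with the paper's observation that dataset resolution is one of the key factors affecting identification accuracy.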

Future Directions

This research paves the way for further work on privacy-preserving data analytics that accounts for how identifying the statistics of user behavior can be. It also suggests refining the algorithms for higher accuracy and lower computational complexity, especially as datasets grow in size and diversity.

Emerging applications in personalized services, targeted advertising, and cybersecurity could benefit from more sophisticated models incorporating deeper understanding of user behaviors. In particular, artificial intelligence frameworks might integrate these statistical identification techniques to enhance model robustness and personalization while preserving privacy.

Overall, the paper presents a compelling examination of user identification through statistical analysis, highlighting new avenues for research in digital privacy and user data analytics. As datasets grow in complexity and volume, ensuring both identification accuracy and data privacy will become increasingly critical.
