
Home Location Identification of Twitter Users (1403.2345v1)

Published 7 Mar 2014 in cs.SI, cs.CL, and cs.CY

Abstract: We present a new algorithm for inferring the home location of Twitter users at different granularities, including city, state, time zone or geographic region, using the content of users' tweets and their tweeting behavior. Unlike existing approaches, our algorithm uses an ensemble of statistical and heuristic classifiers to predict locations and makes use of a geographic gazetteer dictionary to identify place-name entities. We find that a hierarchical classification approach, where time zone, state or geographic region is predicted first and city is predicted next, can improve prediction accuracy. We have also analyzed movement variations of Twitter users, built a classifier to predict whether a user was travelling in a certain period of time and used that to further improve the location detection accuracy. Experimental evidence suggests that our algorithm works well in practice and outperforms the best existing algorithms for predicting the home location of Twitter users.

Citations (211)

Summary

  • The paper develops an ensemble approach that combines tweet content and user behavior, achieving 64% city-level accuracy, with further gains when filtering travelers.
  • It employs a geographic gazetteer for enhanced place-name recognition and a hierarchical framework that refines location predictions from state to city level.
  • The study demonstrates that integrating temporal tweeting patterns and mobility data significantly improves prediction accuracy across multiple geographic granularities.

Home Location Identification of Twitter Users

This paper introduces an algorithm for inferring the home location of Twitter users from tweet content and tweeting behavior. It predicts location at several granularities (city, state, time zone, and geographic region) using an ensemble of statistical and heuristic classifiers. Two features distinguish the work: a geographic gazetteer dictionary used to recognize place-name entities in tweets, and a hierarchical classification scheme that predicts a coarse location before a fine one.
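
As a rough illustration of gazetteer-based place-name matching, the sketch below counts dictionary hits in a user's tweets. It is not the paper's implementation: the `GAZETTEER` entries, the `place_mentions` function, and the longest-match-first rule are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer: lowercase place name -> canonical city label.
# A real gazetteer would hold many thousands of entries.
GAZETTEER = {
    "san francisco": "San Francisco, CA",
    "sf": "San Francisco, CA",
    "boston": "Boston, MA",
    "new york": "New York, NY",
    "nyc": "New York, NY",
}
MAX_NGRAM = 2  # longest place name in the toy gazetteer, in tokens


def place_mentions(tweets):
    """Count gazetteer matches across tweets, trying longer names first."""
    counts = Counter()
    for tweet in tweets:
        tokens = re.findall(r"[a-z]+", tweet.lower())
        i = 0
        while i < len(tokens):
            for n in range(MAX_NGRAM, 0, -1):
                if i + n > len(tokens):
                    continue  # not enough tokens left for an n-gram
                phrase = " ".join(tokens[i:i + n])
                if phrase in GAZETTEER:
                    counts[GAZETTEER[phrase]] += 1
                    i += n  # skip past the matched name
                    break
            else:
                i += 1  # no match starting at this token
    return counts


mentions = place_mentions(
    ["Back in San Francisco!", "sf fog again", "Missing Boston"]
)
```

The resulting counts could serve as one input to a heuristic classifier, for example by predicting the most frequently mentioned city.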

Key contributions include the development of an ensemble approach combining content-based and behavior-based classifiers, as well as an exploration of user mobility as an additional factor influencing prediction accuracy. The ensemble leverages classifiers trained on various features extracted from tweet content, such as words, hashtags, and geographical references, alongside user behavior patterns like tweeting frequency. The hierarchical approach first predicts broader location categories such as time zone or state and then refines predictions to the city level, resulting in improved accuracy compared to flat classification methods.
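
The two-stage idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the combining rule (a simple majority vote), the `CITY_STATE` map, and the three toy classifiers are hypothetical stand-ins for the statistical and heuristic classifiers in the paper.

```python
from collections import Counter

# Hypothetical city -> state map used to restrict the second stage.
CITY_STATE = {
    "Boston": "MA",
    "Cambridge": "MA",
    "San Francisco": "CA",
    "Los Angeles": "CA",
}


def majority_vote(predictions):
    """Return the most common non-None label, or None if nobody voted."""
    votes = Counter(p for p in predictions if p is not None)
    return votes.most_common(1)[0][0] if votes else None


# Toy stand-ins for real classifiers (assumptions for illustration only).
def state_by_words(user):
    text = user["text"].lower()
    if "red sox" in text:
        return "MA"
    if "giants" in text:
        return "CA"
    return None


def state_by_profile(user):
    return user.get("profile_state")


def city_by_mentions(user, candidates):
    text = user["text"].lower()
    scores = {c: text.count(c.lower()) for c in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def predict_hierarchical(user, state_classifiers, city_classifiers):
    """Predict state first, then choose a city only within that state."""
    state = majority_vote(clf(user) for clf in state_classifiers)
    # Fall back to all cities if the first stage produced no state.
    candidates = [c for c, s in CITY_STATE.items() if s == state] or list(CITY_STATE)
    city = majority_vote(clf(user, candidates) for clf in city_classifiers)
    return state, city


user = {
    "text": "Red Sox tonight, then Cambridge. Love Cambridge. "
            "San Francisco San Francisco San Francisco",
    "profile_state": "MA",
}
result = predict_hierarchical(
    user, [state_by_words, state_by_profile], [city_by_mentions]
)
flat_city = city_by_mentions(user, list(CITY_STATE))
```

In this toy case a flat classifier over all cities picks San Francisco because it is mentioned most often, while the hierarchical version first settles on MA and then picks Cambridge. That is the kind of error the coarse-to-fine restriction is meant to prevent.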

The algorithm was evaluated on a dataset of 1.52 million tweets from 9,551 users across major U.S. cities. The results indicate substantial improvements over existing algorithms: city-level prediction accuracy reached 64%, a notable increase over previous benchmarks, while state, time-zone, and region prediction accuracies were reported at 66%, 78%, and 71%, respectively. Accounting for user mobility helped further. When traveling users were identified and excluded, city-level accuracy rose to 68%, with gains at the other granularities as well.

Several further findings emerged from the research, including the effect of geo-tagged tweets and explicit location mentions on prediction accuracy, and the role of temporal tweeting behavior in improving time-zone classification. These results suggest that temporal and contextual features are worth integrating into location prediction systems.
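
One plausible way to represent temporal tweeting behavior is a normalized histogram of tweet volume by hour of day in UTC, since users in different time zones tend to peak at different UTC hours. The sketch below is an assumption about how such a feature might be built, not the paper's exact feature definition, and `hourly_profile` is a hypothetical name.

```python
from datetime import datetime, timezone


def hourly_profile(timestamps):
    """Return a 24-bin distribution of tweet volume by UTC hour.

    Expects timezone-aware datetimes; naive ones would be interpreted
    as local time by astimezone(), which is rarely what is wanted.
    """
    counts = [0] * 24
    for ts in timestamps:
        counts[ts.astimezone(timezone.utc).hour] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 24
    return [c / total for c in counts]


tweet_times = [
    datetime(2014, 3, 7, 14, 5, tzinfo=timezone.utc),
    datetime(2014, 3, 7, 14, 40, tzinfo=timezone.utc),
    datetime(2014, 3, 8, 2, 15, tzinfo=timezone.utc),
    datetime(2014, 3, 8, 23, 59, tzinfo=timezone.utc),
]
profile = hourly_profile(tweet_times)
```

A vector like `profile` could then be fed to a time-zone classifier alongside the content-based features.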

The implications of this research are twofold. Practically, the algorithm can enhance applications that rely on location data, such as event monitoring and localized marketing strategies. Theoretically, it highlights the importance of ensemble approaches and hierarchical frameworks in handling multi-granularity prediction tasks, providing a model that could be refined and extended to additional social media platforms and content types.

Overall, the paper offers a compelling methodology for addressing the sparse availability of reliable user location data on Twitter. Future research directions include the potential integration of social network features, exploration of even finer granularity predictions, and real-time adaptation of classifiers to streaming data. Such advancements could further enhance the adaptability and accuracy of location prediction in dynamic, data-rich environments.